Empirical Bayes mean estimation with nonparametric errors via order statistic regression on replicated data

We study empirical Bayes estimation of the effect sizes of $N$ units from $K$ noisy observations on each unit. We show that it is possible to achieve near-Bayes optimal mean squared error, without any assumptions or knowledge about the effect size distribution or the noise. The noise distribution can be heteroskedastic and vary arbitrarily from unit to unit. Our proposal, which we call Aurora, leverages the replication inherent in the $K$ observations per unit and recasts the effect size estimation problem as a general regression problem. Aurora with linear regression provably matches the performance of a wide array of estimators including the sample mean, the trimmed mean, the sample median, as well as James-Stein shrunk versions thereof. Aurora automates effect size estimation for Internet-scale datasets, as we demonstrate on data from a large technology firm.


Introduction
Empirical Bayes (EB) [Efron, 2012, Robbins, 1956, 1964] and related shrinkage methods are the de facto standard for estimating effect sizes in many disciplines. In genomics, EB is used to detect differentially expressed genes when the number of samples is small [Smyth, 2004, Love et al., 2014]. In survey sampling, EB improves noisy estimates of quantities, such as average income, for small communities [Rao and Molina, 2015]. The key insight of EB is that one can often estimate unit-level quantities better by sharing information across units, rather than analyzing each unit separately.
Formally, EB models the observed data $Z = (Z_1, \ldots, Z_N)$ as arising from the following generative process:
$$\mu_i \overset{\text{iid}}{\sim} G, \qquad Z_i \mid \mu_i \sim F(\cdot \mid \mu_i), \qquad i = 1, \ldots, N. \tag{1}$$
The goal is to estimate the mean parameters $\mu_i := \mathbb{E}_F[Z_i \mid \mu_i]$, $i = 1, \ldots, N$, from the observed data $Z$. If $G$ and $F$ are fully specified, then the optimal estimator (in the sense of mean squared error) is the posterior mean $\mathbb{E}_{G,F}[\mu_i \mid Z_i]$, which achieves the Bayes risk. Empirical Bayes deals with the case where $F$ or $G$ is unknown, so the Bayes rule cannot be computed. Most modern EB methods [Jiang and Zhang, 2009, Brown and Greenshtein, 2009, Muralidharan, 2012, Saha and Guntuboyina, 2020] assume that $F$ is known (say, $F(\cdot \mid \mu_i) = \mathcal{N}(\mu_i, 1)$) and construct estimators $\hat\mu_i$ that asymptotically match the risk of the unknown Bayes rule, without making any assumptions about the unknown prior $G$.
We examine the same problem of estimating the $\mu_i$ when the likelihood $F$ is also unknown. Indeed, knowledge of $F$ is an assumption that requires substantial domain expertise. For example, it took many years for the genomics community to agree on an EB model for detecting differences in gene expression based on microarray data [Baldi and Long, 2001, Lönnstedt and Speed, 2002, Smyth, 2004]. Then, once this technology was superseded by bulk RNA-Seq, the community had to devise a new model from scratch, eventually settling on the negative binomial likelihood [Love et al., 2014, Gierliński et al., 2015].
Unfortunately, there is no way to avoid making such strong assumptions when there is no information besides the single $Z_i$ per $\mu_i$. If $F$ is even slightly underspecified, then it becomes hopeless to disentangle $F$ from $G$. To appreciate the problem, consider the Normal-Normal model:
$$G = \mathcal{N}(0, A), \qquad F(\cdot \mid \mu_i) = \mathcal{N}(\mu_i, \sigma^2). \tag{2}$$
Here, $Z_i$ is marginally distributed as $\mathcal{N}(0, A + \sigma^2)$, so the observations $Z_i$ only provide information about $A + \sigma^2$. When $\sigma^2$ is known, $A$ can be estimated by first estimating the marginal variance and subtracting $\sigma^2$. Indeed, Efron and Morris [1973] showed that by plugging a particular estimate of $A$ into the Bayes rule $\mathbb{E}_{G,F}[\mu_i \mid Z_i] = \bigl(1 - \tfrac{\sigma^2}{\sigma^2 + A}\bigr) Z_i$, one recovers the celebrated James-Stein estimator [James and Stein, 1961]. Yet, as soon as $\sigma^2$ is unknown, $A$ (and hence $G$) is unidentified, and there is no hope of approximating the unknown Bayes rule.
However, as any student of random effects knows, the Normal-Normal model (2) becomes identifiable if we simply have independent replicates $Z_{ij}$ for each unit $i$. The driving force behind this work is an analogous observation in the context of empirical Bayes estimation: replication makes it possible to estimate $\mu_i$ with no assumptions on $F$ or $G$ whatsoever. The method we propose, described in the next section, performs well in practice and nearly matches the risk of the Bayes rule, which depends on the unknown $F$ and $G$.
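To see concretely how replication restores identifiability in the Normal-Normal model (2), here is a minimal sketch (not part of the Aurora method itself) that recovers both the noise variance $\sigma^2$ and the prior variance $A$ from replicated data via a standard random-effects moment decomposition; the function and variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 20_000, 5
A_true, sigma2_true = 2.0, 4.0

# Normal-Normal model (2) with K replicates per unit.
mu = rng.normal(0.0, np.sqrt(A_true), size=N)
Z = mu[:, None] + rng.normal(0.0, np.sqrt(sigma2_true), size=(N, K))

# Within-unit variability identifies sigma^2 ...
sigma2_hat = Z.var(axis=1, ddof=1).mean()
# ... and the variance of the unit-level means identifies A + sigma^2 / K.
A_hat = Z.mean(axis=1).var(ddof=1) - sigma2_hat / K

print(f"sigma2_hat ~ {sigma2_hat:.2f}, A_hat ~ {A_hat:.2f}")
```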

The Aurora Method
First, we formally specify the EB model when replicates $Z_{ij} \in \mathbb{R}$ are available.
$$(\mu_i, \alpha_i) \overset{\text{iid}}{\sim} G, \qquad Z_{ij} \mid \mu_i, \alpha_i \overset{\text{iid}}{\sim} F(\cdot \mid \mu_i, \alpha_i), \qquad j = 1, \ldots, K. \tag{3}$$
Again, the quantity of interest is the mean parameter $\mu_i := \mathbb{E}_F[Z_{ij} \mid \mu_i, \alpha_i]$. The additional parameter $\alpha_i$ is a nuisance parameter that allows for heterogeneity across the units, while preserving exchangeability [Galvao and Kato, 2014, Okui and Yanagi, 2020]. For example, $\alpha_i$ is commonly taken to be the conditional variance $\sigma_i^2 := \mathrm{Var}_F[Z_{ij} \mid \mu_i, \alpha_i]$ to allow for heteroskedasticity. However, $\alpha_i$ could even be infinite-dimensional, for instance, a random element from a space of distributions. The $\alpha_i$ have no impact on our estimation strategy and are purely a technical device.

[Figure 1: We draw $\mu_i \sim G$, where $G = \mathcal{N}(0.5, 4)$ in the left panel and $G$ is the uniform distribution on the discrete set $\{-3, 0, 3\}$ in the right panel. Then, for each $i$, we draw $X_i, Y_i \mid \mu_i \overset{\text{iid}}{\sim} \mathcal{N}(\mu_i, 1)$ and plot the points $(X_i, Y_i)$. The line shows the posterior mean $\mathbb{E}[\mu_i \mid X_i]$, which in light of (4) is identical to the conditional mean $\mathbb{E}[Y_i \mid X_i]$.]
Given data from model (3), one approach would be to collapse the replicates into a single observation per unit, say, by taking their mean, which would bring us back to the setting of model (1). We could then appeal to the Central Limit Theorem to justify treating the likelihood as Normal. An important message of this paper is that we can do better by using the replicates directly.

Proposed Method
Since we have replicates, we can split the $Z_{ij}$ into two groups for each $i$. First, consider the case of $K = 2$ replicates, so that we can write $(X_i, Y_i)$ for $(Z_{i1}, Z_{i2})$. Now, $X_i$ and $Y_i$ are conditionally independent given $(\mu_i, \alpha_i)$. Figure 1 illustrates the relationship between $X_i$ and $Y_i$ under two different settings. The key insight is that the conditional mean $\mathbb{E}_{G,F}[Y_i \mid X_i]$ is (almost surely) identical to the posterior mean $\mathbb{E}_{G,F}[\mu_i \mid X_i]$, by the following elementary calculation [Krutchkoff, 1967] (for convenience, we suppress the dependence of the expected values on $G, F$):
$$\mathbb{E}[Y_i \mid X_i] = \mathbb{E}\bigl[\mathbb{E}[Y_i \mid X_i, \mu_i, \alpha_i] \mid X_i\bigr] = \mathbb{E}\bigl[\mathbb{E}[Y_i \mid \mu_i, \alpha_i] \mid X_i\bigr] = \mathbb{E}[\mu_i \mid X_i]. \tag{4}$$
This suggests that we can estimate the Bayes rule based on $X_i$ (i.e., the posterior mean $\mathbb{E}[\mu_i \mid X_i]$) by simply regressing $Y_i$ on $X_i$ using any black-box predictive model, such as a local averaging smoother. Let $\hat m(\cdot)$ be the fitted regression function; our estimate of each $\mu_i$ is then just $\hat\mu_i = \hat m(X_i)$.
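A minimal numerical sketch of this idea for $K = 2$ (our own illustration; the k-nearest-neighbor smoother from scikit-learn stands in for an arbitrary black-box regressor):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
N = 50_000

# Two replicates per unit: G = N(0.5, 4), noise N(0, 1).
mu = rng.normal(0.5, 2.0, size=N)
X = mu + rng.normal(size=N)   # replicate used as the "covariate"
Y = mu + rng.normal(size=N)   # held-out replicate used as the "response"

# Regress Y on X; by identity (4), the fit targets E[mu_i | X_i].
smoother = KNeighborsRegressor(n_neighbors=200).fit(X[:, None], Y)
mu_hat = smoother.predict(X[:, None])

# Compare against using X alone and against the true Normal-Normal posterior mean.
bayes = 0.5 + (4 / (4 + 1)) * (X - 0.5)
print("MSE of X alone        :", np.mean((X - mu) ** 2))
print("MSE of regression fit :", np.mean((mu_hat - mu) ** 2))
print("MSE of Bayes rule     :", np.mean((bayes - mu) ** 2))
```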
To extend this method to $K > 2$, we again split the replicates $Z_i$ into two parts:
$$X_i := X_i(j) := (Z_{i1}, \ldots, Z_{i(j-1)}, Z_{i(j+1)}, \ldots, Z_{iK}), \qquad Y_i := Y_i(j) := Z_{ij}, \tag{5}$$
where $j \in \{1, \ldots, K\}$ is an arbitrary fixed index for now. (We suppress $j$ in the notation below.) Now, one approach is to summarize the vector $X_i$ by the mean of its values $\bar X_i$ and regress $Y_i$ on $\bar X_i$ to learn $\mathbb{E}[\mu_i \mid \bar X_i]$ [Coey and Cunningham, 2019]. This works for essentially the same reason as the $K = 2$ case. However, unless $\bar X_i$ is sufficient for $(\mu_i, \alpha_i)$ in model (3), $\mathbb{E}[\mu_i \mid \bar X_i]$ will be different from, and suboptimal to, $\mathbb{E}[\mu_i \mid X_i]$. The rationale is contained in the following result.

Proposition 1. Let $Z_{ij}$ be generated according to (3) and assume that $\mathbb{E}[|\mu_i|] < \infty$, $\mathbb{E}[|Z_{ij}|] < \infty$. Define $X_i$ and $Y_i$ as in (5), and let $X_i^{(\cdot)}$ be the vector of order statistics of $X_i$. Then, almost surely,
$$\mathbb{E}\bigl[Y_i \mid X_i^{(\cdot)}\bigr] = \mathbb{E}\bigl[\mu_i \mid X_i^{(\cdot)}\bigr] = \mathbb{E}\bigl[\mu_i \mid X_i\bigr]. \tag{6}$$

Proof. The same argument as (4) shows that $\mathbb{E}[Y_i \mid X_i^{(\cdot)}] = \mathbb{E}[\mu_i \mid X_i^{(\cdot)}]$ almost surely. Now, under exchangeable sampling, the order statistics $X_i^{(\cdot)}$ are sufficient for $(\mu_i, \alpha_i)$, and therefore $\mathbb{E}[\mu_i \mid X_i^{(\cdot)}] = \mathbb{E}[\mu_i \mid X_i]$ almost surely, with no assumptions on $F$.

Equation (6) suggests that we should regress $Y_i$ on the order statistics $X_i^{(\cdot)}$ to learn a function $\hat m(\cdot): \mathbb{R}^{K-1} \to \mathbb{R}$ that approximates the posterior mean. The fitted values $\hat m(X_i^{(\cdot)})$ can then be used to estimate $\mu_i$. There is one more detail worth mentioning: the estimate $\hat m(X_i^{(\cdot)})$ depends on the arbitrary choice of $j$ in (5). To reduce the variance of the estimate, we average over all choices of $j \in \{1, \ldots, K\}$. This method, summarized in Table 1, is called Aurora, which stands for "Averages of Units by Regressing on Ordered Replicates Adaptively."

Aurora: "Averages of Units by Regressing on Ordered Replicates Adaptively."
For $j \in \{1, \ldots, K\}$:
1. Split the replicates for each unit, $Z_i$, into $X_i := (Z_{i1}, \ldots, Z_{i(j-1)}, Z_{i(j+1)}, \ldots, Z_{iK})$ and $Y_i := Z_{ij}$, as in (5).
2. For each $X_i$, order the values to obtain $X_i^{(\cdot)}$.
3. Regress $Y_i$ on $X_i^{(\cdot)}$ using any black-box predictive model. Let $\hat m_j$ be the fitted regression function.
4. Let $\hat\mu^{\mathrm{Aur}}_{i,j} := \hat m_j(X_i^{(\cdot)})$.
end
Estimate each $\mu_i$ by $\hat\mu^{\mathrm{Aur}}_i := \frac{1}{K}\sum_{j=1}^{K} \hat\mu^{\mathrm{Aur}}_{i,j}$.

Table 1: A summary of Aurora, the proposed method for estimating the means $\mu_i$ when the data come from model (3).
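A compact sketch of the full Aurora loop of Table 1, with the black-box regressor left pluggable (our own illustration; the names and the scikit-learn regressor are assumptions, not the paper's reference implementation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def aurora(Z, make_regressor=LinearRegression):
    """Aurora estimates of mu_i from an (N, K) array of replicates Z."""
    N, K = Z.shape
    fits = np.zeros((N, K))
    for j in range(K):
        Y = Z[:, j]                              # held-out response replicate
        X = np.delete(Z, j, axis=1)              # remaining K - 1 replicates
        X_ord = np.sort(X, axis=1)               # order statistics X_i^(.)
        m_j = make_regressor().fit(X_ord, Y)     # black-box regression step
        fits[:, j] = m_j.predict(X_ord)          # mu^Aur_{i,j}
    return fits.mean(axis=1)                     # average over held-out j

# Example usage on synthetic data from model (3) with Laplace noise (variance 4):
rng = np.random.default_rng(2)
N, K = 20_000, 10
mu = rng.normal(0.5, 2.0, size=N)
Z = mu[:, None] + rng.laplace(scale=np.sqrt(2), size=(N, K))
mu_hat = aurora(Z)
print("MSE of sample mean:", np.mean((Z.mean(axis=1) - mu) ** 2))
print("MSE of Aurora     :", np.mean((mu_hat - mu) ** 2))
```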

Related work
The Aurora method is closely related to the extensive literature on empirical Bayes, which we cite throughout this paper. One work worth emphasizing is Stigler [1990], who motivated empirical Bayes estimators, like James-Stein, through the lens of regression to the mean; for us, regression is not just a motivation but the estimation strategy itself. We were surprised to find an idea similar to Aurora in a largely forgotten manuscript, uncited to date, that Johns [1986] contributed to a symposium for Herbert Robbins [Van Ryzin, 1986]. Although Johns [1986] used a fairly complex predictive model (projection pursuit regression [Friedman and Stuetzle, 1981]), we show theoretically (Sections 4, 5) and empirically (Sections 6, 7) that the order statistics encode enough structure that linear regression and k-nearest-neighbor regression suffice as the predictive model.
Models similar to (3), with replicated noisy measurements of unobservable random quantities, have been studied in the context of deconvolution and errors-in-variables regression [Devanarayan and Stefanski, 2002, Schennach, 2004]. In econometrics, panel data with random effects are often modelled as in (3), with the additional potential complication of time dependence, i.e., $j = 1, \ldots, K$ indexes time, while $i = 1, \ldots, N$ may correspond to different geographic regions [Horowitz and Markatou, 1996, Hall and Yao, 2003, Neumann, 2007, Jochmans and Weidner, 2018]. Fithian and Ting [2017] use observations from model (3) to learn a low-dimensional smooth parametric model that describes the data well. However, their goal is testing, while Aurora is geared toward mean estimation.
The Aurora method is also related to a recent line of research that leverages black-box prediction methods to solve statistical tasks that are not predictive in nature. For example, Chernozhukov et al. [2017] consider inference for low-dimensional causal quantities when high-dimensional nuisance components are estimated by machine learning. Boca and Leek [2018] reinterpret the multiple testing problem in the presence of informative covariates as a regression problem and estimate the proportion of null hypotheses conditionally on the covariates. The estimated conditional proportion of null hypotheses may then be used in downstream multiple testing methods [Ignatiadis and Huber, 2018]. Black-box regression models can also be used to improve empirical Bayes point estimates in the presence of side information [Ignatiadis and Wager, 2019].
Finally, a crucial ingredient in Aurora is data splitting, which is a classical idea in statistics [Cox, 1975], typically used to ensure honest inference for low-dimensional parameters. In the context of simultaneous inference, Rubin, Dudoit, and Van der Laan [2006] and Habiger and Peña [2014] use data splitting to improve power in multiple testing.

Properties of the Aurora estimator
We provide theoretical guarantees for the general Aurora estimator described in Section 2. Throughout this section we split $Z_i$ into $X_i(j)$ and $Y_i(j)$ as in (5), and we write $X_i^{(\cdot)}(j)$ for the order statistics of $X_i(j)$. We also write $Z$ and $X^{(\cdot)}(j)$ for the concatenation of all the $Z_i$ and $X_i^{(\cdot)}(j)$, respectively. As before, we omit $j$ whenever a definition does not depend on the specific choice of $j$.

Three oracle benchmarks
In this section we define three benchmarks for assessing the quality of a mean estimator in model (3) (in terms of mean squared error). These benchmarks provide the context needed to interpret the theoretical guarantees of Aurora that we establish in Section 4.2.
First, it is impossible to improve on the Bayes rule, so the Bayes risk serves as an oracle. We denote the Bayes risk (based on all $K$ replicates) by
$$R_K(G, F) := \mathbb{E}\bigl[(\mathbb{E}[\mu_i \mid Z_i] - \mu_i)^2\bigr].$$
As explained in Proposition 1, the function we seek to mimic (for each $j$ in the loop of the Aurora algorithm in Table 1) is the oracle Bayes rule based on the order statistics of $K - 1$ replicates,
$$m^*(X_i^{(\cdot)}(j)) := \mathbb{E}\bigl[\mu_i \mid X_i^{(\cdot)}(j)\bigr].$$
The risk of $m^*(\cdot)$ is the Bayes risk based on $K - 1$ replicates,
$$R_{K-1}(G, F) := \mathbb{E}\bigl[(m^*(X_i^{(\cdot)}) - \mu_i)^2\bigr].$$
In the Aurora algorithm we average over the choice of held-out replicate $j$. Thus, we also define an oracle rule that averages $m^*(X_i^{(\cdot)}(j))$ over all $j$. We use the following notation for this oracle and its risk:
$$\bar m^*(Z_i) := \frac{1}{K}\sum_{j=1}^{K} m^*(X_i^{(\cdot)}(j)), \qquad \bar R_{K-1}(G, F) := \mathbb{E}\bigl[(\bar m^*(Z_i) - \mu_i)^2\bigr]. \tag{10}$$
It is immediate that $R_K(G, F) \le R_{K-1}(G, F)$. Also, by the definition of $\bar m^*$ in (10) and by Jensen's inequality, it holds that $\bar R_{K-1}(G, F) \le R_{K-1}(G, F)$. This bound can be improved through the following insight: $m^*(X_i^{(\cdot)}(j))$, $j = 1, \ldots, K$, are jackknife estimates of the posterior mean $\mathbb{E}[\mu_i \mid Z_i]$, and $\bar m^*(Z_i)$ is their average. By a fundamental result for the jackknife [Efron and Stein, 1981, Theorem 2], it holds that $\mathrm{Var}[\bar m^*(Z_i) \mid \mu_i, \alpha_i] \le \mathrm{Var}[m^*(X_i^{(\cdot)}) \mid \mu_i, \alpha_i] \cdot (K-1)/K$. Armed with this insight, in Supplement A.1 we prove that:

Proposition 2.
$$R_K(G, F) \;\le\; \bar R_{K-1}(G, F) \;\le\; R_{K-1}(G, F) - \frac{1}{K}\,\mathbb{E}\bigl[\mathrm{Var}[m^*(X_i^{(\cdot)}) \mid \mu_i, \alpha_i]\bigr]. \tag{11}$$

Remark 1. For sufficiently "regular" problems, $R_K(G, F)$ will typically be of order $O(1/K)$, while $R_{K-1}(G, F) - R_K(G, F)$ will be of order $O(1/K^2)$; we make this argument rigorous for location families in Section 5.2 and Supplement E. The correction term on the right-hand side of (11) will also be of order $O(1/K^2)$ in such problems. In our next example, we demonstrate the importance of this additional term.
Example 1 (Normal likelihood with Normal prior). We consider the Normal-Normal model (2) from the introduction with prior variance $A$ and noise variance $\sigma^2 = 1$, and with replicates $Z_{ij}$, $j = 1, \ldots, K$, as in (3). In this case, we can analytically compute that $R_K(G, F) = A/(KA + 1)$, and so it follows that $R_{K-1}(G, F) - R_K(G, F) = \Theta(1/K^2)$. The right-most inequality in (11) is an equality, and $\bar R_{K-1}(G, F) - R_K(G, F) = \Theta(1/K^4)$. We conclude that, in the Normal-Normal model, the averaged oracle $\bar m^*$ has risk closer to that of the full Bayes estimator than to that of the Bayes estimator based on $K - 1$ replicates.
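A quick Monte Carlo sketch (ours, with $A = \sigma^2 = 1$) that computes the three oracle rules for this model and compares their empirical risks; because the rules are evaluated on the same simulated data, the averaged oracle visibly lands much closer to $R_K$ than $R_{K-1}$ does:

```python
import numpy as np

rng = np.random.default_rng(3)
A, K, N = 1.0, 10, 1_000_000

mu = rng.normal(0.0, np.sqrt(A), size=N)
Z = mu[:, None] + rng.normal(size=(N, K))            # sigma^2 = 1

lam_K   = K * A / (K * A + 1)                        # Bayes shrinkage with K replicates
lam_Km1 = (K - 1) * A / ((K - 1) * A + 1)            # Bayes shrinkage with K - 1 replicates

bayes_K    = lam_K * Z.mean(axis=1)                  # E[mu_i | Z_i]
oracle_j   = np.stack([lam_Km1 * np.delete(Z, j, axis=1).mean(axis=1) for j in range(K)])
oracle_one = oracle_j[0]                             # m*(X_i^(.)(1)): single held-out split
oracle_avg = oracle_j.mean(axis=0)                   # averaged oracle bar-m*(Z_i)

mse = lambda est: np.mean((est - mu) ** 2)
print("R_K (= A/(KA+1))   :", A / (K * A + 1), "empirical:", round(mse(bayes_K), 5))
print("R_{K-1}  empirical :", round(mse(oracle_one), 5))
print("averaged oracle    :", round(mse(oracle_avg), 5))
```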

Regret bound for Aurora
In this section we derive our main regret bound for Aurora. The basic idea is that the in-sample prediction error of the regression method $\hat m$ directly translates into bounds on the estimation error for the effect sizes $\mu_i$, and thus Aurora can leverage black-box predictive models. To this end, we define the in-sample prediction error for estimating the averaged oracle $\bar m^*$,
$$\overline{\mathrm{Err}}(\bar m^*, \hat m) := \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}\Bigl[\Bigl(\bar m^*(Z_i) - \frac{1}{K}\sum_{j=1}^{K} \hat m_j\bigl(X_i^{(\cdot)}(j)\bigr)\Bigr)^2\Bigr]. \tag{12}$$
When the predictive mechanism of the Aurora algorithm (Table 1) is the same for all $j$ (see footnote 3), Jensen's inequality provides the following convenient upper bound on (12):
$$\overline{\mathrm{Err}}(\bar m^*, \hat m) \;\le\; \mathrm{Err}(m^*, \hat m) := \frac{1}{NK}\sum_{i=1}^{N}\sum_{j=1}^{K} \mathbb{E}\Bigl[\bigl(m^*(X_i^{(\cdot)}(j)) - \hat m_j(X_i^{(\cdot)}(j))\bigr)^2\Bigr], \tag{13}$$
where $\mathrm{Err}(m^*, \hat m)$ is the in-sample error from approximating $m^*$ by $\hat m$. We are ready to state our main result:

Theorem 3. The mean squared error of the Aurora estimator $\hat\mu^{\mathrm{Aur}}_i$ (described in Table 1) satisfies the following regret bound under model (3):
$$\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\bigl[(\hat\mu^{\mathrm{Aur}}_i - \mu_i)^2\bigr] \;\le\; R_K(G, F) \;+\; \underbrace{2\bigl(\bar R_{K-1}(G, F) - R_K(G, F)\bigr)}_{\text{error due to data splitting}} \;+\; \underbrace{2\,\overline{\mathrm{Err}}(\bar m^*, \hat m)}_{\text{estimation error}}.$$
Moreover, $\hat\mu^{\mathrm{Aur}}_{i,j}$ (i.e., the Aurora estimator based on a single held-out response replicate) satisfies the above regret bound with $\bar R_{K-1}(G, F)$, $\overline{\mathrm{Err}}(\bar m^*, \hat m)$ replaced by $R_{K-1}(G, F)$, $\mathrm{Err}(m^*, \hat m)$.
As elaborated in Remark 1, the second error term in the above decomposition will typically be negligible compared to $R_K(G, F)$; this is the price we pay for making no assumptions about $F$ and $G$. Hence, beyond the irreducible Bayes error, the main source of error is how well we can estimate $m^*(\cdot)$. Crucially, this error is the in-sample estimation error of $\hat m$, which is often easier to analyze and smaller in magnitude than out-of-sample estimation error [Hastie et al., 2008, Chatterjee, 2013, Rosset and Tibshirani, 2018].

Aurora with k-Nearest-Neighbors is universally consistent
Theorem 3 demonstrated that a regression model with small in-sample error translates, through Aurora, into mean estimates with small mean squared error. Now, we combine this result with results from nonparametric regression to prove that under model (3) it is possible to asymptotically (in $N$) match the oracle risk (10) and, in view of Proposition 2, to outperform the Bayes risk based on $K - 1$ replicates.
Footnote 3: That is, the map $(X_i^{(\cdot)}(j), Y_i(j))_{i=1,\ldots,N} \mapsto \hat m_j$ is the same for all $j$. This does not imply that $\hat m_j = \hat m_{j'}$ for $j \ne j'$, since the training data changes.
Theorem 4 (Universal consistency with the k-nearest-neighbor (kNN) estimator). Consider model (3) with $\mathbb{E}[\mu_i^2] < \infty$, $\mathbb{E}[Z_{ij}^2] < \infty$. We estimate $\mu_i$ with the Aurora algorithm, where $\hat m(\cdot)$ is the k-nearest-neighbor (kNN) estimator with $k = k_N \in \mathbb{N}$, i.e., the nonparametric regression estimator which predicts
$$\hat m(x) = \frac{1}{k}\sum_{i \in S_k(x)} Y_i, \qquad S_k(x) = \bigl\{ i \in \{1, \ldots, N\} : X_i^{(\cdot)} \text{ is among the } k \text{ nearest neighbors of } x \bigr\},$$
where $\|\cdot\|_2$ is the Euclidean distance. If $k = k_N$ satisfies $k \to \infty$, $k/N \to 0$ as $N \to \infty$, then Aurora asymptotically matches the averaged oracle risk of (10):
$$\limsup_{N \to \infty}\; \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\bigl[(\hat\mu^{\mathrm{Aur}}_i - \mu_i)^2\bigr] \;\le\; \bar R_{K-1}(G, F).$$
This result is a consequence of universal consistency in nonparametric regression [Stone, 1977, Györfi, Kohler, Krzyżak, and Walk, 2006]. It demonstrates that Aurora can asymptotically match the Bayes risk with substantial generality, and it suggests the power and expressivity of the Aurora algorithm.
To apply Aurora with kNN, a data-driven choice of $k$ is required. In Supplement B.2 we describe a procedure that chooses the number of nearest neighbors $k_j$ per held-out response replicate $j$ by leave-one-out (LOO) cross-validation, and we study its performance empirically in the simulations of Section 6. Nevertheless, Aurora-kNN tuned by LOO is computationally involved. This motivates the next section, in which we study a procedure that is interpretable, easy to implement, and scales well to large datasets.
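A short sketch of Aurora with a kNN regressor and a grid of candidate $k$ scored by cross-validation (our own illustration; scikit-learn's `KNeighborsRegressor` and `GridSearchCV` with 5-fold splits stand in for the leave-one-out scheme of Supplement B.2):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

def aurora_knn(Z, k_grid=(10, 30, 100, 300), cv=5):
    """Aurora with kNN regression; k is chosen separately for each held-out replicate j."""
    N, K = Z.shape
    fits = np.zeros((N, K))
    for j in range(K):
        Y = Z[:, j]
        X_ord = np.sort(np.delete(Z, j, axis=1), axis=1)   # order statistics
        search = GridSearchCV(
            KNeighborsRegressor(),
            param_grid={"n_neighbors": list(k_grid)},
            cv=cv,
            scoring="neg_mean_squared_error",
        ).fit(X_ord, Y)
        fits[:, j] = search.best_estimator_.predict(X_ord)
    return fits.mean(axis=1)
```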

Aurora with Linear Regression
In this section we analyze the Aurora algorithm when linear regression is used as the predictive model. That is, $\hat m$ is a linear function of the order statistics,
$$\hat m(X_i^{(\cdot)}) = \hat\beta_0 + \sum_{l=1}^{K-1} \hat\beta_l X_i^{(l)}, \tag{14}$$
where $\hat\beta$ are the ordinary least squares coefficients of the linear regression of $Y$ on $X^{(\cdot)}$. We call the method Auroral, with the final "l" signifying "linear," and write $\hat\mu^{\mathrm{AurL}}_i$ for the resulting estimates of $\mu_i$. Our main result is that Auroral matches the performance of the best estimator that is linear in the order statistics based on $K - 1$ replicates. To state this result formally, we first define the minimum risk among estimators in a class $\mathcal{C}$:
$$R^{\mathcal{C}}_{K-1}(G, F) := \inf\bigl\{\mathbb{E}\bigl[(\delta(X_i^{(\cdot)}) - \mu_i)^2\bigr] : \delta \in \mathcal{C}\bigr\}.$$
The class of interest to us specifically is the class of estimators linear in the order statistics:
$$\mathrm{Lin} := \mathrm{Lin}\bigl(\mathbb{R}^{K-1}\bigr) := \Bigl\{\delta : \delta(X_i^{(\cdot)}) = \beta_0 + \sum_{l=1}^{K-1}\beta_l X_i^{(l)},\; \beta \in \mathbb{R}^{K}\Bigr\}. \tag{16}$$
This is a broad class that includes all estimators of $\mu_i$ based on $X_i^{(\cdot)}$ that proceed through the following two steps: (1) summarization, where an appropriate summary statistic of the likelihood $F$ that is linear in the order statistics is computed, and (2) linear shrinkage, where the summary statistic is linearly shrunk towards a fixed location. Schematically:

Step 1 (Summarization): $X_i \mapsto T(X_i)$, e.g., the sample mean, the sample median, a trimmed mean, ...
Step 2 (Linear shrinkage): $T(X_i) \mapsto \alpha\, T(X_i) + \gamma$, e.g., James-Stein shrinkage.

The summarization step can apply non-linear functions of the original data, such as the median and the trimmed mean, because the inputs to the linear model are the order statistics, rather than the original data. Such linear combinations of the order statistics are known as L-statistics, a class so large that it includes efficient estimators of the mean in any smooth, symmetric location family, cf. Van der Vaart [2000] and Section 5.2. We now show that Auroral matches $R^{\mathrm{Lin}}_{K-1}(G, F)$ asymptotically in $N$.

Theorem 5 (Regret over linear estimators). If $m^* \in \mathrm{Lin}(\mathbb{R}^{K-1})$, then the conclusion of Theorem 3 holds for $\hat\mu^{\mathrm{AurL}}$ with the term $\overline{\mathrm{Err}}(\bar m^*, \hat m)$ bounded by $C_N K/N$ (under Assumption (i); see footnote 6), resp. $\Gamma\sqrt{K/N}$ (under Assumption (ii)).
In practice, we typically encounter $K \ll N$. Then, Theorem 5 implies that Auroral will nearly match the risk $R^{\mathrm{Lin}}_{K-1}(G, F)$ of the best L-statistic based on $K - 1$ replicates. This result is in the spirit of restricted empirical Bayes [Griffin and Krutchkoff, 1971, Maritz, 1974, Norberg, 1980, Robbins, 1983], which seeks the best estimator within a given class, such as the Bayes linear estimators of Hartigan [1969]. In these works, linearity typically refers to linearity in $X_i$ and not in $X_i^{(\cdot)}$; Lwin [1976], however, uses empirical Bayes to learn the best L-statistic from the class Lin when the likelihood is a known location-scale family.
Footnote 6: By the law of total variance, almost surely, $\mathrm{Var}[Y_i \mid X_i^{(\cdot)}] = \mathbb{E}[\sigma_i^2 \mid X_i^{(\cdot)}] + \mathrm{Var}[\mu_i \mid X_i^{(\cdot)}]$. Thus, for example, when $\mu_i, \sigma_i^2$ have bounded support, there exists $C > 0$ such that $\mathrm{Var}[Y_i \mid X_i^{(\cdot)}] \le C$ almost surely, and so the stated assumption holds with $C_N = C$.
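As a concrete illustration of the Auroral fit described above, here is a minimal sketch (ours) of ordinary least squares on the order statistics that also exposes the learned weights $\hat\beta$ of equation (14), echoing the Figure 3 discussion below:

```python
import numpy as np

def auroral(Z):
    """Auroral: Aurora with OLS on order statistics. Returns (estimates, weights)."""
    N, K = Z.shape
    fits, coefs = np.zeros((N, K)), np.zeros((K, K))   # coefs[j] = (intercept, beta_1..beta_{K-1})
    for j in range(K):
        Y = Z[:, j]
        X_ord = np.sort(np.delete(Z, j, axis=1), axis=1)
        design = np.column_stack([np.ones(N), X_ord])   # intercept + K - 1 order statistics
        beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
        fits[:, j] = design @ beta
        coefs[j] = beta
    return fits.mean(axis=1), coefs

# With Rectangular (uniform) noise and a weak prior, the learned weights should
# concentrate on the extremes, i.e., Auroral rediscovers the midrange.
rng = np.random.default_rng(4)
mu = rng.normal(0.5, 20.0, size=100_000)
Z = mu[:, None] + rng.uniform(-2 * np.sqrt(3), 2 * np.sqrt(3), size=(100_000, 10))
mu_hat, coefs = auroral(Z)
print(np.round(coefs[0], 2))    # weights for held-out replicate j = 1
```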

Examples of Auroral estimation
In this section we give three examples in which Auroral satisfies strong risk guarantees.
Example 2 (Point mass prior). Suppose the prior on $\mu$ is a point mass at $\bar\mu$; that is, $\mathbb{P}_G[\mu_i = \bar\mu] = 1$. Then, the Bayes rule based on the order statistics, $m^*(X_i^{(\cdot)}) = \bar\mu$, has risk $0$ and is trivially a member of $\mathrm{Lin}(\mathbb{R}^{K-1})$, so $R^{\mathrm{Lin}}_{K-1}(G, F) = 0$. Therefore, by Theorem 5, provided Assumption (i) holds with constant $C_N$, the mean squared error of Auroral is at most $2 C_N K/N$.

Example 3 (Normal likelihood with Normal prior). We revisit Example 1 (see footnote 7), wherein we argued that the averaged oracle $\bar m^*(Z_i)$ of (10) has risk equal to the Bayes risk $R_K(G, F)$ plus $O(1/K^4)$. Auroral attains this risk, plus the linear least squares estimation error, which decays as $O(K/N)$.
In addition to Auroral, we consider two estimators that take advantage of the fact that $\bar Z_i := \frac{1}{K}\sum_{j=1}^{K} Z_{ij}$ is sufficient for $\mu_i$ under the Normal likelihood:
• the CC-L estimator of Coey and Cunningham [2019], which regresses the held-out replicate on the mean of the remaining replicates, and
• the James-Stein estimator [James and Stein, 1961] applied to the averages $\bar Z_i$.
CC-L implicitly uses the assumption of a Normal likelihood by reducing the replicates to their mean, which is the sufficient statistic. Similar to Auroral, its risk equals $R_K(G, F)$ plus $O(1/K^4)$ and the least squares estimation error, which in this case is $O(1/N)$ (only the slope and intercept need to be estimated for each held-out replicate). James-Stein makes full use of the model assumptions (Normal prior, Normal likelihood, known variance) and is expected to perform best; JS achieves the Bayes risk based on $K$ observations, $R_K(G, F)$, plus an error term that decays as $O(1/(NK^2))$. The price Auroral pays, compared to CC-L and JS, for using no assumptions whatsoever when reduction to the mean was possible, is modest.
Example 4 (Exponential families with conjugate priors). Let $\nu$ be a $\sigma$-finite measure on the Borel sets of $\mathbb{R}$ and let $\mathcal{X}$ be the interior of the convex hull of the support of $\nu$. Let $M(\theta) = \log\bigl(\int \exp(z\theta)\, d\nu(z)\bigr)$. Suppose $\Theta = \{\theta \in \mathbb{R} : M(\theta) < \infty\}$ is open, that both $\mathcal{X}$ and $\Theta$ are non-empty, and that $M(\cdot)$ is differentiable for $\theta \in \Theta$ with strictly increasing derivative $\theta \mapsto M'(\theta)$. Now, suppose $\mu_i, Z_{ij}$ are generated from model (3) as follows (with $G, F$ implicitly defined). First, $\theta_i$ is drawn from a prior with Lebesgue density proportional to $\exp(K_0(z_0\theta - M(\theta)))$ on $\Theta$, for some $K_0 > 0$ and $z_0 \in \mathcal{X}$. Next, $\mu_i = M'(\theta_i)$, and $Z_{ij} \mid \mu_i$ is drawn from the distribution with $\nu$-density equal to $\exp(z\theta_i - M(\theta_i))$, where $\theta_i = (M')^{-1}(\mu_i)$. Diaconis and Ylvisaker [1979, Theorem 2] then prove that
$$\mathbb{E}\bigl[\mu_i \mid Z_{i1}, \ldots, Z_{i(K-1)}\bigr] = \frac{K_0 z_0 + \sum_{j=1}^{K-1} Z_{ij}}{K_0 + K - 1},$$
which is linear in the order statistics. Thus, $m^* \in \mathrm{Lin}(\mathbb{R}^{K-1})$ and Auroral matches the risk of $m^*$ up to the error term in Theorem 5.
Footnote 7: See Supplement D for details regarding the regret bounds we discuss here.

Auroral estimation in location families
In this section we provide another example of Auroral estimation. In contrast to the rest of this paper, which treats $K$ as fixed, we consider an asymptotic regime in which both $K$ and $N$ tend to $\infty$ (with $K$ growing substantially more slowly than $N$). In doing so, we hope to provide the following conceptual insights: first, we give a concrete setting in which Auroral dominates any method that first summarizes $Z_i$ by $\bar Z_i$. Second, we elaborate on the expressivity of the class $\mathrm{Lin}(\mathbb{R}^{K-1})$ of (16).
We consider model (3) with $\mu_i \sim G$ for a smooth prior $G$ and $Z_{ij} \overset{\text{iid}}{\sim} F(\cdot \mid \mu_i)$, where $F(\cdot \mid \mu_i)$ has density $f(\cdot - \mu_i)$ with respect to the Lebesgue measure and $f(\cdot)$ is a density that is symmetric around $0$.
We proceed with a heuristic discussion that we will make rigorous in the formal statements below. Suppose for now that $f(\cdot)$ is sufficiently regular with Fisher information
$$I(f) = \int \frac{f'(x)^2}{f(x)}\, dx.$$
Then, by classical parametric theory, we expect that $K \cdot R_K(G, F) \to I(f)^{-1}$ as $K \to \infty$. On the other hand, another classical result in the theory of robust statistics [Bennett, 1952, Jung, 1956, Chernoff, Gastwirth, and Johns, 1967, Van der Vaart, 2000] states that for smooth location families there exists an L-statistic (i.e., a linear combination of the order statistics) that is asymptotically efficient. By Theorem 5, we thus anticipate that the risk of Auroral is equal to $(1 + o(1))\, I(f)^{-1}/K$. In contrast, if we first summarize $Z_i$ by $\bar Z_i$, then the best estimator we can possibly use is the posterior mean $\mathbb{E}[\mu_i \mid \bar Z_i]$. However, for $K$ large (so that the likelihood swamps the prior), $\mathbb{E}[\mu_i \mid \bar Z_i] \approx \bar Z_i$, and so the risk will behave roughly as $\sigma^2/K$, where $\sigma^2 = \int x^2 f(x)\, dx$. We summarize our findings in the corollary below and provide the formal proofs in Supplement E.
Corollary 6 (Smooth location families). Suppose that $G$ satisfies regularity Assumption 1 and $f(\cdot)$ satisfies regularity Assumption 2 (both stated in Supplement E.1). Then, in an asymptotic regime with $K, N \to \infty$, $K^2/N \to 0$, it holds that
$$\limsup\; K \cdot \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\bigl[(\hat\mu^{\mathrm{AurL}}_i - \mu_i)^2\bigr] \;\le\; I(f)^{-1}, \qquad \liminf\; K \cdot \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\bigl[(\hat\mu^{\mathrm{Avg}}_i - \mu_i)^2\bigr] \;\ge\; \sigma^2,$$
where $\hat\mu^{\mathrm{Avg}}_i$ can be either the CC-L estimator $\hat\mu^{\mathrm{CC\text{-}L}}_i$ or the Bayes estimator $\mathbb{E}[\mu_i \mid \bar Z_i]$. Recalling that $I(f)^{-1} \le \sigma^2$, with equality when $f(\cdot)$ is the Gaussian density, we see that Auroral adapts to the unknown density $f(\cdot)$ and outperforms any estimator that first averages $Z_i$.
What about location families that are not regular? Below we give an example, namely the Rectangular location family with $f(\cdot) = U[-B, B]$, $B > 0$, in which a conclusion similar to Corollary 6 holds. The advantage of Auroral is even more pronounced in this case: $O(K^{-2})$ risk for Auroral, versus $O(K^{-1})$ for any method that first averages $Z_i$.
Corollary 7 (Rectangular location family). Suppose $f(\cdot)$ is the uniform density on $[-B, B]$ for some $B > 0$, that $G$ satisfies regularity Assumption 1 in Supplement E.1, and that $K, N \to \infty$, $K^3/N \to 0$. Then the Auroral risk satisfies $\limsup\; K^2 \cdot \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}[(\hat\mu^{\mathrm{AurL}}_i - \mu_i)^2] < \infty$, while $\liminf\; K \cdot \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}[(\hat\mu^{\mathrm{Avg}}_i - \mu_i)^2] \ge \sigma^2$ for $\hat\mu^{\mathrm{Avg}}_i$ as in Corollary 6.

Empirical performance in simulations
In this section, we study the empirical performance of Aurora and competing empirical Bayes algorithms in three scenarios: homoskedastic location families (Section 6.1), heteroskedastic location families (Section 6.2) and a heavy-tailed likelihood (Section 6.3).

Homoskedastic location families
We start by empirically studying the location family problem from Section 5.2 with $K = 10$ replicates and $N = 10^4$ units. We first generate the means $\mu_i$ from one of two possible priors $G$, parameterized by a simulation parameter $A$: the Normal prior $\mathcal{N}(0.5, A)$ and a three-point discrete prior that assigns equal probabilities to three points centered at $0.5$. Both prior distributions have variance $A$.
We then generate the replicates $Z_{ij}$ around each mean $\mu_i$ from one of three location families: Normal, Laplace, or Rectangular. The parameters of these distributions are chosen so that the noise variance is $\sigma^2 = \mathrm{Var}[Z_{ij} \mid \mu_i] = 4$. We benchmark Auroral, Aurora-kNN, and CC-L against the following baselines:
• Standard estimators of location: the mean $\bar Z_i$, the median, and the midrange (that is, $\hat\mu_i = (\max_j\{Z_{ij}\} + \min_j\{Z_{ij}\})/2$).
• Standard empirical Bayes estimators applied to the averages $\bar Z_i$: James-Stein (positive-part), shrinking towards $\sum_{i=1}^{N}\bar Z_i/N$ (which we provide with oracle knowledge of $\sigma^2 = 4$), and the nonparametric maximum likelihood estimator (NPMLE) of Koenker and Mizera [2014] (as implemented in the function "GLmix" of the REBayes package [Koenker and Gu, 2017]), which is a convex programming formulation of the estimation scheme of Kiefer and Wolfowitz [1956] and Jiang and Zhang [2009]. For the NPMLE we estimate the variance for each unit by the sample variance $\hat\sigma_i^2$ over its replicates and use the working approximation $\bar Z_i \mid \mu_i \sim \mathcal{N}(\mu_i, \hat\sigma_i^2/K)$. (A sketch of the data-generating process for this simulation follows below.)
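A sketch of how the three noise distributions can be generated with matched variance $\sigma^2 = 4$ (our own code; function and parameter names are ours):

```python
import numpy as np

def simulate_replicates(mu, family, rng, sigma2=4.0, K=10):
    """Draw an (N, K) array of replicates around mu with Var[Z_ij | mu_i] = sigma2."""
    N = len(mu)
    if family == "normal":
        noise = rng.normal(0.0, np.sqrt(sigma2), size=(N, K))
    elif family == "laplace":
        noise = rng.laplace(scale=np.sqrt(sigma2 / 2), size=(N, K))   # Var = 2 * scale^2
    elif family == "rectangular":
        B = np.sqrt(3 * sigma2)                                       # Var of U[-B, B] = B^2 / 3
        noise = rng.uniform(-B, B, size=(N, K))
    else:
        raise ValueError(family)
    return mu[:, None] + noise

rng = np.random.default_rng(5)
A = 2.0
mu = rng.normal(0.5, np.sqrt(A), size=10_000)       # Normal prior N(0.5, A)
Z = simulate_replicates(mu, "laplace", rng)
```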
The results are shown in Figure 2. The standard location estimators have constant mean squared error (MSE) across all panels, since they do not make use of the prior.
We discuss the case of a Normal prior first: here the MSE of all methods is non-decreasing in $A$. Auroral closely matches the best estimator for every $A$ and likelihood. In the case of the Normal likelihood, James-Stein (with oracle knowledge of $\sigma^2$), CC-L, and Auroral perform best, followed closely by Aurora-kNN and the NPMLE. The standard location estimators are competitive when the prior is relatively uninformative (i.e., $A$ is large). Among these, the mean performs best for the Normal likelihood, the median for the Laplace likelihood, and the midrange for the Rectangular likelihood. The three-component prior highlights a case in which non-linear empirical Bayes shrinkage can be helpful. Here, Aurora-kNN performs best across all settings, closely followed by the NPMLE in the case of the Normal likelihood. The NPMLE is also the second most competitive method for the Laplace likelihood with large prior variance $A$. The behavior of all other methods tracks closely with their behavior under the Normal prior.
One may at this point wonder: what do the Auroral weights (coefficients) $\hat\beta$ of equation (14) look like? In Figure 3, we show these weights from a single replication (with $N = 2 \cdot 10^5$ and $K = 10$) for each of the three likelihoods considered above and for two Normal priors; the weights shown are averaged over the held-out replicate $j$ of the Auroral algorithm (Table 1). First, we focus on the points in blue, which correspond to an uninformative prior ($G = \mathcal{N}(0.5, 400)$).

When the likelihood $F$ is Normal, the Auroral weights are roughly constant and equal to $1/9$. In other words, $\hat\mu^{\mathrm{AurL}}_{i,j}$, the Auroral fit with held-out replicate $j$, is (approximately) the sample mean $\bar X_i(j)$ of $X_i^{(\cdot)}(j)$. When the likelihood $F$ is Laplace, the Auroral weights pick out the median $X_i^{(5)}(j)$ and a few order statistics around it. When the likelihood $F$ is Rectangular, Auroral assigns weight approximately $1/2$ each to the minimum and the maximum and weight $0$ to all other order statistics; in other words, $\hat\mu^{\mathrm{AurL}}_{i,j}$ is (approximately) the midrange of $X_i^{(\cdot)}(j)$. Notice that Auroral did not know the likelihood $F$ in any of these examples. Rather, it adaptively learned an appropriate summary from the data.
Next, we examine the difference between using informative versus uninformative priors $G$. When the prior is informative ($G = \mathcal{N}(0.5, 0.4)$, orange in Figure 3), Auroral automatically learns a non-zero intercept, determined by the prior mean, and the remaining weights are shrunk towards zero.

Heteroskedastic location families
In our second simulation setting, we study location families where $\sigma_i^2$ is also random, so that we find ourselves in the heteroskedastic location family problem. Again we benchmark Auroral, Aurora-kNN ($k$ chosen by leave-one-out cross-validation from the set $\{1, \ldots, 1000\}$), CC-L, the Gaussian NPMLE, and the sample mean. We also consider two estimators which have been proposed specifically for the heteroskedastic Normal problem: the SURE (Stein's Unbiased Risk Estimate) method of Xie et al. [2012], which shrinks towards the grand mean, and the GL (group-linear) estimator of Weinstein et al. [2018]. We apply these estimators to the averages $\bar Z_i$. Both of these estimators were developed under the assumption that the analyst has exact knowledge of $\sigma_i^2$; we therefore provide them with this oracle knowledge (SURE (or.) and GL (or.); the other methods are not given this information). Furthermore, we also apply the group-linear method with the sample variance $\hat\sigma_i^2$ calculated from the replicates.
We use three simulations, inspired by simulation settings a), c), and f) of Weinstein et al. [2018]: in all three simulations we let $N = 10{,}000$, $K = 10$. First we draw $\sigma_i^2 \sim U[0.1, \sigma^2_{\max}]$, where $\sigma^2_{\max}$ is a simulation parameter that we vary. Then, for the first setting we draw $\mu_i \sim \mathcal{N}(0, 0.5)$, while for the last two settings we let $\mu_i = \sigma_i^2$. Weinstein et al. [2018] use the latter as a model of strong mean-variance dependence; the methods that have access to $\sigma_i^2$ can in principle predict perfectly (i.e., the Bayes risk is equal to $0$). Finally, we draw the replicates $Z_{ij}$ given $(\mu_i, \sigma_i^2)$, where $F(\cdot \mid \mu_i, \sigma_i^2)$ is a Normal location-scale family in the first two settings and Rectangular in the third.

[Figure 4: $\sigma_i^2 \sim U[0.1, \sigma^2_{\max}]$, with $\sigma^2_{\max}$ varying on the x-axis; $\mu_i \sim \mathcal{N}(0, 0.5)$ (left panel) or $\mu_i = \sigma_i^2$ (remaining panels); $F(\cdot \mid \mu_i, \sigma_i^2)$ is a Normal location-scale family (first two panels) or Rectangular (last panel). The y-axis shows the mean squared error of the estimation methods.]
Results from the simulations are shown in Figure 4. The oracle SURE and oracle group-linear estimators perform best in the first panel, and oracle group-linear strongly outperforms all other methods in the last two panels. This is not surprising, since oracle group-linear has access to $\sigma_i^2$ and the method was developed precisely for such settings with a strong mean-variance relationship. Among the other methods, Auroral and Aurora-kNN remain competitive. In the first panel they match CC-L, while in the last two panels they outperform group-linear with estimated variances. We point out that Auroral outperforms CC-L in the second panel, despite the Normal likelihood. This is possible because the mean is no longer sufficient in the heteroskedastic problem.

A Pareto example
For our third example we consider a Pareto likelihood, which is heavy-tailed and non-symmetric. Concretely, we let $\mu_i \sim G = U[2, \mu_{\max}]$ (with $\mu_{\max}$ a varying simulation parameter), and $F(\cdot \mid \mu_i)$ is the Pareto distribution with tail index $\alpha = 3$ and mean $\mu_i$. We compare Auroral, Aurora-kNN with $k$ chosen by leave-one-out cross-validation from the set $\{1, \ldots, 100\}$ (as described in Supplement B.2), CC-L, the sample mean and median, as well as the maximum likelihood estimator for the Pareto distribution (assuming the tail index is unknown). For this example we also vary $(K, N) = (20, 10^4), (100, 10^4), (100, 10^5)$. The results are shown in Figure 5. Throughout all settings, Auroral performs best, followed by Aurora-kNN. All methods improve as $K$ increases; Auroral and Aurora-kNN also improve as $N$ increases.
[Figure 5: $F(\cdot \mid \mu_i)$ is the Pareto distribution with mean $\mu_i$ and tail index $\alpha = 3$. The panels correspond to different choices of $K$ and $N$. The y-axis shows the mean squared error of the estimation methods.]
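A sketch of the Pareto data-generating process used here (our own code; we parameterize the Pareto scale so that the mean equals $\mu_i$ for tail index $\alpha = 3$):

```python
import numpy as np

rng = np.random.default_rng(6)
N, K, alpha, mu_max = 10_000, 20, 3.0, 10.0

mu = rng.uniform(2.0, mu_max, size=N)              # prior G = U[2, mu_max]
scale = mu * (alpha - 1) / alpha                   # Pareto mean = alpha * scale / (alpha - 1)
# numpy's rng.pareto draws from the Lomax form; adding 1 and multiplying by the
# scale yields the classical Pareto distribution with minimum `scale`.
Z = scale[:, None] * (1.0 + rng.pareto(alpha, size=(N, K)))

print("target means:", mu[:3])
print("sample means:", Z[:3].mean(axis=1))
```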
Application: Predicting treatment effects at Google

In this section, we apply Auroral to a problem encountered at Google and other technology companies: estimating treatment effects at a fine-grained level. All major technology firms run large randomized controlled experiments, often called A/B tests, to study interventions and to evaluate policies [Tang et al., 2010, Kohavi et al., 2013, Kohavi and Longbotham, 2017, Athey and Luca, 2019]. Estimation of the average treatment effect from such an experiment (e.g., comparing a metric between treated users and control users) is a well-understood statistical task [Wager et al., 2016, Athey and Imbens, 2017]. In the application we consider below, interest instead lies in estimating treatment effects for fine-grained groups: a separate treatment effect for each of thousands of different online advertisers. In this setting, empirical Bayes techniques, such as Auroral, can stabilize estimates by sharing information across advertisers.
To apply Auroral for the task of fine-grained estimation of treatment effects, we require replicates. Interestingly, the data from the technology firm we work with is routinely organized and analyzed using "streaming buckets": the experiment data is divided into $K$ chunks of approximately equal size, called streaming buckets. The buckets correspond to (approximately) disjoint subsets of users, partitioned at random, so that data across different buckets are independent to a sufficient approximation. We refer to Chamandy et al. [2012] for a detailed description of the motivation for using streaming buckets to deal with the data's scale and structure, as well as the statistical and computational issues involved in the analysis of streaming-bucket data. For the purpose of our application, each streaming bucket corresponds to a replicate in (3).
We model the statistical problem of our application as follows. The metric of interest is the cost-per-click (CPC), which is the price a specific advertiser pays for an ad clicked by a user. The goal of the experiment is to estimate the change in CPC due to the treatment. We have data for $N$ advertisers and two treatment arms: control ($w = 0$) and treated ($w = 1$). The data is further divided into $K$ buckets. For each advertiser $i = 1, \ldots, N$, bucket $j = 1, \ldots, K$, and treatment arm $w = 0, 1$, we record the total number of clicks $N_{ijw} \in \mathbb{N}_{>0}$ and the total cost of the clicks $A_{ijw}$. We define $\mathrm{CPC}_{ijw} := A_{ijw}/N_{ijw}$, the empirical cost-per-click for advertiser $i$ in bucket $j$ and treatment arm $w$. In the language of model (3), the advertisers are the units and the buckets are the replicates. Each observation is $Z_{ij} = \mathrm{CPC}_{ij1} - \mathrm{CPC}_{ij0}$, and the goal is to estimate the treatment effect $\mu_i$; the $\alpha_i$ from model (3) capture advertiser-level idiosyncrasies. We consider the following estimation strategies:
1. The aggregate estimate: we pool the data in all buckets and then compute the difference in CPCs, i.e., $\hat\mu_i = \sum_j A_{ij1}/\sum_j N_{ij1} - \sum_j A_{ij0}/\sum_j N_{ij0}$.
2. The mean of the $Z_{ij}$, i.e., $\hat\mu_i = \frac{1}{K}\sum_{j=1}^{K} Z_{ij}$.
3. The CC-L estimator, wherein for each $j$ we use ordinary least squares to regress $Y_i(j)$ onto the aggregate estimate computed from the other $K - 1$ buckets.
4. The CC-L estimator, wherein for each $j$ we use ordinary least squares to regress $Y_i(j)$ onto the mean of the $Z_{ij'}$ in the other $K - 1$ buckets.
5. The Auroral estimator, which (for each $j$) regresses $Y_i(j)$ on the order statistics $X_i^{(\cdot)}(j)$.
We evaluate the methods empirically as follows: we use data from an experiment run at Google for one week, retaining only the top advertisers by number of clicks, resulting in $N > 50{,}000$. The number of replicates is $K = 4$. As ground truth, we use the aggregate estimate based on experiment data from the 3 preceding and 3 succeeding weeks. We then compute the mean squared error of the estimates (calculated from the one week) against the ground truth and report the percent change relative to the aggregate estimate.
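A small sketch of how the replicate matrix $Z$ of model (3) would be assembled from bucket-level totals (the array names `clicks` and `cost`, indexed as advertiser × bucket × arm, are hypothetical; no real data is used):

```python
import numpy as np

def build_replicates(clicks, cost):
    """clicks, cost: arrays of shape (N, K, 2) with arm 0 = control, arm 1 = treated.
    Returns the (N, K) matrix of per-bucket treatment-effect observations Z_ij."""
    cpc = cost / clicks                     # empirical CPC_ijw = A_ijw / N_ijw
    return cpc[:, :, 1] - cpc[:, :, 0]      # Z_ij = CPC_ij1 - CPC_ij0

# Toy usage with fabricated dimensions only:
rng = np.random.default_rng(7)
N, K = 1_000, 4
clicks = rng.integers(50, 500, size=(N, K, 2)).astype(float)
cost = clicks * rng.uniform(0.5, 1.5, size=(N, K, 2))
Z = build_replicates(clicks, cost)          # Z can then be fed into an Aurora-style estimator
```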
The results are shown in Table 2. The table also shows the results of the same analysis applied to a second metric, the change in click-through rate (CTR), which is the proportion of times that an ad shown to a user by a given advertiser is actually clicked. We observe that the improvement in estimation error from Auroral is substantial. Furthermore, Auroral outperforms both variants of CC-L, which in turn outperform the estimators that do not share information across advertisers (i.e., the aggregate estimate and the mean of the $Z_{ij}$).

Conclusion
We have presented a general framework for constructing empirical Bayes estimators from $K$ noisy replicates of $N$ units. The basic idea of our method, which we term Aurora, is to leave one replicate out and regress this held-out replicate on the remaining $K - 1$ replicates. We then repeat this process over all choices of held-out replicate and average the results. We have shown that if the $K - 1$ replicates are first sorted, then even linear regression produces results that are competitive with the best methods, which usually make parametric assumptions, while our method is fully nonparametric.
We conclude by mentioning some direct extensions of Aurora that are suggested by its connection to regression.

More powerful regression methods:
In this paper, we have used linear regression and k-nearest-neighbor regression to learn $\hat m(\cdot)$. But we can go further; for example, we could use isotonic regression [Guntuboyina and Sen, 2018] based on a partial order on $X_i^{(\cdot)}$. Or we could combine linear and isotonic regression by considering single index models with a non-decreasing link function [Balabdaoui et al., 2019], i.e., predictors of the form $m(x^{(\cdot)}) = t(\alpha^\top x^{(\cdot)})$, where $\|\alpha\|_2 = 1$ and $t$ is an unknown non-decreasing function. Other possibilities include recursive partitioning [Breiman et al., 1984, Zeileis et al., 2008], in which linear regression is fit on the leaves of a tree, or even random forests aggregated from such trees [Friedberg et al., 2020, Künzel et al., 2019].

More general targets:
We have only considered estimation of $\mu_i = \mathbb{E}[Z_{ij} \mid \mu_i, \alpha_i]$ in (3). As pointed out in Johns [1957, 1986], this naturally extends to parameters $\theta_i = \mathbb{E}[h(Z_{ij}) \mid \mu_i, \alpha_i]$, where $h$ is a known function. The only modification needed to estimate $\theta_i$ is that we fit a regression model to learn $\mathbb{E}[h(Y_i) \mid X_i^{(\cdot)}]$ instead. We may further extend Vernon Johns' observation to arbitrary U-statistics. Concretely, given $r < K$, $r \in \mathbb{N}$, and a fixed function $h: \mathbb{R}^r \to \mathbb{R}$, we can use Aurora to estimate $\theta_i = \mathbb{E}[h(Z_{i1}, Z_{i2}, \ldots, Z_{ir}) \mid \mu_i, \alpha_i]$. In this case we need to hold out $r$ replicates to form the response. For example, with $r = 2$ and $h(z_1, z_2) = (z_1 - z_2)^2/2$, we can estimate the conditional variance $\sigma_i^2 = \mathrm{Var}[Z_{ij} \mid \mu_i, \alpha_i]$. Denoising the variance with empirical Bayes is an important problem that has proved essential for the analysis of genomic data [Smyth, 2004, Lu and Stephens, 2016]. However, these papers assumed a parametric form for the likelihood, while Aurora would permit fully nonparametric estimation of the variance parameter.
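A sketch of this variance-estimation variant (holding out $r = 2$ replicates and regressing $h(Y_{i1}, Y_{i2}) = (Y_{i1} - Y_{i2})^2/2$ on the order statistics of the rest). The split into consecutive held-out pairs and the kNN regressor are simple choices of ours, not a prescription from the paper:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def aurora_variance(Z, make_regressor=lambda: KNeighborsRegressor(n_neighbors=200)):
    """Estimate sigma_i^2 = Var[Z_ij | mu_i, alpha_i] with Aurora, holding out r = 2 replicates."""
    N, K = Z.shape
    pairs = [(j, j + 1) for j in range(0, K - 1, 2)]   # disjoint held-out pairs
    fits = np.zeros((N, len(pairs)))
    for p, (j1, j2) in enumerate(pairs):
        response = 0.5 * (Z[:, j1] - Z[:, j2]) ** 2    # h(z1, z2) = (z1 - z2)^2 / 2 is unbiased for sigma_i^2
        X_ord = np.sort(np.delete(Z, [j1, j2], axis=1), axis=1)
        fits[:, p] = make_regressor().fit(X_ord, response).predict(X_ord)
    return fits.mean(axis=1)

# Example: heteroskedastic Normal data with sigma_i^2 ~ U[1, 5] and mu_i ~ N(0, 1).
rng = np.random.default_rng(8)
N, K = 20_000, 10
sigma2 = rng.uniform(1.0, 5.0, size=N)
mu = rng.normal(size=N)
Z = mu[:, None] + rng.normal(size=(N, K)) * np.sqrt(sigma2)[:, None]
sigma2_hat = aurora_variance(Z)
print("MSE of sample variance :", np.mean((Z.var(axis=1, ddof=1) - sigma2) ** 2))
print("MSE of Aurora variance :", np.mean((sigma2_hat - sigma2) ** 2))
```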
External covariates: Model (3) posits a priori exchangeability of the $N$ units. However, in many applications, domain experts also have access to side information $\zeta_i$ about each unit. Hence, multiple authors [Fay and Herriot, 1979, Tan, 2016, Kou and Yang, 2017, Banerjee et al., 2020, Coey and Cunningham, 2019, Ignatiadis and Wager, 2019] have developed methods that improve mean estimation by utilizing the information in the $\zeta_i$. Aurora can be directly extended to accommodate such side information: instead of regressing $Y_i$ on $X_i^{(\cdot)}$, one regresses $Y_i$ on both $X_i^{(\cdot)}$ and $\zeta_i$.

Software
We provide reproducible code for our simulation results in the following GitHub repository: https://github.com/nignatiadis/AuroraPaper. A package implementing the method is available at https://github.com/nignatiadis/Aurora.jl. The package is implemented in the Julia programming language [Bezanson et al., 2017].

A Proofs for Section 2

A.1 Proof for Proposition 2
Proof. For the LHS inequality, it suffices to note that $\bar m^*(Z_i)$ is a function of $Z_i$, and so it is dominated by the Bayes rule in terms of mean squared error. For the RHS, we proceed as follows. First,
$$R_{K-1}(G, F) = \mathbb{E}\bigl[(\mu_i - m^*(X_i^{(\cdot)}))^2\bigr] \overset{(i)}{=} \mathrm{Var}[\mu_i] - \mathrm{Var}[m^*(X_i^{(\cdot)})],$$
where $(i)$ follows from the law of total variance and the remaining equalities are a consequence of the definition $m^*(X_i^{(\cdot)}) = \mathbb{E}[\mu_i \mid X_i^{(\cdot)}]$. We now proceed with the key step of our argument, in which $(i)$ and $(iii)$ follow from the law of total variance. For $(ii)$ we used two results: for the right part, the fact that $\mathbb{E}[\bar m^*(Z_i) \mid \mu_i, \alpha_i] = \mathbb{E}[m^*(X_i^{(\cdot)}) \mid \mu_i, \alpha_i]$; for the left part, as announced in the main text, Theorem 2 of Efron and Stein [1981].
We conclude by combining the preceding displays.

A.2 Proof for Theorem 3
Proof. We first obtain a decomposition for a single coordinate $i$. For simplicity, we suppress the dependence on $G$ and $F$ and first prove the result for $\hat\mu_i := \hat\mu^{\mathrm{Aur}}_{i,j}$, i.e., Aurora based on a single held-out response replicate $j \in \{1, \ldots, K\}$.
To see why the cross-term $\mathbb{E}\bigl[(\mu_i - \mathbb{E}[\mu_i \mid Z])(\mathbb{E}[\mu_i \mid Z] - \hat\mu_i)\bigr]$ vanishes in the second equality above, observe that the factor $\mathbb{E}[\mu_i \mid Z] - \hat\mu_i$ is measurable with respect to $Z$. Therefore, the expectation conditional on $Z$ is zero (almost surely), so the unconditional expectation (i.e., the cross-term) is also zero.
Next, we examine the quantity inside the expectation in the second term of (S1). By adding and subtracting $m^*(X_i^{(\cdot)}) := \mathbb{E}[\mu_i \mid X_i^{(\cdot)}]$ and using the inequality $(a + b)^2 \le 2a^2 + 2b^2$, we obtain (S2). Now, we take the expectation of (S2). For the first term, we can repeat the argument from (S1) to obtain the analogous decomposition, which can be rearranged to bound the corresponding cross-term. Summarizing, we obtain the stated bound for a single coordinate $i$, and we average over all $i$ to obtain the desired result. The proof for $\hat\mu_i := \hat\mu^{\mathrm{Aur}}_i$ is the same verbatim, if we replace $m^*(X_i^{(\cdot)})$ by $\bar m^*(Z_i)$, $R_{K-1}(G, F)$ by $\bar R_{K-1}(G, F)$, and $\mathrm{Err}(m^*, \hat m)$ by $\overline{\mathrm{Err}}(\bar m^*, \hat m)$.

B Aurora with Nearest Neighbors
B.1 Proof for Theorem 4

Proof. Throughout the proof we assume that ties among the $X_i^{(\cdot)}(j)$ happen with probability $0$. We can avoid ties as follows. As in Chapter 6 of Györfi et al. [2006], we use kNN to regress $Y_i(j)$ on $(X_i^{(\cdot)}(j), U_i(j))$, where the $U_i(j)$ are drawn independently from $U[0, \varepsilon]$ for some small, fixed $\varepsilon > 0$. The regression function remains the same, since almost surely $\mathbb{E}[Y_i(j) \mid X_i^{(\cdot)}(j)] = \mathbb{E}[Y_i(j) \mid X_i^{(\cdot)}(j), U_i(j)]$. We will, however, suppress the $U_i(j)$ from our notation and assume ties do not occur for the $X_i^{(\cdot)}(j)$; otherwise the proofs go through verbatim by replacing $X_i^{(\cdot)}(j)$ with $(X_i^{(\cdot)}(j), U_i(j))$ in all subsequent arguments. We first claim that:

(S3)
To see this, first note the decomposition (S4). By the Cauchy-Schwarz inequality, the definition $\mathbb{E}[(\mu_i - m^*(X_i^{(\cdot)}))^2] = R_{K-1}(G, F)$, and (13) together with permutation equivariance (with respect to permutations of the units) of the kNN estimator $\hat m(\cdot)$ when there are no ties, taking expectations in (S4) and rearranging yields the claim in (S3). Returning to the main proof, in view of (S3), it suffices to show that the in-sample estimation error vanishes, i.e., (S5). By exchangeability of the units in (3) and permutation equivariance of the kNN estimator $\hat m(\cdot)$ without ties, (S5) is equivalent to proving the same statement for the first unit. We prove this by instead considering the kNN estimator $\hat m_{N,-1}(\cdot)$ applied to all observations except the first, still with the same number of nearest neighbors $k = k_N$ (instead of $k_N - 1$). The first of the resulting terms converges to $0$ by existing results on universal consistency in nonparametric regression, concretely Theorem 6.1 of Györfi et al. [2006]. We need to show that the second term also converges to $0$; its first part goes to $0$ by assumption, and we handle its second part as follows: $(i)$ holds by exchangeability of the $(X_i^{(\cdot)}, Y_i)$, $i = 1, \ldots, N$, and $(ii)$ holds for a constant $\gamma < \infty$ that depends only on the dimension $K$, by Corollary 6.1 of Györfi et al. [2006]. To conclude, we divide by $k^2$, and the result follows since we assumed that $k \to \infty$ and $\mathbb{E}[Y_1^2] < \infty$.

B.2 Tuning of Aurora-kNN by cross-validation
Aurora-kNN is a specific instantiation of the general Aurora algorithm (Table 1) with a specific choice of Step 3, which takes the form of leave-one-out cross-validated k-nearest-neighbor regression (see, e.g., Azadkia [2019] and references therein). The box below presents the algorithm in detail.

Aurora-kNN: Aurora with supervised predictions based on k-nearest-neighbor regression ($k$ chosen by leave-one-out cross-validation).
Input: $Z_i$, $i = 1, \ldots, N$: replicated samples ($K$ replicates per unit); $k_{\max}$: integer upper bound on the number of nearest neighbors to consider.
for $j = 1, \ldots, K$ do
1. Split the replicates for each unit $i$, $Z_i$, into $X_i := (Z_{i1}, \ldots, Z_{i(j-1)}, Z_{i(j+1)}, \ldots, Z_{iK})$ and $Y_i := Z_{ij}$, as in (5).
2. For each $i$, order the values of $X_i$ to obtain $X_i^{(\cdot)}$.
3. Preprocess $X_i^{(\cdot)}$, $i = 1, \ldots, N$, to facilitate nearest-neighbor searches. For $i = 1, \ldots, N$, let $o(i, 1), \ldots, o(i, k_{\max} - 1)$ be the indices of the $k_{\max} - 1$ nearest neighbors of $X_i^{(\cdot)}$, excluding $i$ and sorted by increasing distance. Compute the leave-one-out cross-validation error $\mathrm{LOO}(k)$ of nearest-neighbor regression for each candidate $k$, and let $k_j$ minimize it.
4. Compute the kNN prediction with $k = k_j$ for all $i$, i.e., $\hat\mu^{\mathrm{Aur}}_{i,j} := \frac{1}{k_j}\sum_{l=1}^{k_j} Y_{o(i,l)}$.
end

Two remarks are in order:
1. A naive approach to finding all nearest neighbors in step 3 of the algorithm (say, for a fixed held-out response replicate $j$) has computational complexity $O(N^2 K)$. By preprocessing all ordered samples (at the beginning of step 3), this computational complexity may be decreased, especially for small $K$. In our default implementation we preprocess the data using a kd-tree, as implemented in the NearestNeighbors.jl package [Carlsson et al., 2020], when $K \le 12$; for $K > 12$ we use brute-force search of nearest neighbors.
2. The leave-one-out calculation can be substantially sped up by reusing the computation from step $k$ when computing $\mathrm{LOO}(k+1)$, via an elementary running-mean identity.

C Proof for Auroral estimator (Theorem 5)

Proof. Throughout this proof we let $P_{X(j)}$ be the orthogonal projection operator onto the linear space spanned by the columns of $X^{(\cdot)}(j)$ and the all-ones vector $(1, \ldots, 1)^\top$. We also use vectorized notation, e.g., $\mu = (\mu_1, \ldots, \mu_N)^\top$, $Y(j) = (Y_1(j), \ldots, Y_N(j))^\top$. With this notation it holds that $\hat\mu^{\mathrm{AurL}}_{\cdot, j} = P_{X(j)} Y(j)$. First, by Jensen's inequality, it suffices to bound the error for a fixed $j$. In doing so, we omit $j$ from the notation, e.g., we write $X^{(\cdot)}$ instead of $X^{(\cdot)}(j)$, $P_X$ instead of $P_{X(j)}$, and $Y$ instead of $Y(j)$. We also (with slight abuse of notation) write $m^* = m^*(X^{(\cdot)})$. It then remains to bound the terms $I$, $II$, $III$ of the resulting decomposition.
Bound on $I$: We claim that $I \le R^{\mathrm{Lin}}_{K-1}(G, F)$; this follows by a direct computation.

Bound on $II$: The argument here is similar to results on fixed-design linear regression, see, e.g., Theorem 11.1 in Györfi et al. [2006]. We have that
$$\mathbb{E}\bigl[\|P_X m^* - P_X Y\|_2^2 \mid X^{(\cdot)}\bigr] = \mathrm{Tr}\bigl(P_X \mathrm{Var}[Y \mid X^{(\cdot)}]\bigr),$$
where $\mathrm{Var}[Y \mid X^{(\cdot)}]$ is the $N \times N$ diagonal matrix with $i$-th diagonal entry equal to $\mathrm{Var}[Y_i \mid X_i^{(\cdot)}]$. To conclude with the bound on $\mathbb{E}[\|P_X m^* - P_X Y\|_2^2]$, we provide two separate bounds on $\mathbb{E}[\mathrm{Tr}(P_X \mathrm{Var}[Y \mid X^{(\cdot)}])]$, according to whether Assumption (i) or (ii) of Theorem 5 holds.
We start with (i). We note that $\mathrm{Var}[Y \mid X^{(\cdot)}] \preceq \max_i\{\mathrm{Var}[Y_i \mid X_i^{(\cdot)}]\} \cdot I_N$ in the positive semi-definite order. By trace properties, and since $P_X$ is an orthogonal projection matrix with $\mathrm{Tr}(P_X) \le K$, it thus holds that $\mathrm{Tr}(P_X \mathrm{Var}[Y \mid X^{(\cdot)}]) \le K \max_i\{\mathrm{Var}[Y_i \mid X_i^{(\cdot)}]\}$. Integrating over $X^{(\cdot)}$ and using Assumption (i), we conclude. We turn to (ii). Let $h_i = (P_X)_{ii}$. For the penultimate inequality we use Cauchy-Schwarz, and for the last inequality we use the following two intermediate results: first, $\mathbb{E}[\sum_{i=1}^{N} h_i] = \mathbb{E}[\mathrm{Tr}(P_X)] \le K$, and so by symmetry $\mathbb{E}[h_i] \le K/N$.

Bound on $III$: It remains to bound the cross-term $III$. We will show that $III \le 0$. The crucial fact we use is that $\mathbb{E}[Y \mid \mu, X^{(\cdot)}] = \mathbb{E}[Y \mid \mu] = \mu$:
$$\mathbb{E}\bigl[(\mu - P_X m^*)^\top P_X (m^* - Y) \mid \mu, X^{(\cdot)}\bigr] = (\mu - P_X m^*)^\top P_X (m^* - \mu) \overset{(*)}{=} (\mu - m^*)^\top P_X (m^* - \mu) \overset{(**)}{\le} 0.$$
Integrating over $X^{(\cdot)}, \mu$, we conclude. Let us justify $(*)$ and $(**)$. $(*)$ follows because $(m^* - P_X m^*)^\top P_X (m^* - \mu) = 0$, since $m^* - P_X m^*$ is orthogonal to the space spanned by $P_X$. $(**)$ holds since for any vector $v$ it holds that $v^\top P_X v \ge 0$, by positive semi-definiteness of $P_X$.
The claim at the end of Theorem 5 follows directly from Theorem 3 and the bounds we derived above for $II$.

D Normal prior and Homoskedastic Normal likelihood
We consider the more general case where $Z_{ij} \mid \mu_i \sim \mathcal{N}(\mu_i, \sigma^2)$. The results for the example may be recovered by taking $\sigma^2 = 1$. Throughout we assume that $\sigma^2, A > 0$.

D.1 Oracle risks: Details for Example 1
The three oracle estimators are $\mathbb{E}[\mu_i \mid Z_i] = \lambda^* \bar Z_i$, $m^*(X_i^{(\cdot)}(j)) = \lambda \bar X_i(j)$, and $\bar m^*(Z_i) = \lambda \bar Z_i$, where $\bar Z_i$ is the sample average of $Z_{i1}, \ldots, Z_{iK}$, $\bar X_i(j)$ is the sample average of the $K - 1$ entries of $X_i(j)$, and $\lambda := A/(A + \sigma^2/(K-1))$, $\lambda^* := A/(A + \sigma^2/K)$ (so $\lambda^* \bar Z_i$ is the Bayes rule here). They have the following oracle risks:
$$R_K(G, F) = \frac{A\sigma^2}{KA + \sigma^2}, \qquad R_{K-1}(G, F) = \frac{A\sigma^2}{(K-1)A + \sigma^2}, \qquad \bar R_{K-1}(G, F) = (1 - \lambda)^2 A + \lambda^2 \frac{\sigma^2}{K},$$
and the RHS of Proposition 2, equation (11), holds with equality.

Proof. The first two calculations are standard, and we provide the details for $\bar m^*$ and its risk only. First, by the definition of $\bar m^*$ and since the average of the leave-one-out means equals the overall mean,
$$\bar m^*(Z_i) = \frac{1}{K}\sum_{j=1}^{K} \lambda \bar X_i(j) = \lambda \bar Z_i.$$
To evaluate the risk, decompose $\lambda \bar Z_i - \mu_i = (\lambda - 1)\mu_i + \lambda(\bar Z_i - \mu_i)$, so that
$$\mathbb{E}\bigl[(\bar m^*(Z_i) - \mu_i)^2\bigr] = (1 - \lambda)^2 A + \lambda^2 \frac{\sigma^2}{K}.$$

D.2 Data-driven risks: Details for Example 3
Recall that the Bayes risk $R_K(G, F)$ is equal to $A\sigma^2/(AK + \sigma^2)$. The upper bounds on the risks of the three estimators considered in this example are as follows: Auroral, CC-L, and James-Stein each attain $R_K(G, F)$ up to the error terms described in Example 3.

Proof. We consider each estimator separately.
Auroral: The result follows from Theorem 3 along with the bound on $\overline{\mathrm{Err}}(\bar m^*, \hat m)$ derived in Theorem 5. To apply the latter, we need to compute $\mathrm{Var}[Y_i \mid X_i^{(\cdot)}]$.
CC-L: The derivation is similar to the result for Auroral above and is omitted. The main difference is that we solve a linear least squares problem with an intercept and a single regressor (instead of an intercept and $K - 1$ regressors).
James-Stein: The calculations are standard; see, e.g., Chapter 1 of Efron [2012]. First, we apply Stein's identity. Now notice that $\|\bar Z\|_2^2 \sim (\sigma^2/K + A)\,\chi^2_N$, where $\chi^2_N$ denotes the $\chi^2$ distribution with $N$ degrees of freedom, and so $\mathbb{E}\bigl[(N-2)(\sigma^2/K + A)/\|\bar Z\|_2^2\bigr] = 1$, which in turn yields the stated risk bound. Dividing by $N$ we get the required claim.

D.2.1 Auroral and CC-L with a single held-out response
We conclude this section by noting that we can compute the risk of Auroral and CC-L exactly if we omit the averaging over $j$ in the Aurora algorithm.

Proof. We start with Auroral. Throughout we use the same notation as in Section C. Further, we write $\hat\mu$ for $\hat\mu^{\mathrm{AurL}}_{\cdot, j} = P_{X(j)} Y(j) = P_X Y$. For the remaining term we condition on $X^{(\cdot)}$ and observe that $P_X \mathbb{E}[Y \mid X^{(\cdot)}] = \mathbb{E}[Y \mid X^{(\cdot)}]$. We conclude by using the expression for $\mathrm{Var}[Y_i \mid X_i^{(\cdot)}]$ derived in (S6), by iterated expectation, and by rearranging terms. The result for CC-L can be derived in a similar way. We note that the formula for CC-L appears (with a typo) as Corollary 1 in Coey and Cunningham [2019].
E Location families: Formal statements for Section 5.2

E.1 Regularity assumptions
Assumption 1 (Regular prior). The prior distribution $G$ has compact support $[t_1, t_2]$, $t_1 < t_2$, and has density $g$ w.r.t. the Lebesgue measure. Furthermore, $g$ is absolutely continuous on $[t_1, t_2]$, satisfies $g(t_1) = g(t_2) = 0$, and has finite Fisher information $I(g) = \int g'(t)^2/g(t)\, dt < \infty$.

Assumption 2 (Regular location density). We assume that $F(\cdot \mid \mu_i)$ has Lebesgue density $f(\cdot - \mu_i)$, with $f(\cdot)$ a fixed density, and write (with some abuse of notation) $F(t) = \int_{(-\infty, t]} f(x)\, dx$ for the corresponding distribution function. We assume that: (i) $f(\cdot)$ is symmetric around $0$.
The Laplace density does not satisfy the above regularity conditions. However, we may manually check that the conclusion of Corollary 6 holds in that case as well (see Remark S2 below for the proof in the Laplace case).

E.2 Proof of Corollary 6
We split the proof into three parts, one for each of the estimators considered.

E.2.1 Proof for CC-L estimator
Proof. To simplify the exposition of the proof, we make the centering assumption that $\mathbb{E}_G[\mu] = 0$ and fit the linear regression without an intercept ($\beta_0 = 0$). The results are identical in the uncentered case where $\mathbb{E}_G[\mu] \ne 0$ and we use an intercept.

Lower bound: The estimator takes the form $\hat\mu^{\mathrm{CC\text{-}L}}_i = \hat\beta \bar X_i(j)$ for the OLS coefficient $\hat\beta$; we will justify $(*)$ below. From the Central Limit Theorem and Slutsky's theorem it follows that
By exchangeability, the $N$-averaged risk equals $\mathbb{E}[(\hat\mu^{\mathrm{CC\text{-}L}}_1 - \mu_1)^2]$, and so we have established the lower bound. It remains to show $(*)$. For the first of the required results, note that, by the law of large numbers, $\bar Z_1 = O_P(1)$; furthermore, $\max_{j=1,\ldots,K}\{|Z_{1j}|/K\} = o_P(1)$, since the corresponding probability vanishes for any $\varepsilon > 0$. For the second part of (S8), note that $\mathbb{E}[Y_i(j)\bar X_i(j)] = A$ and $\mathbb{E}[\bar X_i(j)^2] = A + \sigma^2/(K-1)$, where $A = \int \mu^2 g(\mu)\, d\mu$. As a first consequence we have that $\mathbb{E}[M_i(j)] = 0$; second, for any $\varepsilon' > 0$ and any $j$, the corresponding deviation probability can be controlled, and fixing another $\varepsilon > 0$ we conclude.

Upper bound: Repeating the argument of the proof of Theorem 5, and noting that $\mathrm{Var}[Y_i \mid \bar X_i] \le C$ for some $C < \infty$ by our assumptions on the support of $\mu$ and since $\sigma_i^2 = \mathrm{Var}[Y_i \mid \mu_i]$ is constant for location families, we obtain the desired bound: the first term is upper bounded by $\sigma^2/(K-1)$ by plugging in $\beta = 1$, and since $K/N \to 0$ the estimation error is negligible.

E.2.2 Proof for the posterior mean $\mathbb{E}[\mu_i \mid \bar Z_i]$

Proof. Lower bound: Let $\sigma^2 = \mathrm{Var}[Z_i \mid \mu_i]$ (it does not depend on $\mu_i$ for location families) and define the normalized sum $U_{i,K} = \sum_{j=1}^{K} Z_{ij}/\sqrt{K\sigma^2}$. Since $\sigma^2 > 0$ and $I(f) < \infty$, Johnson and Barron [2004, Theorem 1.6] prove convergence of the Fisher information of $U_{i,K}$, where $I(U_{i,K})$ is, with some abuse of notation, the location Fisher information of the Lebesgue density of $U_{i,K}$ conditionally on the location parameter $\mu_i$. (Part of the statement is also that $I(U_{i,K})$ exists and is finite for all $K$ large enough.) Next, note that $\bar Z_i = U_{i,K}\,\sigma/\sqrt{K}$, and so $I(\bar Z_i) = K I(U_{i,K})/\sigma^2$. By Corollary S1 below (van Trees inequality), applied to $\bar Z_i$ and one replicate, the Bayes risk based on $\bar Z_i$ is at least $1/(I(\bar Z_i) + I(g))$. Combining the upper and lower bounds, we find the stated result.

E.2.3 Proof for Auroral estimator
Proof. Lower bound: By Corollary S1 below (van Trees inequality), it holds for the Bayes risk that
$$R_K(G, F) \geq \frac{1}{K I(f) + I(g)},$$
so the lower bound follows.

Upper bound: The first step consists of upper bounding the risk $R^{\mathrm{Lin}}_{K-1}(G, F)$. We will do this by choosing an appropriate function from the class $\mathrm{Lin}(\mathbb{R}^{K-1})$, cf. (16), and bounding its mean squared error for estimating $\mu_i$. To this end, fix $\beta^{(0)}$ for the intercept and let $\beta^{(j)} = h(j/K)$, $j = 1, \dots, K-1$, for a $(1/2 + \delta)$-Hölder continuous function $h : [0, 1] \to \mathbb{R}$ to be chosen below. The L-statistic we study takes the form (S9), i.e., $T_i = \beta^{(0)} + \frac{1}{K-1} \sum_{j=1}^{K-1} \beta^{(j)} X^{(j)}_i$. We may write $X_{ij} = \mu_i + \tilde{X}_{ij}$, where $\tilde{X}_{ij} \sim f(\cdot)$ is 0-centered; here $\tilde{X}^{(j)}_i = X^{(j)}_i - \mu_i$. Choosing $\beta^{(0)}$ to be $-\frac{1}{K-1} \sum_{j=1}^{K-1} h(j/K)\, \mathbb{E}[\tilde{X}^{(j)}_i]$ and using the Hölder continuity of $h$, we thus get that the bias of $T_i$ is $O(K^{-1/2-\delta})$, uniformly over $\mu_i$ in the support of $G$. Van der Vaart [2000, Proof of Theorem 22.3] establishes that
$$\mathrm{Var}[T_i \mid \mu_i] = (K-1)^{-1} \int\!\!\int h(F(x))\, h(F(y)) \left(F(x \wedge y) - F(x) F(y)\right) dx\, dy + o(1/K),$$
where the $o(1/K)$ does not depend on $\mu_i$. Now let us make the concrete choice $h(u) = -\frac{1}{I(f)}\, \ell''(F^{-1}(u))$, where $\ell = \log f$. At the end of the proof, we will show that for this choice of $h$:
$$\int\!\!\int h(F(x))\, h(F(y)) \left(F(x \wedge y) - F(x) F(y)\right) dx\, dy = \frac{1}{I(f)}. \quad (S10)$$
Thus we get that $\mathbb{E}[(T_i - \mu_i)^2] = \mathrm{Bias}(T_i \mid \mu_i)^2 + \mathrm{Var}[T_i \mid \mu_i]$, and so
$$K R^{\mathrm{Lin}}_{K-1}(G, F) \leq \frac{1}{I(f)} (1 + o(1)) \text{ as } K \to \infty. \quad (S11)$$
Therefore, by Theorem 5 and noting that $\mathrm{Var}[Y_i \mid X^{(\cdot)}_i] \leq C$ for some $C < \infty$ (arguing as in the proof for CC-L), we obtain the corresponding bound for Auroral. The conclusion follows by taking $N \to \infty$, since then $K \to \infty$ and $K^2/N \to 0$.

Fisher information calculations: It remains to prove (S10). First,
$$\int_0^1 h(u)\, du = -\frac{1}{I(f)} \int_0^1 \ell''\!\left(F^{-1}(u)\right) du = -\frac{1}{I(f)} \int \ell''(u)\, f(u)\, du = \frac{I(f)}{I(f)} = 1.$$
For the second term, we first note that for the given choice of $h$:
$$\int\!\!\int h(F(x))\, h(F(y)) \left(F(x \wedge y) - F(x) F(y)\right) dx\, dy = \frac{1}{I(f)^2} \int\!\!\int \ell''(x)\, \ell''(y) \left(F(x \wedge y) - F(x) F(y)\right) dx\, dy.$$
Thus we need to show that the double integral in the last expression (the numerator) is equal to $I(f) = \mathbb{E}[\ell'(X)^2]$, where $X \sim f(\cdot)$. To this end, first let $X'$ be an i.i.d. copy of $X$. Then, since $\mathbb{E}[\ell'(X)] = 0$, it follows that $\mathbb{E}[\ell'(X)^2] = \mathbb{E}[(\ell'(X) - \ell'(X'))^2]/2$. On the other hand, by absolute continuity of $\ell'$, we may write, $X, X'$ almost surely:
$$\ell'(X) - \ell'(X') = \int \ell''(x) \left(\mathbf{1}(X' \leq x) - \mathbf{1}(X \leq x)\right) dx.$$
Assume momentarily that we can apply Fubini's theorem. Then, squaring the preceding display and taking expectations,
$$\mathbb{E}\left[\left(\ell'(X) - \ell'(X')\right)^2\right] = 2 \int\!\!\int \ell''(x)\, \ell''(y) \left(F(x \wedge y) - F(x) F(y)\right) dx\, dy,$$
since $\mathbb{E}\left[(\mathbf{1}(X' \leq x) - \mathbf{1}(X \leq x))(\mathbf{1}(X' \leq y) - \mathbf{1}(X \leq y))\right] = 2\left(F(x \wedge y) - F(x) F(y)\right)$. Combining this with the identity $\mathbb{E}[\ell'(X)^2] = \mathbb{E}[(\ell'(X) - \ell'(X'))^2]/2$ proves (S10).
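As a sanity check of (S10) and of the L-statistic construction above, the following sketch (Python with NumPy/SciPy; the standard logistic density is our own illustrative choice, not one used in the paper) evaluates the double integral by quadrature and the variance of the L-statistic with weights $h(j/K)$ by simulation. For the logistic, $I(f) = 1/3$, so both quantities should be close to $1/I(f) = 3$, whereas the sample mean has variance $\pi^2/3 \approx 3.29$ per observation.

```python
import numpy as np
from scipy import integrate
from scipy.stats import logistic

I_f = 1.0 / 3.0                      # Fisher information of the standard logistic

# h(u) = -ell''(F^{-1}(u)) / I(f); for the logistic, ell''(x) = -2 f(x), hence
# h(u) = 2 f(F^{-1}(u)) / I(f) = 6 u (1 - u).
h = lambda u: 6.0 * u * (1.0 - u)

# Quadrature check of (S10): the double integral should equal 1 / I(f) = 3.
integrand = lambda y, x: (h(logistic.cdf(x)) * h(logistic.cdf(y))
                          * (logistic.cdf(min(x, y)) - logistic.cdf(x) * logistic.cdf(y)))
val, _ = integrate.dblquad(integrand, -20, 20, -20, 20)
print("double integral:", val, "  1/I(f):", 1.0 / I_f)

# Simulation check of the L-statistic T_i with weights beta^(j) = h(j/K):
# (K - 1) * Var[T_i | mu_i] should approach 1 / I(f) = 3, whereas the plain
# sample mean has (K - 1) * variance = pi^2 / 3 ~ 3.29.
rng = np.random.default_rng(1)
K, reps = 201, 20000
j = np.arange(1, K)                  # j = 1, ..., K - 1
w = h(j / K) / (K - 1)               # weights on the K - 1 order statistics
X = np.sort(rng.logistic(size=(reps, K - 1)), axis=1)
T = X @ w
print("(K-1)*Var[T]:", (K - 1) * T.var(), "  (K-1)*Var[mean]:", (K - 1) * X.mean(axis=1).var())
```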
Remark S2 (The Laplace case). The scale parameterization is such that $\sigma^2 = \int x^2 f(x)\, dx$. We note that $f(\cdot)$ is absolutely continuous and is differentiable in quadratic mean with Fisher information equal to $I(f) = 2/\sigma^2$ [Lehmann and Romano, 2006, Example 12.2.4]. These properties of $f(\cdot)$ suffice for most steps of the proof of Corollary 6. The higher-order smoothness is required only for deriving the upper bound for the Auroral estimator. In particular, the construction of an L-statistic in (S9), used to upper bound $R^{\mathrm{Lin}}_{K-1}(G, F)$ in (S11), no longer works. However, the following simple choice works instead: we choose $T_i$ as the median of $X_i$. For simplicity we take limits over even $K$, i.e., we assume that $K = 2b$ for $b \in \mathbb{N}_{\geq 1}$ and take $T_i = X^{(b)}_i$ and $b \to \infty$. Chu and Hotelling [1955] then prove that
$$\mathbb{E}\left[(T_i - \mu_i)^2\right] \Big/ \left(\frac{1}{4 f^2(0)(K-1)}\right) \to 1 \quad \text{as } b \to \infty.$$
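A quick simulation check of this approximation (Python/NumPy; the unit Laplace scale and the values of $K$ are our own illustrative choices): for Laplace noise, the MSE of the median of the $K-1$ retained replicates should be close to $1/(4 f^2(0)(K-1))$, which for unit scale equals $1/(K-1)$, half the MSE of the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
tau = 1.0                     # Laplace scale; f(0) = 1 / (2 tau), I(f) = 1 / tau^2
reps = 50000

for K in (20, 100, 500):      # K even, so the median of K - 1 draws is a single order statistic
    n = K - 1                                           # retained replicates X_i^(.)
    X = rng.laplace(0.0, tau, size=(reps, n))           # mu_i = 0 without loss of generality
    med = np.median(X, axis=1)                          # T_i = sample median
    approx = 1.0 / (4 * (1 / (2 * tau)) ** 2 * n)       # 1 / (4 f(0)^2 (K - 1))
    print(f"K={K}: MSE(median)={np.mean(med ** 2):.4f}, "
          f"1/(4 f(0)^2 (K-1))={approx:.4f}, MSE(mean)={np.mean(X.mean(axis=1) ** 2):.4f}")
```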

E.2.4 Proof of Corollary 7
As in the proof of Corollary 6, we need to bound the risk of each of the three estimators separately. It will be convenient to do the following preliminary calculation in the rectangular location family.

CC-L: All steps of the proof in Section E.2.1 go through, and so we get the corresponding rate.

Posterior mean based on the average, $\mathbb{E}[\mu_i \mid \bar{Z}_i]$: In this case the proof of Section E.2.2 goes through, and we get the same expression as for CC-L above. There is one subtlety involved in applying Johnson and Barron [2004, Theorem 1.6]: the Fisher information of the rectangular distribution is not well-defined. Nevertheless, the convolution of $f(\cdot)$ with itself, i.e., the density of $Z_{i1} + Z_{i2}$, is equal to the triangular density, which has finite Fisher information and is absolutely continuous. Thus, we may apply Theorem 1.6 of Johnson and Barron [2004].
Auroral: Let $T_i = (X^{(1)}_i + X^{(K-1)}_i)/2$ be the sample midrange based on $X^{(\cdot)}_i$. $T_i$ is unbiased for $\mu_i$, and Rider [1957, Section 4] showed that $\mathrm{Var}[T_i \mid \mu_i] = 2B^2/(K(K+1))$. Since $T_i \in \mathrm{Lin}(\mathbb{R}^{K-1})$, it follows that $K(K+1)\, R^{\mathrm{Lin}}_{K-1}(G, F) \leq 2B^2$. Applying Theorem 5 and noting that $\mathrm{Var}[Y_i \mid X^{(\cdot)}_i] \leq C$ for some $C < \infty$, we get the corresponding bound; since $K^3/N \to 0$, we conclude. The statement of the corollary follows by combining the results for the three estimators.
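A quick simulation check of the midrange variance formula (Python/NumPy; we assume the rectangular family means $X_{ij} \sim \mathrm{Unif}[\mu_i - B, \mu_i + B]$, and the half-width $B = 1$ and the values of $K$ are our own illustrative choices): the variance of the midrange of the $K-1$ retained replicates should match $2B^2/(K(K+1))$.

```python
import numpy as np

rng = np.random.default_rng(3)
B = 1.0                        # half-width of the rectangular (uniform) noise
reps = 50000

for K in (10, 50, 200):
    n = K - 1                                          # retained replicates X_i^(.)
    X = rng.uniform(-B, B, size=(reps, n))             # mu_i = 0 without loss of generality
    midrange = (X.min(axis=1) + X.max(axis=1)) / 2.0   # T_i = (X^(1) + X^(K-1)) / 2
    print(f"K={K}: Var(midrange)={midrange.var():.3e}, "
          f"2 B^2 / (K (K+1))={2 * B ** 2 / (K * (K + 1)):.3e}")
```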

E.2.5 The van Trees inequality
Corollary S1 (The van Trees inequality for location families). In the setting of Section 5.2, assume that:
(a) The density $f(\cdot)$ w.r.t. the Lebesgue measure is absolutely continuous and symmetric around 0.
(c) The prior distribution $G$ has compact support $[t_1, t_2]$ and has density $g$ w.r.t. the Lebesgue measure. Furthermore, $g$ is absolutely continuous on $[t_1, t_2]$, satisfies $g(t_1) = g(t_2) = 0$ and also has finite Fisher information:
$$I(g) = \int \frac{g'(x)^2}{g(x)}\, \mathbf{1}(g(x) > 0)\, dx < \infty.$$
Then, the Bayes risk $R_K(G, F)$ for estimating $\mu_i$ satisfies:
$$R_K(G, F) \geq \frac{1}{K I(f) + I(g)}.$$
Proof. This is a direct corollary of the van Trees inequality [Van Trees, 1968, Gill and Levit, 1995]. Concretely, here we apply the form presented in Theorem 2.13 of Tsybakov [2008].
These results are phrased more generally for models $p(x, t)$, where $p(\cdot, t)$ is the Lebesgue density for a fixed parameter value $t$. Here we have $p(x, t) = f(x - t)$, which simplifies the results. Concretely, we quickly verify the three assumptions (i), (ii), (iii) of Theorem 2.13 in Tsybakov [2008]: For (i), joint measurability of $p(x, t)$ in $(x, t)$ follows from absolute continuity of $f(\cdot)$. For (ii), we use translation invariance of the Lebesgue measure to note that the Fisher information is constant as a function of the location parameter. We also use the fact that the Fisher information in the experiment where we observe $K$ independent replicates is equal to $K$ times the Fisher information in the experiment with a single replicate. Finally, assumption (iii) in Tsybakov [2008] is identical to assumption (c).
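The lower bound of Corollary S1 is easy to evaluate numerically. The following sketch (Python with NumPy/SciPy; the raised-cosine prior on $[t_1, t_2]$ and the standard logistic noise density are our own illustrative choices that satisfy the stated assumptions, not choices from the paper) computes $I(g)$ by quadrature and prints $1/(K I(f) + I(g))$ for a few values of $K$.

```python
import numpy as np
from scipy import integrate

# Raised-cosine prior on [t1, t2]: g(t1) = g(t2) = 0 and g is absolutely continuous.
t1, t2 = -2.0, 2.0
m, s = (t1 + t2) / 2.0, (t2 - t1) / 2.0
g  = lambda x: (1.0 + np.cos(np.pi * (x - m) / s)) / (2.0 * s)
dg = lambda x: -np.pi * np.sin(np.pi * (x - m) / s) / (2.0 * s ** 2)

# I(g) = int g'(x)^2 / g(x) dx over {g > 0}; for this prior it equals pi^2 / s^2.
I_g, _ = integrate.quad(lambda x: dg(x) ** 2 / g(x), t1 + 1e-9, t2 - 1e-9)
I_f = 1.0 / 3.0        # Fisher information of the standard logistic noise density

print("I(g) numeric:", I_g, "  closed form pi^2 / s^2:", np.pi ** 2 / s ** 2)
for K in (2, 10, 100):
    print(f"K={K}: van Trees bound 1/(K I(f) + I(g)) = {1.0 / (K * I_f + I_g):.5f}")
```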

• Coey and Cunningham [2019] (CC-L): We proceed similarly to the Auroral algorithm with one modification. For the $j$-th held-out replicate we regress $Y_i(j)$ on $\bar{X}_i(j) = \frac{1}{K-1} \sum_{j' \neq j} Z_{ij'}$ using ordinary least squares to obtain $\hat{m}^{\mathrm{CC}}(\bar{X}_i(j))$. Finally, we estimate $\mu_i$ by $\hat{\mu}^{\mathrm{CC\text{-}L}}_i$.

Figure 2: Homoskedastic location families: The MSE of the 8 estimators as a function of the prior standard deviation. Each column represents a different prior distribution: Normal (left) and a three-point distribution (right). Each row represents a different location family for the likelihood (Normal, Laplace, Rectangular).

Figure 3: The coefficients of the intercept and the order statistics in the linear regression model for $\hat{m}$ in Auroral: The colors represent different choices of prior $G$ (defined in the legend), while the panels represent different choices of likelihood $F$. The coefficients shown have been averaged over the held-out replicate $j$ of the Auroral algorithm (Table 1).

Figure 5: Pareto distribution example: Data are generated as $\mu_i \sim U[2, \mu_{\max}]$, with $\mu_{\max}$ varying on the x-axis, and $Z_{ij} \mid \mu_i \sim F(\cdot \mid \mu_i)$, where $F(\cdot \mid \mu_i)$ is the Pareto distribution with mean $\mu_i$ and tail index $\alpha = 3$. The panels correspond to different choices of $K$ and $N$. The y-axis shows the mean squared error of the estimation methods.

Table 2: Empirical performance on the advertiser-level estimation problem: Percent change in mean squared error for estimating the change in cost-per-click (∆CPC) and the change in click-through rate (∆CTR), compared to the aggregate estimate (± standard errors).