An Example of an Improvable Rao–Blackwell Improvement, Inefficient Maximum Likelihood Estimator, and Unbiased Generalized Bayes Estimator

The Rao–Blackwell theorem offers a procedure for converting a crude unbiased estimator of a parameter θ into a “better” one, in fact unique and optimal if the improvement is based on a minimal sufficient statistic that is complete. In contrast, behind every minimal sufficient statistic that is not complete, there is an improvable Rao–Blackwell improvement. This is illustrated via a simple example based on the uniform distribution, in which a rather natural Rao–Blackwell improvement is uniformly improvable. Furthermore, in this example the maximum likelihood estimator is inefficient, and an unbiased generalized Bayes estimator performs exceptionally well. Counterexamples of this sort can be useful didactic tools for explaining the true nature of a methodology and possible consequences when some of the assumptions are violated. [Received December 2014. Revised September 2015.]


INTRODUCTION
Statistical theory courses usually start with basic notions for describing how much information on some unknown parameter θ can be obtained from a set of data Y. These are likelihood, sufficiency, minimal sufficiency, completeness, and the Rao-Blackwell (Rao 1945; Blackwell 1947) and Lehmann-Scheffé (Lehmann and Scheffé 1950, 1955) theorems. Familiarity with these notions is assumed.
The Rao-Blackwell theorem (RBT) offers a procedure (coined "Rao-Blackwellization" seemingly by Berkson 1955) for improving a crude unbiased estimator $g(Y)$ of the parameter θ into a better one (in mean squared error or under any other convex loss function), by taking the conditional expectation of $g(Y)$ given some sufficient statistic $T = T(Y)$, that is, $\hat\theta_{RB} = E_\theta[g(Y) \mid T]$ (this is a statistic because $T$ is sufficient). The Lehmann-Scheffé theorem and RBT taken together state that the (unique) unbiased estimator based on a complete minimal sufficient statistic $T$ achieves uniformly smallest expected loss among unbiased estimators under any convex loss function (the common term UMVUE, uniformly minimum-variance unbiased estimator, stresses only squared loss). Furthermore, if a parameter can be unbiasedly estimated at all, then it can also be unbiasedly estimated by a function of $T$, and the Rao-Blackwell improvement of the former automatically leads to the latter. If an unbiased estimator of a parameter is not a function of the sufficient statistic $T$, the Rao-Blackwell improvement based on $T$ is a strict improvement over the original unbiased estimator.
Classical examples for these ideas are often based on unconstrained exponential families of distributions in which a complete sufficient statistic always exists, and the Rao-Blackwell improvement yields optimal unbiased results (see Abramovich and Ritov 2013). However, applying Rao-Blackwell improvements with noncomplete minimal sufficient statistic T will always yield some estimator that fails to have minimal possible variance, at least somewhere in the parameter space: Any non-degenerate function of T with mean identically zero (the existence of such is the definition of noncompleteness) is an unbiased estimator of the zero function that is left unchanged by Rao-Blackwellization because it is already a function of T, but is dominated (with strict variance inequality somewhere) by the zero statistic.
The didactically motivated family of examples to be introduced does not depend on delicate measure-theoretical pathologies of noncountably generated families of distributions. It deals instead with commonplace positive uniformly distributed random variables parameterized by their mean, and yields a uniformly dominated Rao-Blackwell improvement. Consider the uniform distributions $U((1-k)\theta, (1+k)\theta)$ with unknown mean θ > 0 and known design parameter k ∈ (0, 1), henceforth the scale-uniform family of distributions. This family does not satisfy the usual differentiability assumptions leading to Fisher information, the Cramér-Rao bound, and efficiency of maximum likelihood estimators (MLEs). For this family, the MLE is inefficient.
While proper Bayes estimators cannot be unbiased (Bickel and Blackwell 1967), unbiased examples have been built under improper priors, such as the sample mean in the normal case. An explicit Bayes estimator built under a specific improper prior will be shown to be unbiased, simultaneously for all sample sizes and all k.

AN IMPROVABLE RAO-BLACKWELL IMPROVEMENT
Let $X_1, X_2, \ldots, X_n$ be a random sample from a scale-uniform distribution $X \sim U((1-k)\theta, (1+k)\theta)$, with unknown mean $E[X] = \theta$ and known design parameter k ∈ (0, 1). In the search for "best" possible unbiased estimators for θ, it is natural to consider $X_1$ as an initial (crude) unbiased estimator for θ and then try to improve it. Since $X_1$ is not a function of $T = (X_{(1)}, X_{(n)})$, the minimal sufficient statistic for θ (where $X_{(1)} = \min_i X_i$ and $X_{(n)} = \max_i X_i$), it may be improved using the Rao-Blackwell theorem as follows:

$$\hat\theta_{RB} = E_\theta[X_1 \mid X_{(1)}, X_{(n)}] = \frac{X_{(1)} + X_{(n)}}{2},$$

since, given $T$, $X_1$ equals $X_{(1)}$ or $X_{(n)}$ with probability $1/n$ each and is otherwise uniformly distributed on $(X_{(1)}, X_{(n)})$, so that its conditional distribution is symmetric about the midpoint. Although this estimator has lower variance than $X_1$ (and $\bar X_n$), it is not even best among the linear unbiased estimators for θ. For this family of uniform distributions, there exists an unbiased estimator $\hat\theta_{LV}$ for θ with uniformly lower variance than the Rao-Blackwell improvement of $X_1$. Both $\hat\theta_{LV}$ and $\hat\theta_{RB}$ are linear in $X_{(1)}$ and $X_{(n)}$, but with different coefficients. This will be shown in two steps: first, two unbiased estimators $\hat\theta_m$ and $\hat\theta_M$ of θ will be defined, as constant multiples of (the minimum) $X_{(1)}$ and (the maximum) $X_{(n)}$, respectively. Second, among the affine combinations of $\hat\theta_m$ and $\hat\theta_M$, a minimum-variance solution $\hat\theta_{LV}$ will be found, with strictly lower variance than $\hat\theta_{RB}$, uniformly in θ > 0. This idea of taking an affine combination of two unbiased estimators to define a new unbiased estimator appeared at least 200 years ago in a work by Laplace (see Stigler 1973), and is still found in modern use (e.g., see recent work by Damilano and Puig 2004).
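As a preliminary, let $Y_1, Y_2, \ldots, Y_n$ be iid, with common $U(a, b)$ distribution. The extreme order statistics $Y_{(1)}$ and $Y_{(n)}$ then satisfy

$$E[Y_{(1)}] = a + \frac{b-a}{n+1}, \qquad E[Y_{(n)}] = b - \frac{b-a}{n+1},$$

$$\mathrm{Var}[Y_{(1)}] = \mathrm{Var}[Y_{(n)}] = \frac{n (b-a)^2}{(n+1)^2 (n+2)}, \qquad \mathrm{Cov}[Y_{(1)}, Y_{(n)}] = \frac{(b-a)^2}{(n+1)^2 (n+2)}.$$

Step 1: Substituting $a = (1-k)\theta$ and $b = (1+k)\theta$, the last formulas take the form

$$E[X_{(1)}] = \left(1 - k\,\frac{n-1}{n+1}\right)\theta, \qquad E[X_{(n)}] = \left(1 + k\,\frac{n-1}{n+1}\right)\theta,$$

$$\mathrm{Var}[X_{(1)}] = \mathrm{Var}[X_{(n)}] = \frac{4 n k^2 \theta^2}{(n+1)^2 (n+2)}, \qquad \mathrm{Cov}[X_{(1)}, X_{(n)}] = \frac{4 k^2 \theta^2}{(n+1)^2 (n+2)},$$

giving rise to the "basic" unbiased estimators of θ

$$\hat\theta_m = \frac{(n+1)\, X_{(1)}}{n+1-k(n-1)}, \qquad \hat\theta_M = \frac{(n+1)\, X_{(n)}}{n+1+k(n-1)},$$

with variances

$$\mathrm{Var}[\hat\theta_m] = \frac{4 n k^2 \theta^2}{(n+2)\,(n+1-k(n-1))^2}, \qquad \mathrm{Var}[\hat\theta_M] = \frac{4 n k^2 \theta^2}{(n+2)\,(n+1+k(n-1))^2}.$$

These variances reveal that $X_{(n)}$ is strictly more informative about θ than $X_{(1)}$ throughout the range of n and k. Indeed, the variance of $\hat\theta_M$ is uniformly smaller than the variance of $\hat\theta_m$, even asymptotically:

$$\frac{\mathrm{Var}[\hat\theta_M]}{\mathrm{Var}[\hat\theta_m]} = \left(\frac{n+1-k(n-1)}{n+1+k(n-1)}\right)^2 \longrightarrow \left(\frac{1-k}{1+k}\right)^2 < 1 \quad (n \to \infty).$$

Step 2: When k < 1, the basic estimators $\hat\theta_m$ and $\hat\theta_M$ generate the more general family of unbiased estimators $\hat\theta^{(\alpha)}_{mM}$ of θ obtained as their affine combinations

$$\hat\theta^{(\alpha)}_{mM} = \alpha\, \hat\theta_M + (1-\alpha)\, \hat\theta_m,$$

from which it is easy to identify the Rao-Blackwell improvement $\hat\theta_{RB}$ of $X_1$ as the case with $\alpha = \alpha_{RB} = \frac{1}{2}\left(1 + \frac{n-1}{n+1}\,k\right) > \frac{1}{2}$, and $\hat\theta_{LV}$ as the standard analytically derived case with

$$\alpha_{LV} = \frac{\mathrm{Var}[\hat\theta_m] - \mathrm{Cov}[\hat\theta_m, \hat\theta_M]}{\mathrm{Var}[\hat\theta_m] + \mathrm{Var}[\hat\theta_M] - 2\,\mathrm{Cov}[\hat\theta_m, \hat\theta_M]}$$

and minimal variance

$$\mathrm{Var}[\hat\theta_{LV}] = \frac{\mathrm{Var}[\hat\theta_m]\,\mathrm{Var}[\hat\theta_M] - \mathrm{Cov}[\hat\theta_m, \hat\theta_M]^2}{\mathrm{Var}[\hat\theta_m] + \mathrm{Var}[\hat\theta_M] - 2\,\mathrm{Cov}[\hat\theta_m, \hat\theta_M]}$$

that in the present case becomes

$$\alpha_{LV} = \frac{(1+k)\,(n+1+k(n-1))}{2\,(n+1+k^2(n-1))}, \qquad \mathrm{Var}[\hat\theta_{LV}] = \frac{2 k^2 \theta^2}{(n+2)\,(n+1+k^2(n-1))},$$

which yields the following two representations for $\hat\theta_{LV}$

$$\hat\theta_{LV} = \frac{(1+k)(n+1+k(n-1))\,\hat\theta_M + (1-k)(n+1-k(n-1))\,\hat\theta_m}{2\,(n+1+k^2(n-1))} = \frac{(n+1)\left[(1-k)\,X_{(1)} + (1+k)\,X_{(n)}\right]}{2\,(n+1+k^2(n-1))}.$$

For comparison, $\mathrm{Var}[\hat\theta_{RB}] = 2k^2\theta^2 / ((n+1)(n+2))$, so that $\mathrm{Var}[\hat\theta_{LV}] / \mathrm{Var}[\hat\theta_{RB}] = (n+1)/(n+1+k^2(n-1)) < 1$ for all n ≥ 2.

The asymptotic standard deviation $\sigma[\hat\theta_{LV}]$, proportional to $k\sqrt{2/(1+k^2)}$, is smaller than $\sigma[\hat\theta_M]$, proportional to $2k/(1+k)$. However, their asymptotic ratio ranges from only $\sqrt{2}$ (as k ↓ 0) to 1 (as k ↑ 1), which means that the two estimators $\hat\theta_M$ and $\hat\theta_{LV}$ are of comparable accuracy throughout the range, while $\hat\theta_m$ becomes infinitely worse as k → 1.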
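The variance ordering of the four unbiased estimators can be checked numerically. The sketch below is illustrative only: the closed forms are reconstructed from the standard moments of uniform order statistics and should be read as assumptions, and the function names (`theoretical_variances`, `estimates`, `simulate`) are hypothetical, not taken from the article.

```python
import random


def theoretical_variances(n, k, theta=1.0):
    """Closed-form variances (assumed reconstructions) of four unbiased estimators."""
    s = (k * theta) ** 2
    a_m = n + 1 - k * (n - 1)    # (n + 1) * E[X_(1)] / theta
    a_M = n + 1 + k * (n - 1)    # (n + 1) * E[X_(n)] / theta
    d = n + 1 + k * k * (n - 1)
    return {
        "m": 4.0 * n * s / ((n + 2) * a_m ** 2),
        "M": 4.0 * n * s / ((n + 2) * a_M ** 2),
        "RB": 2.0 * s / ((n + 1) * (n + 2)),
        "LV": 2.0 * s / ((n + 2) * d),
    }


def estimates(sample, k):
    """Evaluate the four estimators on a single sample."""
    n = len(sample)
    lo, hi = min(sample), max(sample)
    d = n + 1 + k * k * (n - 1)
    return {
        "m": (n + 1) * lo / (n + 1 - k * (n - 1)),
        "M": (n + 1) * hi / (n + 1 + k * (n - 1)),
        "RB": 0.5 * (lo + hi),
        "LV": (n + 1) * ((1 - k) * lo + (1 + k) * hi) / (2.0 * d),
    }


def simulate(n, k, theta=1.0, reps=20000, seed=0):
    """Monte Carlo (mean, variance) of each estimator under the scale-uniform model."""
    rng = random.Random(seed)
    names = ("m", "M", "RB", "LV")
    total = dict.fromkeys(names, 0.0)
    total_sq = dict.fromkeys(names, 0.0)
    for _ in range(reps):
        sample = [rng.uniform((1 - k) * theta, (1 + k) * theta) for _ in range(n)]
        for name, value in estimates(sample, k).items():
            total[name] += value
            total_sq[name] += value * value
    result = {}
    for name in names:
        mean = total[name] / reps
        result[name] = (mean, total_sq[name] / reps - mean * mean)
    return result


if __name__ == "__main__":
    exact = theoretical_variances(5, 0.5)
    empirical = simulate(5, 0.5)
    for name in ("m", "M", "RB", "LV"):
        mean, var = empirical[name]
        print(f"{name:>2}: mean={mean:.4f} var={var:.5f} exact={exact[name]:.5f}")
```

With n = 5 and k = 0.5 the simulated means should all sit near θ = 1, and the simulated variances should track the closed forms, with the smallest value for the minimum-variance affine combination.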
We proceed now to analyze the Rao-Blackwell improvement $\hat\theta_{RB}$. From a probability point of view (i.e., when θ is a known constant), $E_\theta[X_1 \mid X_{(1)}, X_{(n)}]$ treats $X_{(1)}$ and $X_{(n)}$ symmetrically. Indeed, the two are homoscedastic and stochastically equidistant from θ. From a statistics point of view (i.e., when learning about θ), $X_{(n)}$ is more informative than $X_{(1)}$, in the sense that θ is unbiasedly estimated more accurately by contracting $X_{(n)}$ than by expanding $X_{(1)}$. Thus, $\hat\theta_M$ should be intuitively expected to carry the more decisive weight.

A Remark on Minimal But NonComplete Sufficient Statistics
The scale-uniform example given illustrates that using the Rao-Blackwell theorem with a noncomplete minimal sufficient statistic on a crude initial (unbiased) estimator does not always yield an estimator with the lowest possible variance. And in fact, it never will. Any unbiased estimator that is a function of the minimal sufficient statistic is its own Rao-Blackwell improvement. If one such estimator has larger variance than another (such as $\hat\theta_m$ vs. $\hat\theta_M$), then the first has been improved, and if the two have equal variances, their average improves both. This is in essence the message in Torgersen's converse to the Rao-Blackwell theorem (Torgersen 1988).
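To make the noncompleteness concrete for the scale-uniform family, the following worked step exhibits a nondegenerate function of $T = (X_{(1)}, X_{(n)})$ with mean identically zero. It is a sketch derived from the standard moments of uniform order statistics; the symbols $h$ and $c$ are introduced here for illustration only.

```latex
h(T) = \hat\theta_M - \hat\theta_m
     = (n+1)\left[\frac{X_{(n)}}{n+1+k(n-1)} - \frac{X_{(1)}}{n+1-k(n-1)}\right],
\qquad
E_\theta[h(T)] = \theta - \theta = 0 \quad \text{for all } \theta > 0 .

% h(T) is not almost surely zero, so T is not complete.
% Every \hat\theta_{RB} + c\,h(T) is an unbiased function of T, with variance
\mathrm{Var}_\theta\big[\hat\theta_{RB} + c\,h(T)\big]
  = \mathrm{Var}_\theta[\hat\theta_{RB}]
  + 2c\,\mathrm{Cov}_\theta[\hat\theta_{RB}, h(T)]
  + c^2\,\mathrm{Var}_\theta[h(T)],

% minimized at
c^{*} = -\,\frac{\mathrm{Cov}_\theta[\hat\theta_{RB}, h(T)]}{\mathrm{Var}_\theta[h(T)]} .
```

Since $\hat\theta_m$ and $\hat\theta_M$ are constant multiples of $X_{(1)}$ and $X_{(n)}$ with different multipliers, the covariance above is nonzero for n ≥ 2, so $c^{*} \neq 0$ and the Rao-Blackwell improvement is strictly improved by adding a multiple of the zero-mean statistic.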

INEFFICIENCY OF THE MAXIMUM LIKELIHOOD ESTIMATOR
Under the usual differentiability assumptions, MLEs converge to the true parameter value at rate $n^{-1/2}$ efficiently, that is, with asymptotic variance that achieves the Cramér-Rao lower bound for the variance of unbiased estimators. In the model at hand these assumptions are violated and the parameter θ can be estimated, as seen above, with a faster rate of consistency $n^{-1}$.
The likelihood function equals $(2k\theta)^{-n}$ on the feasibility interval $[L, H]$, with $L = X_{(n)}/(1+k)$ and $H = X_{(1)}/(1-k)$, and zero elsewhere; being decreasing in θ, it is maximized at $L$. It exhibits a nice feature of the Rao-Blackwell improvement $\hat\theta_{RB} = \frac{1+k}{2} L + \frac{1-k}{2} H$: it is the only unbiased estimator of θ in the class $\hat\theta^{(\alpha)}_{mM}$ that is guaranteed to take values in the feasibility interval $[L, H]$, since it is the only one with coefficients adding up to 1 when viewed as a linear combination of $L$ and $H$. Of course, feasibility is guaranteed for the MLE ($= L$) and for Bayesian estimators, the subject of the next section.
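The feasibility property can be illustrated on a single sample. The sketch below assumes the feasibility interval is $[X_{(n)}/(1+k),\ X_{(1)}/(1-k)]$ and uses closed forms for the linear estimators reconstructed from uniform order-statistic moments; the sample values and function names are hypothetical.

```python
def feasibility_interval(sample, k):
    """Return (L, H), the set of theta values compatible with the sample."""
    return max(sample) / (1 + k), min(sample) / (1 - k)


def linear_estimates(sample, k):
    """MLE and the linear unbiased estimators built from the sample extremes."""
    n = len(sample)
    lo, hi = min(sample), max(sample)
    d = n + 1 + k * k * (n - 1)
    return {
        "MLE": hi / (1 + k),                                   # left endpoint L
        "RB": 0.5 * (lo + hi),                                 # midrange
        "m": (n + 1) * lo / (n + 1 - k * (n - 1)),             # rescaled minimum
        "M": (n + 1) * hi / (n + 1 + k * (n - 1)),             # rescaled maximum
        "LV": (n + 1) * ((1 - k) * lo + (1 + k) * hi) / (2.0 * d),
    }


def is_feasible(value, interval, tol=1e-12):
    """True when value lies in the closed interval, up to rounding tolerance."""
    low, high = interval
    return low - tol <= value <= high + tol


if __name__ == "__main__":
    sample, k = [0.52, 1.0, 1.48], 0.5     # hypothetical data, theta near 1
    interval = feasibility_interval(sample, k)
    print("feasibility interval:", interval)
    for name, value in linear_estimates(sample, k).items():
        print(f"{name:>3}: {value:.4f} feasible={is_feasible(value, interval)}")
```

On this sample the midrange and the MLE fall inside the interval, while the rescaled minimum, the rescaled maximum, and the minimum-variance combination fall outside it, which is the price paid for lower variance within the linear class.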

AN UNBIASED IMPROPER BAYES ESTIMATOR
In this section, we intend to complement the previous section by showing that, while the MLE leads to the inefficient $\hat\theta_M$, Bayes estimation leads to a slight improvement over $\hat\theta_{LV}$, so the latter is a reasonably good estimator and the requirement of unbiasedness does not take a heavy toll. However, as will be seen, $\hat\theta_{LV}$ is uniformly dominated by another unbiased estimator, one that is not a linear combination of $X_{(1)}$ and $X_{(n)}$.
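Letting ϑ stand for the Bayes-oriented random scale parameter and θ stand for its possible values, assume a smooth (improper) prior density λ(θ) on ϑ, proportional to $\theta^{-a}$. As far as the posterior density of ϑ is concerned, the sample enters only through its support, the feasibility interval

$$[L, H] = \left[\frac{X_{(n)}}{1+k},\ \frac{X_{(1)}}{1-k}\right],$$

on which the posterior density is proportional to $\theta^{-(n+a)}$. The Bayes estimators under this conjugate bounded Pareto family are homogeneous of order 1, that is, they share with $X_1$, $\bar X_n$, $\hat\theta_{RB}$, and $\hat\theta_{LV}$ their "respect" for the scale-parameter nature of θ: the ratio $\hat\theta/\theta$ is a pivotal quantity, whose distribution may depend on n and k but not on θ. Accordingly, their bias and standard deviation are constant multiples of θ. We show in the sequel that for the case a = 2, the Bayes estimator (henceforth $\hat\theta^{(2)}_{BAYES} = \hat\theta_{BAYES}$ for short) is unbiased, simultaneously for all n ≥ 2 and k ∈ (0, 1). Under squared-error loss the Bayes estimator is the posterior mean, which for a = 2 is

$$\hat\theta_{BAYES} = \frac{n+1}{n}\cdot\frac{L^{-n} - H^{-n}}{L^{-(n+1)} - H^{-(n+1)}}.$$

The joint density of $X_{(1)}$ and $X_{(n)}$ is

$$f_\theta(x, y) = \frac{n(n-1)\,(y-x)^{n-2}}{(2k\theta)^n}, \qquad (1-k)\theta \le x \le y \le (1+k)\theta.$$

The expectation of $\hat\theta_{BAYES}$ can be evaluated for θ = 1 as follows:

$$E[\hat\theta_{BAYES}] = \int_{1-k}^{1+k}\int_{x}^{1+k} \hat\theta_{BAYES}(x, y)\, f_1(x, y)\, dy\, dx.$$

The proof that $E[\hat\theta_{BAYES}] \equiv 1$ appears in the Appendix.

Figure 1 displays the empirical distribution of the various unbiased estimators of θ for sample size n = 5 from the scale-uniform distribution with θ = 1 and k = 0.9, 0.5, 0.1.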
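The unbiasedness claim for a = 2 can be probed by simulation. The sketch below assumes that the Bayes estimator is the posterior mean (squared-error loss) under the improper prior proportional to $\theta^{-2}$, with posterior proportional to $\theta^{-(n+2)}$ on $[L, H]$; the closed form in the code follows from that assumption and is not quoted from the article, and the function names are hypothetical.

```python
import math
import random


def bayes_estimate(sample, k):
    """Posterior mean of theta under the prior theta**-2 (assumed form)."""
    n = len(sample)
    low = max(sample) / (1 + k)       # L, left end of the feasibility interval
    high = min(sample) / (1 - k)      # H, right end of the feasibility interval
    rho = high / low
    if rho <= 1.0:                    # degenerate interval: posterior is a point mass
        return low
    exponent = (n + 1) * math.log(rho)
    if exponent > 700.0:              # rho**(n + 1) would overflow; correction vanishes
        shrink = 0.0
    else:
        shrink = (rho - 1.0) / math.expm1(exponent)
    # Equivalent to (n+1)/n * (L**-n - H**-n) / (L**-(n+1) - H**-(n+1)).
    return (n + 1) / n * low * (1.0 - shrink)


def mean_bayes(n, k, theta=1.0, reps=40000, seed=0):
    """Monte Carlo mean of the estimator under the scale-uniform model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.uniform((1 - k) * theta, (1 + k) * theta) for _ in range(n)]
        total += bayes_estimate(sample, k)
    return total / reps


if __name__ == "__main__":
    for n, k in [(2, 0.5), (5, 0.9), (5, 0.1)]:
        print(f"n={n} k={k}: simulated mean = {mean_bayes(n, k):.4f}")
```

The ratio form with `math.expm1` is used instead of the difference of powers because $L$ and $H$ are often close, in which case the direct formula loses precision. Simulated means near 1 for several (n, k) pairs are consistent with unbiasedness but do not replace the proof.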
The three panels show that $\hat\theta_{RB}$ (solid gray line) is symmetrically distributed around 1, with $\hat\theta_m$ (two-dash gray line) skewed to the right and $\hat\theta_M$ (long-dash gray line) skewed to the left. Both $\hat\theta_{LV}$ (solid line) and $\hat\theta_{BAYES}$ (thin dashed line) overlap $\hat\theta_M$ as k ↑ 1 (leftmost panel), and approach $\hat\theta_{RB}$ as k ↓ 0 (rightmost panel).
Since generalized Bayes rules need not be admissible, we cannot state from general principles that $\hat\theta_{BAYES}$ has lower variance than $\hat\theta_{LV}$. However, Figure 2 provides evidence that it does. For θ = 1, all k that are multiples of 1/50, and sample sizes n = 3, 10, 25, 100, Figure 2 displays the ratio of the empirical variance of $\hat\theta_{BAYES}$ (on $10^8$ runs each, using the R software (R Core Team 2014)) to the theoretical variance of $\hat\theta_{LV}$ (16). These graphs are U-shaped, with value 1 toward the endpoints. The larger n is, the stronger the improvement of $\hat\theta_{BAYES}$ over $\hat\theta_{LV}$, but seemingly never by as much as 10%.

DISCUSSION
The concepts of minimal sufficiency and completeness of parametric families, at the basis of the theory of statistics, allow for the development of "optimal" statistical methods under proper sufficient conditions. The above example serves to point out the difficulty of exhibiting a "uniformly best estimator" in some settings. The purpose in presenting it is primarily pedagogical.
This article introduced the scale-uniform family of distributions $U((1-k)\theta, (1+k)\theta)$ (with unknown mean θ > 0 and a known design parameter k ∈ (0, 1)). This family helps to illustrate the limitations of the Rao-Blackwell improvement when using a sufficient statistic that is minimal but not complete. It also serves to show that the maximum likelihood estimator may be inefficient for finite samples as well as asymptotically. Cases with inefficient MLE for finite samples are easily available: when estimating λ for the exponential distribution, the unbiased estimator $\frac{n-1}{n}\cdot\frac{1}{\bar X}$ has lower variance than the MLE $\frac{1}{\bar X}$. However, an asymptotically inefficient MLE is harder to come by.
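The exponential comparison can be made exact. The sketch below assumes n iid observations with rate λ, so that their sum has a gamma distribution and the inverse of the sum has known first and second moments for n > 2; the function name is hypothetical.

```python
def rate_estimator_moments(n, lam=1.0):
    """Exact bias and variance of two estimators of an exponential rate.

    With S the sum of n iid Exp(rate=lam) variables,
    E[1/S] = lam / (n - 1) and Var[1/S] = lam**2 / ((n - 1)**2 * (n - 2)).
    """
    if n <= 2:
        raise ValueError("the variance of 1/S is finite only for n > 2")
    var_inv_sum = lam ** 2 / ((n - 1) ** 2 * (n - 2))
    return {
        "mle_bias": lam / (n - 1),                   # E[n / S] - lam
        "mle_var": n ** 2 * var_inv_sum,             # Var[n / S]
        "unbiased_var": (n - 1) ** 2 * var_inv_sum,  # Var[(n - 1) / S]
    }


if __name__ == "__main__":
    for n in (3, 5, 20, 100):
        m = rate_estimator_moments(n, lam=2.0)
        ratio = m["unbiased_var"] / m["mle_var"]
        print(f"n={n}: variance ratio unbiased/MLE = {ratio:.4f}")
```

The variance ratio equals $((n-1)/n)^2$, which is below 1 for every finite n and tends to 1 as n grows, so this inefficiency is a finite-sample effect only.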
The impressive progress in statistics in the last few decades has dimmed the role played by minimal sufficiency and completeness of parametric families in the development of statistical tools. Various nonparametric methods ranging from bootstrap and permutation tests to random forests and deep learning have emerged, now playing prominent roles in statistical practice. Nevertheless, the need for parametric methods continues to surface in this new era. For example, new online machine learning methods are used in cases where the data become available in a sequential fashion. Since these methods need to be scalable for petabytes of information, data compression is essential (such as used by Google, Facebook, etc.). Another example is the use of map-reduce techniques that rely on summary statistics, either for efficiency in distributed computing, or for privacy-preserving properties (e.g., in hospital data, as is currently done in the medical informatics platform of the European Human Brain Project). A last example is modern health care, where the hope is to discover personalized medicines that can best treat the conditions of a specific individual. Such specificity is usually determined through inference, based on finding and analyzing a relatively small homogeneous subsample of patients. In all three examples, the methods used often rely on minimal sufficient statistics, and whether this statistic is complete or not has consequences for the "optimal" properties of such methods. Hence, we would argue that while traditional statistical methods may have less of a need for old-school parametric assumptions, this need continues to resurface as we face new challenges in the modern era of massive, online, distributed, private, and personalized data.

Since this conditional expectation is linear, its slope $1/n$ coincides with the slope of the linear regression, which, for dependent and independent variables with equal variances, is equal to their correlation coefficient.

A.2 Proof thatθ BAYES is Unbiased
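Change variables from $(x, y)$, the values of $(X_{(1)}, X_{(n)})$, to $u = y - x$ and $v = x/y$, so that $x = uv/(1-v)$ and $y = u/(1-v)$, and evaluate the Jacobian as $\left|\det \frac{\partial(x, y)}{\partial(u, v)}\right| = \frac{u}{(1-v)^2}$. The integrand of the double integral in (19) is transformed into a function in which the new variable $u$ appears only multiplicatively as $u^n$, and the inner integration over $u$ can be easily performed once the endpoints of $u$ are identified, for fixed $v$. For this purpose, $u$ is represented in terms of $v$ and $x$ as $u = \frac{1-v}{v}\,x$. Then its lower endpoint $u_* = (1-k)\frac{1-v}{v}$ is obtained by the substitution $x = 1-k$, while its upper endpoint $u^* = (1+k)(1-v)$ is obtained from the equation resulting from the substitution $x = 1+k-u$. The inner integral is

$$\int_{u_*}^{u^*} u^n\, du = \frac{(u^*)^{n+1} - (u_*)^{n+1}}{n+1}.$$

The remaining integration over $v$ relies on a tabulated integral formula, with integration variable $x$ and parameters $\mu$, $\nu$, and $a$, expressed through $B$, where $B$ is the Beta function $B(\mu, \nu) = \int_0^1 t^{\mu-1}(1-t)^{\nu-1}\, dt$. Substituting $x = w$, $\mu = 2$, $\nu = n-1$, $a = \frac{2k}{1-k}$ (after its inverse is taken out of the integral), identity (A.5) follows.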
The above formula can also be accessed through WolframAlpha.com (Wolfram Alpha LLC 2014).