A new estimator and approach for estimating the subpopulation parameters

Based on theoretical results, we recommend a new algorithm for estimating the total and mean of a subpopulation variable when the subpopulation size is known. It differs from the algorithm recommended by most sampling textbooks, which usually recommend multiplying the subpopulation sample mean by the subpopulation size rather than using the subpopulation total estimator for an unknown subpopulation size. We present a criterion to determine which estimator is more efficient. The criterion shows that the traditional subpopulation total estimator for an unknown subpopulation size is more efficient if the subpopulation mean is close to zero. Using an innovative procedure, we develop a new estimator and study its properties using real data. The new estimator is potentially an appropriate direct estimator in a composite estimator for small area estimation.


Introduction
Sampling methods are an essential part of many research studies on a population [1][2][3][4]. However, sometimes a division of the population is of interest to the researcher. Subpopulation is a term used in the survey sampling literature to denote partitions or divisions of population units for which a separate estimation is carried out for an underlying variable associated with each unit (e.g. income per person). Estimators are required for a variety of subpopulations, which should influence the choice of design and estimator [5]. During the last two decades, metapopulation data sets have become more common, so that methods of analysis for the associated subpopulations are frequently needed. Software developers are adding options for this situation [6,7]. Some examples of specific subpopulations are females, high school graduates, Hispanics, public school students, cannabis users or biomarker-positive individuals [8,9], and these motivate researchers to develop more efficient designs and estimators for subpopulations. For instance, Rhone et al. [10] provided estimates of distance to the nearest and the third-nearest food store by age, race, ethnicity, income, vehicle access and participation in the United States Department of Agriculture's Supplemental Nutrition Assistance Program for both individuals and households. Kimani et al. [11] estimated parameters after subpopulation selection in adaptive seamless trials. Salehi and Chang [12] proposed an unbiased subpopulation total estimator under inverse sampling when the subpopulation size is unknown.
When the implemented sampling design is Simple Random Sampling Without Replacement (SRSWOR), two situations arise depending on whether the number of units in the subpopulation is known or unknown. When the subpopulation size is unknown, the subpopulation sample mean is an unbiased estimator of the subpopulation mean, and the subpopulation total is unbiasedly estimated by another estimator. When the subpopulation size is known, many sampling textbooks recommend using the subpopulation sample mean multiplied by the subpopulation size as an estimator of the subpopulation total (say estimator 1) rather than the subpopulation total estimator for the unknown-size case (say estimator 2). Most sampling textbooks note that using estimator 1 reduces the variance (e.g. [13][14][15]). In this paper, we show that this claim is not always correct. We then present a criterion based on sample data that determines whether estimator 1 actually reduces the variance. The criterion tells us when estimator 1 can be more efficient than estimator 2. Based on these results, we present an algorithm for estimating the subpopulation mean and total that is different from the algorithm recommended in many sampling books.
When the subpopulation size is known and the implemented sampling design is an unequal probability sampling design, Särndal et al. [16] also recommended an estimator that uses the known subpopulation size rather than the estimator for the unknown-size case. Using a simulation study, we show that the estimator for the unknown-size case can be more efficient than the estimator recommended by Särndal et al. [16], particularly for small sample sizes.
Using a transformation technique, we develop a new estimation procedure that provides new estimators of the subpopulation total and mean. These estimators can be more efficient than the two previous estimators for possibly any population and a wide range of sample sizes. Sample surveys designed to produce estimates for larger areas may also be used to estimate smaller area parameters. Small Area Estimation (SAE) is concerned with how to reliably estimate subpopulation parameters when some small subpopulations and small areas have small sample sizes. There is usually a design-based estimator, called the direct estimator, and a model-based estimator, called the synthetic estimator. Composite estimators are a weighted mean of the direct and synthetic estimators, with weights chosen to minimize the Mean Squared Error (MSE) of the estimators (for details, cf. Rao and Molina [17]). The newly introduced estimators of the mean and the total are more appropriate direct estimators for use in the composite estimator for SAE, as they are more efficient for smaller sample sizes.
In Section 2, we present the notation and estimators. In Section 3, we derive a criterion that reveals which estimator is more efficient. In Section 4, we present the new algorithm for estimating the subpopulation mean and total, and the results of a simulation study indicating that a similar algorithm can be used for unequal probability sampling. In Section 5, we introduce new estimators of the subpopulation total and mean.

Notation and estimators
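Consider a population of $N$ units with index set $U = \{1, 2, \ldots, N\}$ and variable of interest $y = \{y_1, y_2, \ldots, y_N\}$. Let $U_h$ be the index set of the units in subpopulation $h$, which has size $N_h$. A sample $s$ of size $n$ is selected from the whole population using SRSWOR, and $s_h = s \cap U_h$ denotes the sampled units that fall in subpopulation $h$. Let $\mu = (1/N)\sum_{i=1}^{N} y_i$ and $\tau = \sum_{i=1}^{N} y_i$ be the population mean and total, respectively. Suppose that we want to estimate the subpopulation mean and total

$$\mu_h = \frac{1}{N_h}\sum_{i \in U_h} y_i, \qquad \tau_h = \sum_{i \in U_h} y_i.$$

If $N_h$ is unknown,

$$\tilde\mu_h = \bar y_h = \frac{1}{n_h}\sum_{i \in s_h} y_i$$

is an estimator of $\mu_h$, where $n_h$ is the random sample size from $U_h$. Its variance and variance estimator, conditional on $n_h$, are, respectively,

$$V(\bar y_h \mid n_h) = \left(\frac{1}{n_h} - \frac{1}{N_h}\right)\sigma_h^2, \qquad \hat V(\bar y_h \mid n_h) = \left(\frac{1}{n_h} - \frac{1}{N_h}\right)s_h^2, \tag{2}$$

where $\sigma_h^2 = \frac{1}{N_h - 1}\sum_{i \in U_h}(y_i - \mu_h)^2$ and $s_h^2 = \frac{1}{n_h - 1}\sum_{i \in s_h}(y_i - \bar y_h)^2$ is the unbiased sample variance of subpopulation $h$. When $N_h$ is known, the textbook estimator of $\tau_h$ is $\tilde\tau_h = N_h \bar y_h$.

An unbiased estimator of $\tau_h$ that does not require $N_h$ is given by

$$\hat\tau_h = \frac{N}{n}\sum_{i \in s_h} y_i.$$

To derive its variance, $y_i^*$ is defined as follows:

$$y_i^* = \begin{cases} y_i, & i \in U_h, \\ 0, & \text{otherwise}. \end{cases} \tag{4}$$

The variance of $\hat\tau_h$ is given by

$$V(\hat\tau_h) = N^2\left(\frac{1}{n} - \frac{1}{N}\right)\sigma^{*2}, \tag{5}$$

where $\sigma^{*2} = \frac{1}{N-1}\sum_{i=1}^{N}(y_i^* - \tau_h/N)^2$. It is unbiasedly estimated by $\hat V(\hat\tau_h) = N^2(1/n - 1/N)s^{*2}$, where $s^{*2} = \frac{1}{n-1}\sum_{i \in s}(y_i^* - \bar y^*)^2$ and $\bar y^* = (1/n)\sum_{i \in s} y_i^*$.

The author of [15] showed that $V(\tilde\tau_h)$ is approximately smaller than $V(\hat\tau_h)$, namely that

$$\frac{V(\hat\tau_h)}{V(\tilde\tau_h)} \approx \frac{C_h^2 + (1 - N_h/N)}{C_h^2},$$

where $C_h = \sigma_h/\mu_h$ is the coefficient of variation of subpopulation $h$. Figure 1 presents the flowchart for estimating the subpopulation mean and total recommended by sampling textbooks. For the known $N_h$ case, we will show that $\hat\tau_h$ and $\hat\mu_h = \hat\tau_h/N_h$ can be more efficient than $\tilde\tau_h$ and $\tilde\mu_h$, respectively.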
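The estimators of this section can be computed from a single sample as in the following minimal sketch. It assumes SRSWOR, at least two sampled units in subpopulation h, and the conditional variance form for the textbook estimator; the function name and dictionary keys are hypothetical, chosen only for illustration.

```python
from statistics import mean, variance


def srs_subpop_estimates(sample_y, sample_in_h, N, N_h=None):
    """Subpopulation estimates from an SRSWOR sample (illustrative helper).

    sample_y    : y-values of the n sampled units
    sample_in_h : booleans, True when the sampled unit lies in subpopulation h
    N           : population size
    N_h         : subpopulation size, or None when it is unknown
    Needs at least two sampled units in subpopulation h.
    """
    n = len(sample_y)
    y_h = [float(y) for y, in_h in zip(sample_y, sample_in_h) if in_h]
    n_h = len(y_h)
    # y* equals y inside subpopulation h and 0 elsewhere
    y_star = [float(y) if in_h else 0.0 for y, in_h in zip(sample_y, sample_in_h)]

    ybar_h = mean(y_h)
    tau_hat = N / n * sum(y_h)  # total estimator that does not use N_h
    out = {
        "n_h": n_h,
        "ybar_h": ybar_h,
        "tau_hat": tau_hat,
        "var_tau_hat": N**2 * (1 / n - 1 / N) * variance(y_star),
    }
    if N_h is not None:
        # textbook estimator and its conditional variance estimate
        out["tau_tilde"] = N_h * ybar_h
        out["var_tau_tilde"] = N_h**2 * (1 / n_h - 1 / N_h) * variance(y_h)
        out["mu_hat"] = tau_hat / N_h
    return out


est = srs_subpop_estimates([2, 4, 6, 8], [True, True, False, False], N=8, N_h=3)
print(est["tau_hat"], est["tau_tilde"])
```

The toy call uses made-up numbers; with real data the two total estimates and their variance estimates can be compared side by side.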

Criterion for comparing the estimators
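Since the comparison of the total estimators is quite similar to the comparison of the mean estimators, we focus only on the total estimators. We present a criterion to compare the variances of $\hat\tau_h$, the unbiased total estimator for the unknown-size case, and $\tilde\tau_h = N_h\bar y_h$. With $\sigma^{*2}$ denoting the population variance of the $y_i^*$ defined in (4), we first note that

$$(N-1)\sigma^{*2} = \sum_{i \in U_h}\left(y_i - \frac{\tau_h}{N}\right)^2 + (N - N_h)\left(\frac{\tau_h}{N}\right)^2.$$

Then, expanding the quadratic, we obtain after some algebra the following lemma.

Lemma:

$$\sigma^{*2} = \frac{1}{N-1}\left[(N_h - 1)\sigma_h^2 + N_h\left(1 - \frac{N_h}{N}\right)\mu_h^2\right].$$

We now derive the criterion. The subpopulation sample size $n_h$ has the hypergeometric distribution with mean $nN_h/N$ and variance

$$V(n_h) = n\,\frac{N_h}{N}\left(1 - \frac{N_h}{N}\right)\frac{N-n}{N-1}.$$

For $X = n_h$, we use a second-order Taylor series approximation for $g(X) = 1/X$ to approximate $E(1/n_h)$; the neglected third-order term is negative. Hence, by first conditioning on $n_h$ and then taking the expectation with respect to $n_h$, we obtain $V(\tilde\tau_h)$ from (2). By the Lemma and Equation (5), we obtain $V(\hat\tau_h)$ in terms of $\sigma_h^2$ and $\mu_h^2$, together with an upper bound for it. Comparing the two variances gives an expression (11) for their difference whose first right-hand term is positive. Therefore, if

$$\mu_h^2 - \frac{N}{nN_h}\sigma_h^2 < 0 \tag{12}$$

holds, the efficiency of $\hat\tau_h$ is greater than that of $\tilde\tau_h$. We may rewrite criterion (12) as

$$E(n_h) = \frac{nN_h}{N} < \frac{\sigma_h^2}{\mu_h^2}. \tag{13}$$

Criterion (13) tells us that $\hat\tau_h$ is the preferable estimator if (1) the subpopulation mean is close to zero, and/or (2) the subpopulation variance is large, and/or (3) the expected sample size from subpopulation $h$ is small.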
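The comparison behind criterion (12) can be traced with a few lines of algebra. The following is a sketch under stated assumptions and not necessarily the authors' exact steps: $\hat\tau_h = (N/n)\sum_{i \in s_h} y_i$ denotes the total estimator for the unknown-size case, $\tilde\tau_h = N_h\bar y_h$ the textbook estimator, the second-order Taylor approximation $E(1/n_h) \approx 1/E(n_h) + V(n_h)/[E(n_h)]^3$ is treated as exact, and the constant $K$ is a symbol introduced here only for brevity.

```latex
\begin{align*}
V(\tilde\tau_h) &= N_h^2\,\sigma_h^2\left[E\!\left(\frac{1}{n_h}\right) - \frac{1}{N_h}\right]
  \approx \frac{N-n}{n}\,\sigma_h^2\left[N_h + \frac{N^2}{n(N-1)}\left(1 - \frac{N_h}{N}\right)\right], \\
V(\hat\tau_h) &= \frac{N(N-n)}{n(N-1)}\left[(N_h - 1)\sigma_h^2 + N_h\left(1 - \frac{N_h}{N}\right)\mu_h^2\right], \\
V(\tilde\tau_h) - V(\hat\tau_h) &\approx K\left[\frac{\sigma_h^2}{N_h} - \left(\mu_h^2 - \frac{N}{nN_h}\sigma_h^2\right)\right],
  \qquad K = \frac{N N_h (N-n)}{n(N-1)}\left(1 - \frac{N_h}{N}\right) > 0 .
\end{align*}
```

Because $K\sigma_h^2/N_h$ is positive, a negative value of $\mu_h^2 - \frac{N}{nN_h}\sigma_h^2$ is sufficient, but not necessary, for $V(\hat\tau_h) < V(\tilde\tau_h)$ under this approximation. A direct comparison of the variances can therefore favor $\hat\tau_h$ over a somewhat wider range of $n$ than the criterion alone indicates.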
However, criterion (12) depends on subpopulation parameters that are not available. We may plug unbiased estimators of $\mu_h^2$ and $\sigma_h^2$ into criterion (12) to get a new criterion that depends on the sample only. Conditional on $n_h$, we use $\bar y_h^2 - (1/n_h - 1/N_h)s_h^2$ and $s_h^2$ for estimating $\mu_h^2$ and $\sigma_h^2$, respectively, where $\bar y_h$ and $s_h^2$ are the sample mean and the unbiased sample variance of subpopulation $h$. Instead of (12), we may then use the following criterion:

$$\bar y_h^2 - \left(\frac{1}{n_h} - \frac{1}{N_h}\right)s_h^2 - \frac{N}{nN_h}s_h^2 < 0.$$

Using (8) and some algebraic simplifications, this leads to the criterion (15) that we use in practice. Thus, if (15) holds for a selected sample, we use $\hat\tau_h$, the total estimator for the unknown-size case, to estimate $\tau_h$; otherwise, $\tilde\tau_h = N_h\bar y_h$ is used. Note that to decide between $\hat\mu_h$ and $\tilde\mu_h$, we use exactly the same criteria (15) and (12).
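A sample-based check of this kind can be sketched as follows. This is an assumption-laden illustration: it plugs the estimates directly into criterion (12) and is not the simplified criterion (15) of the paper, whose exact form is not reproduced here. The function name is hypothetical.

```python
from statistics import mean, variance


def plugin_criterion(y_h, n, N, N_h):
    """Return True when the plug-in version of criterion (12) holds.

    y_h : sampled y-values that fall in subpopulation h (at least two)
    n   : total sample size, N: population size, N_h: subpopulation size
    Assumption: direct plug-in of estimates, not the paper's criterion (15).
    """
    n_h = len(y_h)
    s2_h = variance(y_h)  # unbiased estimate of sigma_h^2
    # estimate of mu_h^2: squared sample mean minus its estimated variance
    mu2_hat = mean(y_h) ** 2 - (1 / n_h - 1 / N_h) * s2_h
    return mu2_hat - N / (n * N_h) * s2_h < 0


# Values scattered around zero favor the unknown-size total estimator
print(plugin_criterion([-1.0, 1.0, -2.0, 2.0], n=20, N=200, N_h=50))
# A mean far from zero with little spread favors the textbook estimator
print(plugin_criterion([100.0, 101.0, 99.0, 100.0], n=20, N=200, N_h=50))
```

The two calls use made-up samples that sit at the two extremes described by the criterion.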

New algorithm
After we have the observations for a selected sample and $N_h$ is known, if criterion (15) is satisfied, we estimate $\tau_h$ and $\mu_h$ by $\hat\tau_h$, the unbiased total estimator for the unknown-size case, and $\hat\mu_h = \hat\tau_h/N_h$, respectively. Otherwise, we estimate them by the textbook estimators $\tilde\tau_h = N_h\bar y_h$ and $\tilde\mu_h = \bar y_h$. If $N_h$ is unknown, we estimate $\tau_h$ and $\mu_h$ by $\hat\tau_h$ and $\bar y_h$, respectively. Figure 2 presents a new teaching flowchart for estimating the subpopulation mean and total.

This means that there is not only no gain in efficiency due to knowing $N_h$, but also the estimated loss in efficiency will be 17%. Furthermore, we suggest using $\hat\mu_h$ rather than $\tilde\mu_h$ in such a case; see Figure 2.
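The decision rule of the algorithm can be written as a small function. This is a minimal sketch: the outcome of the sample criterion is passed in as a boolean (`criterion_holds`, a hypothetical parameter name) rather than computed, and the returned labels are illustrative only.

```python
from statistics import mean


def estimate_subpopulation(y_h, n, N, N_h=None, criterion_holds=False):
    """Choose the estimators of mu_h and tau_h following the new algorithm.

    y_h             : sampled y-values that fall in subpopulation h
    n, N            : sample size and population size
    N_h             : subpopulation size, or None when unknown
    criterion_holds : result of the sample criterion, supplied by the caller
    """
    ybar_h = mean(y_h)
    tau_hat = N / n * sum(y_h)  # total estimator that does not use N_h

    if N_h is None:
        return {"mu": ybar_h, "tau": tau_hat, "rule": "N_h unknown"}
    if criterion_holds:
        return {"mu": tau_hat / N_h, "tau": tau_hat, "rule": "unknown-size estimators"}
    return {"mu": ybar_h, "tau": N_h * ybar_h, "rule": "textbook estimators"}


print(estimate_subpopulation([2.0, 4.0], n=4, N=8))
print(estimate_subpopulation([2.0, 4.0], n=4, N=8, N_h=3, criterion_holds=True))
print(estimate_subpopulation([2.0, 4.0], n=4, N=8, N_h=3, criterion_holds=False))
```

Keeping the criterion outside the function makes it easy to swap in whichever sample-based rule is adopted.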
As we mentioned in Section 3, $\hat\mu_h$ and $\hat\tau_h$, the estimators for the unknown-size case, are more efficient than their textbook counterparts $\tilde\mu_h$ and $\tilde\tau_h$ when the expected sample size from subpopulation $h$ is small. We now discuss how small the sample size $n$ should be for $V(\hat\tau_h)$ to become smaller than $V(\tilde\tau_h)$ in this example. We can either use criterion (12) or compare the variances directly. In Figure 3(a), we draw the graph of $\mu_h^2 - \frac{N}{nN_h}\sigma_h^2$ against the sample size $n$. When the graph is negative, $V(\hat\tau_h)$ is smaller than $V(\tilde\tau_h)$. According to criterion (12), the graph shows that the variance of $\hat\tau_h$ is smaller than the variance of $\tilde\tau_h$ for $n \le 362$. In Figure 3(b), we draw the graph of $V(\hat\tau_h) - V(\tilde\tau_h)$ against the sample size $n$. This graph shows that the variance of $\hat\tau_h$ is smaller than the variance of $\tilde\tau_h$ for $n \le 414$. The example shows that the inequality in criterion (12) can become even sharper.
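The sample-size range implied by criterion (12) has a closed form, since $\mu_h^2 - \frac{N}{nN_h}\sigma_h^2 < 0$ is equivalent to $n < N\sigma_h^2/(N_h\mu_h^2)$. The sketch below evaluates it; the function names and the numbers in the example call are hypothetical and are not the data of the example above.

```python
import math


def criterion_12(mu_h, sigma2_h, N, N_h, n):
    """True when mu_h^2 - N / (n * N_h) * sigma_h^2 is negative."""
    return mu_h**2 - N / (n * N_h) * sigma2_h < 0


def largest_n_under_criterion_12(mu_h, sigma2_h, N, N_h):
    """Largest sample size n (at most N) for which criterion (12) holds."""
    if mu_h == 0:
        return N  # the criterion holds for every n when the mean is zero
    bound = N * sigma2_h / (N_h * mu_h**2)  # criterion holds iff n < bound
    return min(N, math.ceil(bound) - 1)


# Hypothetical subpopulation: mean 2, variance 100, N = 1000, N_h = 100
n_max = largest_n_under_criterion_12(mu_h=2.0, sigma2_h=100.0, N=1000, N_h=100)
print(n_max)
```

A small mean relative to the standard deviation pushes the bound up, which is the situation in which the unknown-size estimator is preferred.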
From (12), $V(\hat\tau_h)$, the variance of the total estimator for the unknown-size case, is smaller than $V(\tilde\tau_h)$, the variance of the textbook estimator, for a reasonable sample size when the subpopulation mean is close to zero, as in the above example.

To show that these results can be extended to unequal probability sampling, we ran a simulation study on the United States Census of Agriculture data, where $n$ counties were selected with replacement with probability proportional to their acreage devoted to farming in 1982. We wish to estimate the change in total acreage devoted to farming from 1987 to 1992 in the western region. For unknown $N_h$, Särndal et al. [16, p. 390] recommended using

$$\hat\tau_{h\pi} = \sum_{i=1}^{\nu}\frac{y_i^*}{\pi_i},$$

where $\nu$ is the number of distinct counties in the sample, $y_i^*$ is the same as in (4) in Section 2, $\pi_i = 1 - (1 - p_i)^n$, and $p_i$ is the acreage devoted to farming in the $i$th county divided by the total acreage devoted to farming in the USA in 1982. For known $N_h$, Särndal et al. [16] recommended using

$$\tilde\tau_{h\pi} = N_h\,\frac{\sum_{i=1}^{\nu_h} y_i/\pi_i}{\tilde N_h},$$

where $\nu_h$ is the number of distinct counties selected from subpopulation $h$, and $\tilde N_h = \sum_{i=1}^{\nu_h} 1/\pi_i$. In the simulation study, we computed the relative efficiency of $\hat\tau_{h\pi}$ with respect to $\tilde\tau_{h\pi}$, defined so that values greater than 1 indicate that $\hat\tau_{h\pi}$ is more efficient. The relative efficiencies were simulated for $n$ from 50 to 2000 and are given in Figure 4. All relative efficiencies are greater than 1, and $\hat\tau_{h\pi}$ is more efficient particularly for small sample sizes. Thus, if we know $N_h$, using it in the estimator does not necessarily lead to a more efficient estimator. In the next section, we introduce a procedure that makes the variance of the estimator that does not rely on the textbook form smaller than $V(\tilde\tau_h)$ for any population.
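The two unequal probability estimators compared in the simulation can be computed from one with-replacement sample as sketched below. This is a minimal illustration with made-up inputs; the function name and argument layout are hypothetical, and the draw probabilities `p` are assumed to sum to one over the population.

```python
def pps_subpop_estimates(draws, y, p, in_h, N_h):
    """Subpopulation total estimates from n draws made with replacement.

    draws : list of n drawn unit indices (repeats allowed)
    y     : y-values for all units, indexed by unit
    p     : single-draw selection probabilities, indexed by unit
    in_h  : booleans, True when the unit lies in subpopulation h
    N_h   : known subpopulation size
    Returns (estimate without N_h, estimate using N_h).
    """
    n = len(draws)
    distinct = sorted(set(draws))
    # inclusion probability of each distinct unit under n draws
    pi = {i: 1 - (1 - p[i]) ** n for i in distinct}

    in_sample_h = [i for i in distinct if in_h[i]]
    tau_hat_pi = sum(y[i] / pi[i] for i in in_sample_h)
    n_tilde_h = sum(1 / pi[i] for i in in_sample_h)
    # ratio-type estimator that uses the known subpopulation size
    tau_tilde_pi = N_h * tau_hat_pi / n_tilde_h if n_tilde_h > 0 else float("nan")
    return tau_hat_pi, tau_tilde_pi


y = [10.0, 20.0, 30.0, 40.0]
p = [0.1, 0.2, 0.3, 0.4]
in_h = [True, False, True, False]
ht, ratio = pps_subpop_estimates([0, 2, 2, 3], y, p, in_h, N_h=2)
print(ht, ratio)
```

Repeating such a computation over many simulated samples and comparing the mean squared errors of the two returned values would give a relative efficiency of the kind reported in Figure 4.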

New estimators of subpopulation total and mean
Unfortunately, criterion (12) does not hold for a wide range of populations. We now introduce an estimation procedure that makes criterion (12) hold for any population with a reasonable sample size range. This estimation procedure leads to new estimators for the subpopulation total and subpopulation mean. The procedure is as follows:

(1) We first need a reasonable guess at the subpopulation mean. This guess can be a rough estimate from a similar survey, a previous survey or a pilot survey, the mean of an auxiliary variable, or can be based purely on an expert's opinion. Let this guess be $c$.
(2) We then define the transformed variable $z_i = y_i - c$, for $i \in U$.
(3) We define $z_i^*$ as follows:

$$z_i^* = \begin{cases} z_i, & i \in U_h, \\ 0, & \text{otherwise}. \end{cases}$$

An unbiased estimator of $\tau_{zh} = \sum_{i \in U_h} z_i$ is given by

$$\hat\tau_{zh} = \frac{N}{n}\sum_{i \in s_h} z_i,$$

where $s_h$ is the set of sampled units that fall in subpopulation $h$. The variance of $\hat\tau_{zh}$ is given by

$$V(\hat\tau_{zh}) = N^2\left(\frac{1}{n} - \frac{1}{N}\right)\sigma_z^{*2},$$

where $\sigma_z^{*2} = \frac{1}{N-1}\sum_{i=1}^{N}(z_i^* - \tau_{zh}/N)^2$. It is unbiasedly estimated by $\hat V(\hat\tau_{zh}) = N^2(1/n - 1/N)s_z^{*2}$, where $s_z^{*2}$ is the sample variance of the $z_i^*$ over the $n$ sampled units. Thus, another estimator of $\tau_h$ is

$$\hat\tau_{ch} = \hat\tau_{zh} + N_h c. \tag{19}$$

This is just $\hat\tau_h = (N/n)\sum_{i \in s_h} y_i$ plus an adjustment factor of $(N_h - (Nn_h/n))c$. From (19), the variance of $\hat\tau_{ch}$ and its variance estimator are the same as those of the transformed variable $z$, namely

$$V(\hat\tau_{ch}) = N^2\left(\frac{1}{n} - \frac{1}{N}\right)\sigma_z^{*2}, \qquad \hat V(\hat\tau_{ch}) = N^2\left(\frac{1}{n} - \frac{1}{N}\right)s_z^{*2}. \tag{21}$$

To evaluate the introduced estimation procedure, we use the Census of Agriculture data. We wish to estimate the total acres devoted to farming in 1992 in the western states, where the sample sizes are n = 20, 30, 40, 50, 75, 100, 125, . . . , 500, 600, . . . , 1000, 1500, 2000, 2500, 3000, 3075. Let the guesses at the subpopulation mean be c = 560000, 600000, 640000, 680000, 720000, 735598.62, 760000, 800000. A very natural guess at the subpopulation mean in this example is 735598.62, which is the known subpopulation mean in 1987. The variances of the estimators are computed for each case using (5) and (21). We define the relative efficiency of $\hat\tau_{ch}$ with respect to the textbook estimator $\tilde\tau_h = N_h\bar y_h$ as the ratio of their variances, $V(\tilde\tau_h)/V(\hat\tau_{ch})$, as both estimators are unbiased. The relative efficiencies are computed and the results are given in Table 1. The mean acres devoted to farming in 1992 in the western states was $\mu_h = 723343.96$.
With a very rough guess of 560,000 acres, the new estimator $\hat\tau_{ch}$ can estimate the western states total better than the textbook estimator $\tilde\tau_h$ for sample sizes of less than or equal to 375. If the guess is 600,000, the relative efficiency is more than 1 for sample sizes of less than or equal to 1000. If the guess is between 640,000 and 740,000, $\hat\tau_{ch}$ is a more efficient estimator than $\tilde\tau_h$ for any sample size. The subpopulation mean in 1987 was 735598.62, and if we use it as $c$, the variance of $\hat\tau_{ch}$ would be smaller than the variance of $\tilde\tau_h$ for any sample size. We may even recommend the use of $\hat\mu_{ch} = \hat\tau_{ch}/N_h$ to estimate $\mu_h$, as it outperforms $\tilde\mu_h = \bar y_h$ when $n \le 1000$ for any $c$ between 640,000 and 800,000. Its variance and variance estimator are those of $\hat\tau_{ch}$ divided by $N_h^2$.
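The shifted estimator of this section can be sketched in a few lines. The code below is an illustration under SRSWOR with made-up numbers, not the Census of Agriculture computation; the function names are hypothetical. It computes the estimate as the transformed total plus $N_h c$ and evaluates the design variance from the full subpopulation values.

```python
def shifted_total_estimate(y_h_sample, n, N, N_h, c):
    """Estimate tau_h from z = y - c, then shift back by N_h * c."""
    z_total_hat = N / n * sum(y - c for y in y_h_sample)
    return z_total_hat + N_h * c


def shifted_total_variance(y_h_pop, N, n, c):
    """Design variance of the shifted estimator under SRSWOR.

    y_h_pop : y-values of all units in subpopulation h; units outside
              subpopulation h contribute z* = 0 and are handled through N.
    """
    z = [y - c for y in y_h_pop]
    # population variance of z*: sum over U_h of z^2 minus (sum z)^2 / N
    sigma2_z_star = (sum(v * v for v in z) - sum(z) ** 2 / N) / (N - 1)
    return N**2 * (1 / n - 1 / N) * sigma2_z_star


estimate = shifted_total_estimate([2.0, 4.0], n=4, N=8, N_h=3, c=3.0)
var_good_guess = shifted_total_variance([2.0, 4.0, 6.0], N=8, n=4, c=4.0)
var_no_shift = shifted_total_variance([2.0, 4.0, 6.0], N=8, n=4, c=0.0)
print(estimate, var_good_guess, var_no_shift)
```

With `c = 0` the variance function returns the variance of the unshifted total estimator, so the last two values show how a guess near the subpopulation mean reduces the variance in this toy population.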

Conclusion
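Despite the popular impression from the literature, we have shown that the estimators for the unknown-size case, $\hat\tau_h$ and $\hat\mu_h = \hat\tau_h/N_h$, can be more efficient than the textbook estimators $\tilde\tau_h = N_h\bar y_h$ and $\tilde\mu_h = \bar y_h$, respectively, when the subpopulation size is known. A criterion was developed that can be used to choose the more appropriate estimators.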
New estimators are introduced to estimate the subpopulation mean and total when the subpopulation size is known. For the new estimator, criterion (13) becomes $nN_h/N < \sigma_h^2/\mu_{zh}^2$, where $\mu_{zh} = \mu_h - c$ is very close to zero for an appropriate $c$. Furthermore, the new estimators are more efficient when $E(n_h)$ is small. This suggests that the new estimators are potentially appropriate direct estimators for use in composite estimators for Small Area Estimation.