Small area estimation using reduced rank regression models

Abstract Small area estimation techniques have got a lot of attention during the last decades due to their important applications in survey studies. Mixed linear models and reduced rank regression analysis are jointly used when considering small area estimation. Estimates of parameters are presented as well as prediction of random effects and unobserved area measurements.


Introduction
Small area estimation is an active research area in statistics with many applications. Today there exist several approaches to deal with small area problems. For recent contributions to the topic when considering discrete data (e.g. see Chandra et al. 2018), and for continuous data (see Baldermann 2018). In both articles many works of other authors are presented and discussed. For those who want to get a solid introduction to the subject the book (Rao and Molina 2015) can be recommended. Usually a common basis for small area estimation problems is that a survey study (finite population case) has been conducted and based on the survey one intends to extract information concerning small domains by adding specific local information which was not accounted for in the comprehensive survey.
In this article a case is considered where units of a survey sample are investigated several times, leading to the existence of dependent observations. It is assumed that observed background information exist which then is implemented via a linear model, giving a model which includes variance components. Moreover, it is assumed that latent processes exist which often show up when different types of systems are studied, for example when weather impact is studied. In this case often a huge number of variables are studied but we do not know the effects of the individual variables, for example we cannot directly relate temperature to plant growth. This leads us to exploit reduced rank regression models. Thus, the model which will be discussed in this article will include time dependent fix effects, i.e. structured mean responses, random parameters and rank restrictions on parameters, which to our knowledge is a new type of model. In Section 2 more motivating details are given, including basic references, and in Section 3 the assumptions are summarized together with all technical details. Briefly, in Section 4 the estimation of all unknown parameters is considered and finally two prediction results are presented in Section 5.

Background
In this section, via an example, we will introduce our way of thinking about extracting information from a survey sample. In particular, a number of different sources of uncertainty are presented which then will be included in a statistical model. It turns out that the model will comprise three different types of uncertainty. Section 3 includes all details of the model and in the subsequent section likelihood-based estimators are derived. In the last section these estimators are utelized when finding predictors.
Example 2.1. Let there be a national survey which collects data about some specific production from a certain type of companies. The survey is followed-up once per year and it is repeated, say, five years.
Altogether there exist 20 regions and each region consists of 10 subregions. The survey sample (units) consists of four companies from each subregion. Thus, in the survey, in total 800 companies are followed five years. With the models presented in this article the aim is to obtain information about the companies in the subregions. Note that in reality the companies should be classified into strata and then at least one company should be sampled from each stratum and each subregion. To simplify the presentation we will only have one stratum. However, in each subregion we assume that there is a lot of background information which we would like to take into account when predicting production in subregions or predicting production of unobserved units.
Small area models, also called small domain models, can be classified into two categories. One category is the area-level models category (Fay and Herriot 1979) and the other category is the unit-level models category (Battese et al. 1988). Both model categories are discussed in detail in a well-written book (Rao and Molina 2015). In this article we are thinking of area levels but if unit-level information would be available the proposed model can also be used.
Example 2.1 indicates the following use of notation. Let h ij ; i ¼ 1; ; :::; m; j ¼ 1; 2; :::; n i ; n ¼ X m i¼1 n i be the direct estimates from the survey for the jth subarea within the ith area. The estimates are based on the survey design where units have been sampled on the subarea level. Note that we will not discuss the underlying survey in this article, only use the outcome of the survey, i.e. the direct estimators of the subareas. Statistics is partly about quantifying uncertainty. Often this is carried out by assuming that an estimate is a representation of an estimator which follows some distribution. However, we also have cases where uncertainty is not described via distributions. Examples are rounding errors, specific types of truncations, identification of influential observations, outlier detection, model comparisons, the number of observations in a sample, etc. One choice of distribution for the estimatorĥ ij would be the one generated by the survey design, i.e. how many and under what model the units have been sampled would determine the distribution. However, for our purposes we have difficulties to work with the probability space generated by the survey design. Instead, we now think of the direct estimatesĥ ij being misspecified quantities since the survey sample distribution does not take into account those covariables we are going to use in the adjustment ofĥ ij : The covariables can be used to correct for bias as well as diminish the uncertainty in the estimate. For example, returning to our example, if the survey design does not take into account the size of companies small companies can be overrepresented in the sample and then h ij is wrongly estimated.
In order to derive an applicable model lets start with where for notational convenienceĥ ij has been replaced by y ij . The joint distribution for f ij g is specified via where ¼ ð 11 ; :::; 1;n 1 ; :::; m;n m Þ is a row vector, V is supposed to be known and determined by the survey design. Often V is solely a function of the number of sampled units. Moreover, is defined on another probability space than the one generated by the survey design. However, the uncertainty in the direct estimates, due to the survey, is incorporated in the dispersion of : Furthermore, one basic assumption in this article is that some of the h ij will be the same, i.e. the survey estimators are the same for clusters of subregions. To make such an assumption is often realistic but has to be checked in practise. Under this assumption, instead of (2.1), the model can be written in matrix notation where y ¼ ðy 11 ; :::; y 1;n 1 ; :::; y m;n m Þ; ðh 11 ; :::; h 1;n 1 ; :::; h m;n m Þ ¼ b 0 C; and b : k Â 1 consists of unknown parameters and C: k Â n is a usual design matrix describing the relations among the elements fh ij g: Usually C consists of blocks of one-vectors but in principle it can be arbitrary as long as it fits together with the covariate structures which will be presented later. For example, with three areas and four subareas in each area C equals 2). The model means that a probability space is only connected to the modeling of fĥ ij g; i.e. the set of estimated h ij , but we include in the model extra variation due to the survey design.
In the introductory example it was stated that the survey was repeated five times. When survey studies are repeated it will for this article be assumed that when there exists a linear trend it is of the same form for all subareas. In this case there will be a matrix Y where each row in the matrix follows Equation (2.3), besides an unknown scalar multiplier. Thus, Equation (2.3) implies where A is a design matrix which models p repeated surveys (within subarea design matrix), B is a matrix of unknown parameters and E$N p;n ð0; R; VÞ; i.e. E is matrix normally distributed with a separable dispersion structure D½E ¼ D½vecE ¼ V R; where stands for the Kronecker product and R is an unknown positive definite matrix. The parameter R is used to model the dependency due to the repeated measurement which exist because the observed units of the design have been repeatedly measured. No particular structure in R is assumed since it would only make sense if we would have precise knowledge about the mean structure, which is not the case in this article. For example, it does not make sense to assume an autocorrelated structure if the mean has not been established. The model in Equation (2.4) is often called Growth Curve model, GMANOVA and recently bilinear regression model (see von Rosen 2018, for basic results and references).
As an example of A we can have meaning that there is a second order trend which is observed over five years. When expressing h ij as a linear function of an unknown parameter vector b; as it was done in Equation (2.3), we usually cannot say that the model is completely true for all subareas. Instead we believe that there is some random variation around a true imaginary model, i.e. we introduce a variance component. Thus, instead of Equation (2.4), the following model is set up: where U$N p;l ð0; R u ; I l Þ is independent of E and Z is related to C because, in the model, usually via C, the observations are linked to the subareas and Z helps to describe the random variation (errors) in the subareas. Therefore, formally, CðZ 0 Þ CðC 0 Þ; where CðÞ denotes the column vector space. Additionally we will standardize these random effects by assuming ZV À1 Z 0 ¼ I l ; meaning that it has been adjusted for a different number of subareas within an area, taking into an account the uncertainty in the survey estimates. This has two important implications. One is that it simplifies the mathematical treatment of the model and the second is that it becomes easier to compare predicted random effects. Now observed covariables will be introduced in the model. There exist different types of covariables but we are not going to distinguish between different types. For example one type of covariables are those which are accounted for in the survey design (e.g. variables defining different strata) and another type can be covariables observed in registries and a third type can be baseline data which were available when a survey study with repeated measurements on units was initiated. The effect of the covariate will be modeled by a term B 1 C 1 where B 1 stands for the unknown effect and C 1 collects the observed covariables leading to the model Moreover, suppose that there are a number of latent variables. It can be large clusters of variables, e.g. weather variables, socio-econometric variables, chemical characteristics of soil and water, etc. A typical phenomena of these clusters are that one knows that there should be an effect on the survey estimates but it is difficult to express this via some functional relationship. Instead we will model the relationship via rank restrictions on parameter matrices and will consider the following model: where F: t Â n is known and W is unknown of rank rðWÞ ¼ r minðp; tÞ: The idea is that all clustered variables appear in F and these variables are governed by r unobserved latent variables (processes) which are incorporated in the model via the rank restrictions on W: It is important to note that p has to be larger than r meaning that modeling the number of underlying latent variables depends on how many times the survey has been repeated. Thus, we really see an advantage of performing repeated surveys when latent variables are supposed to exist, which indeed we believe is a common phenomena.

Detailed model specification
Let Y consist of the direct estimates of the subareas where each column corresponds to a subarea including p estimators from the p repeated measurements. When performing likelihood inference the direct estimates are used but in the presentation we will not distinguish between estimators and estimates.
Definition 3.1. Let where A: p Â q, C: k Â n, C 1 : k 1 Â n; F: t Â n, are all known matrices, CðF 0 Þ CðC 0 Þ; rðWÞ ¼ r<minðp; tÞ; Z: lÂ n, p l<n; ZV À1 Z 0 ¼ I l ; U$N p;l ð0; R u ; I l Þ; E$N p;n ð0; R e ; VÞ where U and E are independently distributed. The parameters R u and R e are supposed to be unknown and positive definite whereas V is known and determined from the survey. When in Definition 3.1 B 1 C 1 ¼ 0; WF ¼ 0 and UZ ¼ 0 then we have the classical growth curve model (GMANOVA) (see Potthoff and Roy 1964;von Rosen 2018). When WF ¼ 0 and UZ ¼ 0 then the growth curve model with covariate information appears which sometimes is identified as a mixture of the GMANOVA and MANOVA models. Some references to this model, as well as more general models (see Chinchilli and Elswick 1985;Verbyla and Venables 1988;von Rosen 1989;and Bai and Shi 2007). If B 1 C 1 ¼ 0 and WF ¼ 0 we say that we have the growth curve model with random effects (see Ip et al. 2007) whereas if only WF ¼ 0 holds (see Yokoyama and Fujikoshi 1992;Yokoyama 1995) where similar models are considered and where references to earlier works can be found. However, usually in these works one puts structures on the covariance matrices which will not be discussed in this article.

Estimation of parameters
Let Q be a matrix of basis vectors such that CðQÞ ¼ CðV À1=2 C 0 : V À1=2 C 0 1 : V À1=2 Z 0 Þ; where V 1=2 is a symmetric square root of V in E$N p;n ð0; R; VÞ: Further, let Q o be any matrix of full rank satisfying CðQ o Þ ¼ CðC 0 : C 0 1 : Z 0 Þ ? : We assume Q 0 Q ¼ I v ; where v ¼ rðC 0 : C 0 1 : Z 0 Þ; v > l. A one-one transformation of the model in Definition 3.1, using Q, yields (4.8) The identity in Equation (4.8) is postmultiplied by C leading to the model and this leads to that the model in Equation (4.10) will be split into two models. Let C ¼ ðC 1 : C 2 Þ : v Â l; v Â ðvÀlÞ: Then we have three models which will be used when finding estimators: (i) YV À1=2 QC 1 ¼ ABCV À1=2 QC 1 þ B 1 C 1 V À1=2 QC 1 þ WFV À1=2 QC 1 þUZV À1=2 QC 1 þ EV À1=2 QC 1 ; UZV À1=2 QC 1 $ N p;l ð0; R u ; I l Þ; EV À1=2 QC 1 $ N p;l ð0; R e ; I l Þ; (4.11) The idea is to utilize Equations (4.12) and (4.13) when estimating B, B 1 ; W and R e : To estimate R u these estimators are inserted in Equation (4.11). Suppose that the esti-matorsB;B 1 ;Ŵ andR e have been obtained, and let Under the assumption of no randomness inB;B 1 andŴ; Here it is assumed that the difference is positive definite. IfR u is not positive definite, which should occur with some positive probability, this indicates that data does not support the model presented in Definition 3.1 or the estimation procedure has some deficiencies. It can be noted that the problem with negative variance components in mixed linear models has a long history and an early reference is Nelder (1954). A more recent contribution to the discussion of the problem is provided by Molenberghs and Verbeke (2011). If we would start with the model in Equation (2.5) and only consider D½Y ¼ ZZ 0 R u þ V R e we could avoid to assume R u to be positive definite. However, in our model derivation, via U$N p;l ð0; R u ; I l Þ; R u has to be positive definite. Thus, if obtaining a non-positive definiteR u we have to either reformulate the model or change the estimator. One way to modifyR u is to work with the eigenvalues ofR u : If there are a few "small" (by absolute value) non-positive eigenvalues these eigenvalues can be replaced by small positive quantities. This should of course take place with caution and common sense but as far as we know there is no general recipe of how to handle a non-positive definiteR u : To estimateB;B 1 ;Ŵ andR e Equations (4.12) and (4.13) are merged: (4.14) whereẼ$N p;nÀl ð0; R e ; I nÀl Þ: Put Equation (4.14) is identical to which is an extended Growth Curve model (GMANOVA þ MANOVA) with a reduced rank regression component. Furthermore, the likelihood function which corresponds to the model in Equation (4.15) equals L B; B 1 ; W; R e ð Þ¼ 2p ð Þ À 1 2 p nÀl ð Þ jR e j À 1 2 nÀl ð Þ Â exp À 1 2 tr R À1 e X À ABD 1 À B 1 D 2 À WD 3 ð Þ ðÞ 0 È É ; (4.16) where we have used the convention that instead of ðHÞðHÞ 0 it is written ðHÞðÞ 0 for any arbitrary matrix expression H. Our first observation is that the likelihood in Equation (4.16) is smaller or equal to 2p ð Þ À 1 2 p nÀl ð Þ jR e j À 1 with equality if and only if which, under some full rank conditions on D 2 ; determines B 1 as a function of B and W: Thus, if B and W can be estimated the covariate effect described by B 1 D 2 can be estimated. The density in Equation (4.17) corresponds to the model which is a Growth Curve model with latent variables (reduced rank effect). The model has been considered (Reinsel and Velu 1998;von Rosen and von Rosen 2017), where also other references can be found. Now it is shortly described how B and W can be estimated. For notational conveniences the model in Equation (4.18) is written Note that CðD 0 3 Þ CðD 0 1 Þ because CðF 0 Þ CðC 0 Þ which is essential for being able to obtain explicit estimators.
Let The upper bound of the likelihood function in Equation (4.20) is well known (see Srivastava and Khatri 1979, Theorem 1.10.4): (4.21) and in Equation (4.21) equality holds if and only if ðnÀlÞR e ¼ W: Hence, if B and W are estimated then R e is also estimated. Maximizing the log-likelihood function is equivalent to minimizing the determinant jWj which now will take place. It is worth noting that it is used, since rðWÞ ¼ r; that the matrix W can be factored into two terms, i.e. W ¼ W 1 W 2 ; where W 1 and W 2 are of size p Â r and size r Â t, respectively. Depending on the knowledge about the model W 1 and W 2 can be interpreted but there are also cases where the factorization has no clear interpretation.
Following the approach presented in (Kollo and von Rosen 2005, Chapter 4 (4.23) and equality in Equation (4.22) holds if and only if ABD 1 ¼ P A;S 1 ðXPD0 1 ÀW 1 W 2D3 Þ: Throughout the article it is supposed that S 1 is positive definite which holds with probability 1. Thus, B as a function of W ¼ W 1 W 2 can be obtained. Moreover, (4.25) where (4.26) and the equality in Equation (4.25) holds if and only if Given W 1 ; the linear system of equations in Equation (4.27) is consistent and hence W 2 can always be estimated as a function of W 1 : (4.28) under the assumption that rðD 3 Þ ¼ k 1 and CðAÞ \ CðW 1 Þ ¼ f0g: The only parameter in the right-hand side of Equation (4.25) which is left to estimate is W 1 : The right-hand side of Equation (4.25) can be factored as with H : p Â rðAÞ is such that T 0 1 S À1 2 T 1 ¼ HH 0 ; M : ðpÀrðAÞÞ Â r is semi-orthogonal of rank r and U is a positive definite matrix. The square root in M is supposed to be symmetric. Due to the Poincar e separation theorem (see Rao 1979) where k 1 ! Á Á Á ! k pÀrðAÞ are the eigenvalues of U which do not depend on any unknown parameter. The lower bound is then attained if M consists of the corresponding eigenvectors v 1 ; :::; v r : LetM ¼ ðv 1 ; :::; v r Þ: Therefore, a maximum likelihood estimator of W 1 is found if we can find W 1 satisfying the following equality: (4.29) SinceM 0M ¼ I r ; the following choice of W 1 is appropriate: There remains an unresolved problem of presenting all solutions to Equation (4.29), but this is not considered here.
Together with Equation (4.28) this yields thatŴD 3 ¼Ŵ 1Ŵ : Now, the obtained results will be stated in the following theorem.
Theorem 4.1. Let the direct estimates of a survey follow the model given in Definition 3.1. Moreover, all matrices in the statements have earlier been defined in this section.

Prediction
Small area estimation is very often about prediction of unobserved units (subareas). For our model, due to normality assumptions, it is relatively straight forward to construct predictors, in particular if the explicit estimators of the previous section are utilized. The strategy will be to first predict U which will be based on the observed data and thereafter the unobserved units are predicted. For prediction of U the conditional mean E½UjY is crucial and thus the joint distribution of U and Y will be derived. According to Definition 3.1, U and E are independently distributed ðvecU; vecYÞ is normally distributed, i.e. where Thus, vec YÀABCÀB 1 C 1 ÀWF ð Þ and the next proposition can be stated.
Proposition 5.1. Let the model be given in Definition 3.1 and use the notation introduced earlier in Section 4.
i. The predicted valueÛ is given by whereR u ;R e ; ABC;B 1 C 1 andŴF are presented in Theorem 4.1 and Y consists of subarea estimates.
ii. Let Y o consist of the unobserved direct estimates, corresponding to the non-sampled units. The corresponding covariables are available which are denoted C o ; C 1o ; F o and Z o : Then where vecÛ is given in (i).

Funding
This research has been supported by The Swedish Foundation for Humanities and Social Sciences (P14-0641:1) and The Swedish Natural Research Council (2017-03003).