Asymptotic theory for maximum likelihood estimates in reduced-rank multivariate generalized linear models

ABSTRACT Reduced-rank regression is a dimensionality reduction method with many applications. The asymptotic theory for reduced rank estimators of parameter matrices in multivariate linear models has been studied extensively. In contrast, few theoretical results are available for reduced-rank multivariate generalized linear models. We develop M-estimation theory for concave criterion functions that are maximized over parameter spaces that are neither convex nor closed. These results are used to derive the consistency and asymptotic distribution of maximum likelihood estimators in reduced-rank multivariate generalized linear models, when the response and predictor vectors have a joint distribution. We illustrate our results in a real data classification problem with binary covariates.


Introduction
The multivariate multiple linear regression model for a q-dimensional response Y = (Y 1 , . . . , Y q ) T and a p-dimensional predictor vector X = (X 1 , . . . , X p ) T postulates that Y = BX + , where B is a q × p matrix and = ( 1 , . . . , q ) T is the error term, with E( ) = 0 and var( ) = . Based on a random sample (X 1 , Y 1 ), . . . , (X n , Y n ) satisfying the model, ordinary least squares estimates the parameter matrix B by minimizing the squared error loss function n i=1 Y i − BX i 2 to obtain Reduced-rank regression introduces a rank constraint on B, so that Equation (1) is minimized subject to the constraint rank(B) ≤ r, where r < min(p, q). The solution isB T RRR =B T ols U r U T r , where U r are the first r singular vectors ofŶ T =B ols X T (see, e.g. [1]).
Reduced-rank regression has attracted attention as a regularisation method by introducing a shrinkage penalty on B. Moreover, it is used as a dimensionality reduction method as it constructs latent factors in the predictor space that explain the variance of the responses.
Anderson [2] obtained the likelihood-ratio test of the hypothesis that the rank of B is a given number and derived the associated asymptotic theory under the assumption of normality of Y and non-stochastic X. Under the assumption of joint normality of Y and X, Izenman [3] obtained the asymptotic distribution of the estimated reduced-rank regression coefficient matrix and drew connections between reduced-rank regression, principal component analysis and correlation analysis. Specifically, principal component analysis coincides with reduced-rank regression when Y = X [4]. Recently, Fan et al. [5] studied the theoretical properties of nuclear norm regularized maximum likelihood type estimates in a class of models that includes reduced-rank regression. They derive statistical rates of convergence in possibly high-dimensional scenarios. However, they do not provide asymptotic results that can be used to construct confidence intervals or hypothesis tests. The monograph by Reinsel and Velu [1] contains a comprehensive survey of the theory and history of reduced rank regression, including in time series, and its many applications.
Despite the application potential of reduced-rank regression, it has received limited attention in generalized linear models. Yee and Hastie [6] were the first to introduce reduced-rank regression to the class of multivariate generalized linear models, which covers a wide range of data types for both the response and predictor vectors, including categorical data. Multivariate or vector generalized linear models (VGLMs) is the topic of Yee's [7] book, which is accompanied by the associated R packages, VGAM and VGAMdata. Yee and Hastie [6] proposed an alternating estimation algorithm, which was shown to result in the maximum likelihood estimate of the parameter matrix in reduced-rank multivariate generalized linear models by Bura et al. [8]. Asymptotic theory for the restricted rank maximum likelihood estimates of the parameter matrix in multivariate GLMs has not been developed yet.
In general, a maximum likelihood estimator is a concave M-estimator in the sense that it maximizes the empirical mean of a concave criterion function. Asymptotic theory for M-estimators defined through a concave function has received much attention. Huber [9], Haberman [10] and Niemiro [11] are among the classical references. More recently, Hjort and Pollard [12] presented a unified framework for the statistical theory of M-estimation for convex criterion functions that are minimized over open convex sets of a Euclidean space. Geyer [13] studied M-estimators restricted to a closed subset of a Euclidean space.
The rank restriction in reduced rank regression imposes constraints that have not been studied before in M-estimation as they result in neither convex nor closed parameter spaces. In this paper we (a) develop M-estimation theory for concave criterion functions, which are maximized over parameters spaces that are neither convex nor closed, and (b) apply the results from (a) to obtain asymptotic theory for reduced rank regression estimators in generalized linear models. Specifically, we derive the asymptotic distribution and properties of maximum likelihood estimators in reduced-rank multivariate generalized linear models where both the response and predictor vectors have a joint distribution. The asymptotic theory we develop covers reduced-rank regression for linear models as a special case. We show the improvement in inference the asymptotic theory offers via analysing the data set Yee and Hastie [6] analysed.
Throughout, for a function f : R q → R, ∇f (x) denotes the row vector ∇f (x) = (∂f (x)/∂x 1 , . . . , ∂f (x)/∂x q ),ḟ (x) stands for the column vector of derivatives, while ∇ 2 f (x) denotes the symmetric matrix of second-order derivatives. For a vector valued function f : R q 1 → R q 2 , ∇f denotes the q 2 × q 1 matrix,

M-estimators
Let Z be a random vector taking values in a measurable space Z and distributed according to the law P. We are interested in estimating a finite dimensional parameter ξ 0 = ξ 0 (P) using n independent and identically distributed copies Z 1 , Z 2 , . . . , Z n of Z. In the sequel, we use Pf to denote the mean of Let be a subset of a Euclidean space and m : Z → R be a known function. Assume that the parameter of interest ξ 0 is the maximizer of the map ξ → Pm ξ defined on . One can estimate ξ 0 by maximizing an empirical version of the optimization criterion. Specifically, given a known function where, here and throughout, P n denotes the empirical mean operator P n V = n −1 n 1=1 V i . Hereafter, M n (ξ ) and P n m ξ will be used interchangeably as the criterion function, depending on which of the two is appropriate in a given setting.
Assume that ξ 0 is the unique maximizer of the deterministic function M defined in Equation (2). An M-estimator for the criterion function M n over is defined aŝ If the maximum of the criterion function M n over is not attained but the supremum of M n over is finite, any valueξ n that almost maximizes the criterion function, in the sense that it satisfies for A n small, can be used instead.
Definition 2.1: An estimatorξ n that satisfies Equation (4) with A n = o p (1) is called a weak Mestimator for the criterion function M n over . When A n = o p (n −1 ),ξ n is called a strong M-estimator.
Proposition A.1 in the Appendix lists the conditions for the existence, uniqueness and strong consistency of an M-estimator, as defined in Equation (3), when M n is concave and the parameter space is convex. Under regularity conditions, as those stated in Theorem 5.23 in van der Vaart [14], the asymptotic expansion and distribution of a consistent strong M-estimator [see Definition 2.1] for ξ 0 is given by where and V ξ 0 is the non-singular symmetric second derivative matrix of M(ξ ) at ξ 0 .

Restricted M-estimators
We now consider the optimization of M n over res ⊂ , where res is the image of a function that is not necessarily injective. Specifically, we restrict the optimization problem to the set res by requiring:

Condition 2.2:
There exists an open set ⊂ R q and a map g : → such that ξ 0 ∈ g( ) = res .
Even when an M-estimator for the unrestricted problem as defined in (3) exists, there is no a priori guarantee that the supremum is attained when considering the restricted problem. Nevertheless, Lemma 2.3 establishes the existence of a restricted strong M-estimator, regardless of whether the original M-estimator is a real maximizer, weak or strong. All proofs are provided in the Appendix. We derive next the asymptotic distribution ofξ res n = g(θ n ), withθ n ∈ . The constrained estimatorξ res n is well defined under Lemma 2.3, even whenθ n is not uniquely determined. Ifθ n were unique and √ n(θ n − θ) had an asymptotic distribution, one could use a Taylor series expansion for g to derive asymptotic results forξ res n . Building on this idea, Condition 2.5 introduces a parametrization of a neighbourhood of ξ 0 that allows applying standard tools in order to obtain the asymptotic distribution of the restricted M-estimator. Under the setting in Condition 2.5, we will prove thatŝ n = h −1 (ξ res n ) is a strong M-estimator for the criterion function P n m h(s) over S. Then, we can apply Theorem 5.23 of van der Vaart [14] to obtain the asymptotic behaviour ofŝ n , which, combined with a Taylor expansion of h about s 0 , yield a linear expansion forξ res n . Finally, requiring Condition 2.6, which relates the parametrizations (S, h) and ( , g), suffices to derive the asymptotic distribution ofξ res n in terms of g. Condition 2.6 ensures that T ξ 0 = span∇g(θ 0 ) is well defined regardless of the fact that g −1 (ξ 0 ) may contain multiple θ 0 's. Moreover, T ξ 0 also agrees with span∇h(s 0 ). Consequently, the orthogonal projection ξ 0 ( ) onto T ξ 0 with respect to the inner product defined by a symmetric positive definite matrix satisfies where A † denotes a generalized inverse of the matrix A. The gradient of g is not necessarily of full rank, in contrast to the gradient of h. Note that ξ 0 ( ) is idempotent ( 2 ξ 0 ( ) = ξ 0 ( ) ) and the span of its columns is equal to T ξ 0 . However, ξ 0 ( ) is not in general symmetric, but rather self-adjoint with respect to the inner product induced by since < x, y > := y T x. Proposition 2.7: Assume that Conditions 2.2, 2.5 and 2.6 hold. Assume also that the unrestricted problem satisfies the regularity Conditions A.2-A.4 in the Appendix. Then, any strong M-estimatorξ res n of the restricted problem that converges in probability to ξ 0 satisfies where is the influence function of the unrestricted estimator defined in Equation (5), V ξ 0 is the non-singular symmetric second derivative matrix of M(ξ ) at ξ 0 , and ξ 0 (−V ξ 0 ) is defined according to Equation (6).
Moreover, √ n(ξ res n − ξ 0 ) is asymptotically normal with mean zero and asymptotic variance As an aside remark, we conjecture that for estimators that are maximizers of a criterion function under restrictions that satisfy Conditions 2.2-2.6, when Equation (5) is true, Equation (7) also holds. This can be important since the asymptotic distribution of restricted estimators will be derived directly from the asymptotic distribution of the unrestricted one.
The optimization problem that defines the restricted M-estimators considered in this paper is, in general, not convex and hence difficult to solve. However, in the case of maximum likelihood estimation in reduced rank multivariate generalized linear models, the main application of the results in our paper, efficient algorithms exist for solving it (see [6] and the VGAM R package).

Asymptotic theory for the maximum likelihood estimator in reduced rank multivariate generalized linear models
In this section we show that maximum likelihood estimators in reduced-rank multivariate generalized linear models are restricted strong M-estimators for the conditional log-likelihood. Using results in Section 2.1, we obtain the existence, consistency and asymptotic distribution of maximum likelihood estimators in reduced-rank multivariate generalized linear models. In practice, these estimators can be obtained using the R package VGAM, developed by Yee [15].

Exponential family
Let Y = (Y 1 , . . . , Y q ) T be a q-dimensional random vector and assume that its distribution belongs to a k-parameter canonical exponential family with pdf (pms) where T(y) = (T 1 (y), . . . , T k (y)) T is a vector of known real-valued functions, h(y) ≥ 0 is a nonnegative known function and η ∈ R k is the vector of natural parameters, taking values in where the integral is replaced by a sum when Y is discrete. The set H of the natural parameter space is assumed to be open and convex in R k , and ψ a strictly convex function defined on H. Moreover, we assume ψ(η) is convex and infinitely differentiable in H. In particular, where ∇ 2 ψ is the k × k matrix of second derivatives of ψ. Since ψ is strictly convex, var η (T(Y)) is non-singular for every η ∈ H.

Multivariate generalized linear models
Let Z = (X, Y) be a random vector, where now Y ∈ R q is a multivariate response and X ∈ R p is a vector of predictors. The multivariate generalized linear model postulates that the conditional distribution of Y given X belongs to some fixed exponential family and hypothesizes that the k-vector of natural parameters, which we henceforth call η x to emphasize the dependence on x, depends linearly on the vector of predictors. Thus, the pdf (pms) of Y | X = x is given by where η x ∈ R k depends linearly on x.
Frequently, a subset of the natural parameters depends on x, whereas its complement does not. The normal linear model with constant variance is such an example. To accommodate this structure, we partition the vector η x indexing model (12) into η x1 and η x2 , with k 1 and k 2 components, and assume that H, the natural parameter space of the exponential family, is We also assume that where , denote a generic vector and ξ 0 the true parameter. Suppose n independent and identically distributed copies of Z = (X, Y) satisfying Equations (12) and (13), with true parameter vector ξ 0 , are available. Given a realization z i = (x i , y i ), i = 1, . . . , n, the conditional log-likelihood, up to a factor that does not depend on the parameter of interest, is Let By definition (3), the maximum likelihood estimator (MLE) of the parameter indexing model (12) subject to (13), if it exists, is an M-estimator. Theorem 3.1 next establishes the existence, consistency and asymptotic normality ofξ n , the MLE of ξ 0 . Theorem 3.1: Assume that Z = (X, Y) satisfies model (12) subject to (13) with true parameter ξ 0 . Under regularity conditions (A20), (A21), (A22) and (A23) in the Appendix, the maximum likelihood estimate of ξ 0 ,ξ n , exists, is unique and converges in probability to ξ 0 . Moreover, √ n(ξ n − ξ 0 ) is asymptotically normal with covariance matrix where and ∇ 2 ψ was defined in Equation (11).

Partial reduced rank multivariate generalized linear models
When the number of natural parameters or the number of predictors is large, the precision of the estimation and/or the interpretation of results can be adversely affected. A way to address this is to assume that the parameters live in a lower dimensional space. That is, we assume that the vector of predictors can be partitioned as x = (x T 1 , x T 2 ) T with x 1 ∈ R r and x 2 ∈ R p−r , and that the parameter corresponding to x 1 , β 1 ∈ R k 1 ×r , has rank d < min{k 1 , r}. In this way, the natural parameters η x in Equation (12) are related to the predictors via where Following Yee and Hastie [6], we refer to the exponential conditional model (12) subject to the restrictions imposed in Equation (17) as partial reduced rank multivariate generalized linear model. The reduced-rank multivariate generalized linear model is a special case of model (17) with β 2 = 0.
To obtain the asymptotic distribution of the M-estimators for this reduced model, we will show that Conditions 2.2-2.6 are satisfied for res , ( , g), M and (S, h), which are defined next. To maintain consistency with notation introduced in Section 2.1, we vectorize each matrix involved in the parametrization of our model and reformulate the parameter space accordingly for each vectorized object. With this understanding, we use the symbol ∼ = to indicate that a matrix space component in a product space is identified with its image through the operator vec : R m×n → R mn . In the sequel, to keep the notation as simple as possible, we concatenate column vectors without transposing them; that is, we write (a, b) for (a T , b T ) T . Moreover, we write ξ = (η 1 , β,η 2 ), with the understanding that β stands for vec(β).
For the non-restricted problem, β 1 belongs to R k 1 ×r , so that the entire parameter ξ = (η 1 , β 1 , β 2 ,η 2 ) belongs to However, for the restricted problem, we assume that the true parameter ξ 0 = (η 01 , β 01 , β 02 ,η 02 ) belongs to Let and consider g : . Without loss of generality, we assume that β 01 ∈ R k 1 ×r first,d , the set of matrices in R k 1 ×r d whose first d rows are linearly independent. Therefore, This is trivial since β 01 ∈ R k 1 ×r d and its first d rows are linearly independent. Consider Finally, let and h : S → M be the map Under the rank restriction on β 1 in Equation (17), the existence of the maximum likelihood estimator cannot be guaranteed in the sense of an M-estimator as defined in Equation (3) with m ξ = m (β;η) in Equation (15), and replaced by res in Equation (18). However, we can work with a strong M-estimator sequence for the criterion function P n m (β;η) over res using Lemma 2.3. Theorem 3.3 states our main contribution. (12) subject to (17), with ξ 0 ∈ res defined in Equation (19). Then, there exists a strong maximizing sequence for the criterion function P n m ξ over res for m ξ = m (β;η) defined in Equation (15). Moreover, any weak M-estimator sequence {ξ res n } converges to ξ 0 in probability.
If {ξ res n } is a strong M-estimator sequence, then where W ξ 0 is defined in Equation (16) and with β 01 = C 0 B 0 , C 0 ∈ R k 1 ×d and B 0 ∈ R d×r any decomposition of β 01 .

Remark 3.4:
The asymptotic variance of the estimator does not depend on the specific decomposition of β 01 . That is, for another β 01 =C 0B0 , even if (26) changes, (25) remains the same.

Remark 3.5:
A plug-in procedure can be used to estimate the elements of the matrix in Equation (25) and then, appealing to Slutzky's Theorem, asymptotic confidence regions forξ res n can be constructed. Also, bootstrap variance estimates can be computed by sampling with replacement from the empirical distribution, leading to so-called normal bootstrap intervals (see, e.g. [16], Section 8.3).

Remark 3.6:
) W ξ 0 are non-negative, so that using partial reduced-rank multivariate generalized linear models results in efficiency gain.

Application: marital status in a workforce study
Yee and Hastie [6] analyse data from a self-administered questionnaire collected in a large New Zealand workforce observational study conducted during 1992-1993. For homogeneity, the analysis was restricted to a subset of 4105 European males with no missing values in any of the variables used. Yee and Hastie [6] were interested in exploring whether certain lifestyle and psychological variables were associated with marital status, especially separation/divorce. The response variable is Y = maritalstatus, with levels 1 = single, 2 = separatedordivorced, 3 = widower, and 4 = marriedorlivingwithapartner. The married/partnered are the reference group. Data on 14 regressors were collected, 12 of which are binary (1/0 for presence/absence, respectively). These have been coded so that their presence is negative healthwise. Their goal was to investigate if and how these 12 unhealthy variables were related to Y , adjusting for age and level of education. The variables are described in Table 1.
A categorical response Y taking values 1,2,3,4 with probabilities pr(Y = i) = p i can be expressed as a multinomial vector to fit the generalized linear model presented in this paper, The pdf of Y can be written as where y = (y 1 , y 2 , y 3 , y 4 ), T(y) = (y 1 , y 2 , Let X be the vector of predictor variables and consider The dependence of p i on x will not be made explicit in the notation. As in [6] we fit a multinomial regression model to Y | X = x, as in Equation (27), with whereη ∈ R 3 is the intercept and β ∈ R 3×14 is the coefficient matrix, so that there are 3 × 14 + 3 = 42 + 3 = 45 parameters to estimate. When a multinomial linear model is fitted to the data at level 0.05, age30 and binge are significant for log(p 1 /p 4 ), smokehow and tense for log(p 2 /p 4 ), and only age30 is significant for log(p 3 /p 4 ).  We next fitted a partial reduced rank multivariate generalized linear model, where the two continuous variables, age30 and logedu1, were not subject to restriction. That is, where x 2 represent the continuous variables and x 1 the 12 binary predictors. The AIC criterion estimates the rank of β 1 in Equation (29) to be one (see [6]). Using the asymptotic results from the current paper, Duarte [17] developed a test based on Bura and Yang [18] that also estimates the dimension to be 1. Therefore, in our notation, q = 4, k = k 1 = 3, p = 14, r = 2, d = 1, and β 1 = AB, A : 3 × 1 y B : 1 × 12, β 2 : 3 × 2 andη : 3 × 1. The rank restriction results in a drastic reduction in the total number of parameters from 45 to 24. The reduction in the estimation burden is also reflected in how tight the confidence intervals are compared with those in the unrestricted model, as can be seen in Table 3 and Table 2 in [6]. As a consequence the variables nervous, hurt, which are not significant in the unrestricted generalized linear model, are significant in the reduced (29). Furthermore, some variables, such as binge, smokenow, nervous and tense, are now significant for all responses.
All significant coefficients are positive. These correspond to the variables binge, smokenow, nervous, tense and hurt for single and divorced/separated groups. Since the positive value of the binary variables indicates poor lifestyle and negative psychological characteristics, our analysis concludes that for men with these features, the chance of being single, divorced or widowed is higher than the chance of being married, adjusting for age and education. Also, the coefficients corresponding to the response log(p 3 /p 4 ) are twice as large as those of log(p 1 /p 4 ), suggesting the effect of the predictors differs in each group. All computations were performed using the R package VGAM, developed by Yee [15]. The R script to reproduce the data analysis in Section 4 can be accessed at supplemental data.

Discussion
With the exception of the work of Yee and his collaborators on VGLMs [6,7,19], where the distribution of the response can be any member of the multivariate exponential family, reduced-rank regression has been almost exclusively restricted to regressions with continuous response variables. Estimation methods and the corresponding software for general partial reduced rank multivariate generalized linear models were developed by Yee and Hastie [6] and Yee [7]. Yet, the distribution and statistical properties of the estimators were not obtained. In this paper we fill this gap by developing asymptotic Table 3. MLEs with their standard errors in parentheses for the full rank generalized linear model (first two rows) and the partial reduced rank generalized linear model (last two rows). theory for the restricted rank maximum likelihood estimates of the parameter matrix in multivariate GLMs.
To illustrate the potential impact of our results, we refer to the real data analysis example in Section 4. In order to assess the significance of the predictors, Yee and Hastie [6] calculate the standard errors for the coefficient matrix factors, A and B, independently and can only infer about the significance of the components of the matrix A and the components of the matrix B separately. The asymptotic distribution for either estimator is obtained assuming that the other is fixed and known. In this way, they first analyse ν = Bx 1 to check which predictors are significant and then Aν to examine how they influence each response. Their standard errors are ad-hoc and it is unclear what the product of standard errors measures as relates to the significance of the product of the components of the coefficient matrix β 1 = AB. Moreover, this practical ad-hoc approach cannot readily be extended when d > 1.
Using the results of Theorem 3.3, we can obtain the errors of each component of the coefficient matrices A, B simultaneously, and assess the statistical significance of each predictor on each response. Using the ad-hoc approach of Yee and Hastie [6], a predictor can only be found to be significant across all responses. For example, Yee and Hastie [6] find the predictor hurt to be significant for all three groups (single, divorced/separated, widower). On the other hand, we can assess the significance of any response/predictor combination. Thus, we find hurt to be significant for single and divorced/separated groups, but not for widower men group (see Table 3).
A potential future direction for our approach was brought to our attention by a referee. The computational cost for fitting a reduced rank multinomial logistic regression can be very high. Powers et al. [20] proposed replacing the rank restriction with a restriction on the nuclear norm which amounts to a convex relaxation of the reduced-rank multinomial regression problem. Our methodology can be adapted to obtain asymptotic inference for the regularized parameter estimates.

Acknowledgement
Mariela Sued acknowledges the support of the Abdus Salam International Center for Theoretical Physics (ICTP), where part of this research was carried out.

Disclosure statement
No potential conflict of interest was reported by the authors.

Supplemental data and underlying research materials
The R script to reproduce the data analysis in Section 4 can be accessed at supplemental data.

Appendix. Proofs
The consistency of M-estimators has long been established (see, for instance, Theorem 5.7 in [14]). The functions M(ξ ) and M n (ξ ) are defined in Equation (2). Typically, the proof for the consistency of M-estimators assumes that ξ 0 , the parameter of interest, is a well-separated point of maximum of M, which is ascertained by assumptions We need to prove that there exists a unique maximizer of M n (ξ ) = P n ξ , and that it converges to the maximizer of M(ξ ) = Pm ξ . We first consider the deterministic case ignoring for the moment that {M n } is a sequence of random functions.
Turning = 0 on c n . Then, since pr( n ) → 1, {ξ res n } is a strong M-estimator for the criterion function M n over res .
Proof of Proposition 2.4: In Proposition A.1 we established the existence of a unique maximizerξ n for the criterion function M n over . We can now invoke Lemma 2.3 to guarantee the existence ofξ res n , a strong M-estimator for the criterion function M n over res . Let {ξ res n } be any strong M-estimator for the criterion function M n over res . We start from the deterministic case: where M n is defined in Equation (2) and A n is a sequence of real numbers with A n → 0. As in the proof of Proposition A.1, define ζ n (ξ ) = M n (ξ ) − M n (ξ 0 ) to obtain that, for 0 small enough, for n large enough. Under Condition 2.2, ξ 0 ∈ res , and therefore, by Equation (A9), Since A n → 0, −A n > −δ(ε 0 ) for n large enough. Combining this with Equations (A10) and (A11) obtains We will deduce that ξ res n − ξ 0 < ε 0 , once we prove that sup Now, let us prove Equation (A12). Choose ξ with ξ − ξ 0 > ε 0 , and take t ∈ (0, 1) such thatξ = (1 − t)ξ n + tξ is a distance ε 0 from ξ 0 , whereξ n is the maximizer of ζ n over , as defined in Proposition A.1, which is assumed to be at distance smaller than ε 0 from ξ 0 . Then, Thus, sup which in turn yields (A12). When A n = o p (1), convergence in probability of {ξ res n } to ξ 0 is equivalent to the existence of an almost everywhere convergent sub-sub-sequence for any subsequence {ξ res n k }. Therefore, by applying the deterministic result to the set of probability one, where there exists a sub-subsequence of A n k that converges to zero a.e. we obtain the result.

Condition A.2: For each ξ in an open subset of a Euclidean space, m ξ (z) is a measurable function in z such that m ξ (z)
is differentiable in ξ 0 for almost every z with derivativeṁ ξ0 (z).

Condition A.3:
There exists a measurable function φ(z) with Pφ 2 < ∞ such that, for any ξ 1 and ξ 2 in a neighbourhood of ξ 0 , Moreover, √ n(ξ n − ξ 0 ) is asymptotically normal with mean zero and This result will be invoked in the following proofs.
Proof of Proposition 2.7: Assume that {ξ res n } is a sequence in res that converges in probability to ξ 0 ∈ M, which is assumed to be open in res . Then, pr(ξ res where ∇h ∞,Ns 0 denotes the maximum of ∇h(s) in a neighbourhood N s0 of s 0 . The first inequality holds because such condition is valid in the unconstrained problem and the second inequality follows since h is continuously differentiable at s 0 . Thus, the Lipschitz Condition A.3 is satisfied.
For Condition A.4, we observe that the function s → Pm h(s) is twice continuously differentiable in s 0 because both Pm ξ and h(s) satisfy the required regularity properties at ξ 0 and s 0 , respectively. Moreover, since Pṁ ξ0 = 0, the second derivative matrix of Pm h(s) at s 0 , is W s0 = ∇h(s 0 ) T V ξ0 ∇h(s 0 ), where V ξ0 is the second derivative matrix of Pm ξ at ξ 0 . The matrix W s0 is non-singular and symmetric because ∇h(s 0 ) is full rank and V ξ0 is non-singular and symmetric.
We can now apply Theorem 5.23 in [14], and obtain so that the first-order Taylor series expansion of h(s * n ) around s 0 is for ξ0(−Vξ 0 ) and IF ξ0 (z) defined in Equations (6) and (5), respectively, which obtains the expansion in Equation (7). Now, Equation (8) follows immediately.

Proof of Theorem 3.1: Write
Then, in matrix form, where and ξ = (η T 1 , vec T (β),η T 2 ) T is the vector of parameters of model (13). Note that F(x)ξ ∈ H for any value of ξ with H defined in Equation (10). This notation allows to simplify the expression for the log-likelihood function in Equation (15), and replace it with The regularity conditions required to derive consistency and asymptotic distribution of the MLE are: For any ξ and any η and for any compact To prove the existence, uniqueness and consistency of the MLE under the present model,ξ n , we next show that the assumptions stated in Proposition A.1 are satisfied.
The strict convexity of ψ implies that, for each fixed z, m ξ (z) is a strictly concave function in ξ ∈ = R k1+pk1 × H 2 .
The concavity of m ξ (z) is preserved under expectation, thus M(ξ ) = Pm ξ is concave. The identifiability condition satisfied by the exponential family in Section 3.1 allows applying Lemma 5.35 of van der Vaart [14, p.62] to conclude that Taking expectation with respect to X, we conclude that Pm ξ ≤ Pm ξ0 for any ξ . Moreover, if Pm ξ1 = Pm ξ0 , pr(F(X)ξ 1 = F(X)ξ 0 ) = 1, which contradicts the hypothesis (A20). Finally, the integrability of Equation (A1) follows from Equation (A21). The conditions A.2-A.4 required by van der Vaart's Theorem 5.23 to derive the asymptotic distribution of Mestimators are easily verifiable under the integrability assumptions stated in Equations (A22) and (A23).
The second derivative matrix of Pm ξ at ξ 0 is Finally, observe that to deduce that, according to the general formula for the asymptotic variance of an M-estimator in (A15), the asymptotic variance of √ n(ξ n − ξ 0 ) is given by (16).
Lemmas A.5-A.10 and Corollary A.6 are required to prove Proposition 3.2.
Lemma A.5: Assume that β 01 ∈ R k1×r first,d and can be written as in Equation (21 Thus, T 0 = U −1 B 0 , and since T 0 T T 0 ∈ R d×d and rank d, Corollary A.6: Let β 01 ∈ R k1×r first,d and can be written as in Equation (21). Proof: We will show that the complement of R d×m d is closed. Consider (T n ) n≥1 ∈ R d×m , each T n of rank strictly less than d, and assume that T n converges to T ∈ R d×m . Note that, |T n T T n | = 0 for all n ≥ 1, so that |TT T | is also equal to zero. Hence, rank(T) < d.
with T n1 ∈ R d×r and T n2 ∈ R (k1−d)×r , we obtain that | T n1 T T n1 |= 0 for all n. Then | T 1 T T 1 |= 0, where T 1 comprises of the first d rows of T, and therefore the first d rows of T are not linearly independent.