Nonparametric homogeneity pursuit in functional-coefficient models

ABSTRACT This paper explores the homogeneity of coefficient functions in nonlinear models with functional coefficients and identifies the underlying semiparametric modelling structure. With initial kernel estimates, we combine the classic hierarchical clustering method with a generalised version of the information criterion to estimate the number of clusters, each of which has a common functional coefficient, and determine the membership of each cluster. To identify a possible semi-varying coefficient modelling framework, we further introduce a penalised local least squares method to determine zero coefficients, non-zero constant coefficients and functional coefficients which vary with an index variable. Through the nonparametric kernel-based cluster analysis and the penalised approach, we can substantially reduce the number of unknown parametric and nonparametric components in the models, thereby achieving the aim of dimension reduction. Under some regularity conditions, we establish the asymptotic properties for the proposed methods including the consistency of the homogeneity pursuit. Numerical studies, including Monte-Carlo experiments and two empirical applications, are given to demonstrate the finite-sample performance of our methods.


Introduction
We consider the functional-coefficient model defined by $$Y_t = X_t^\top \beta^0(U_t) + \varepsilon_t, \quad t = 1, \ldots, n, \qquad (1)$$ where $Y_t$ is a response variable, $X_t = (X_{t1}, \ldots, X_{tp})^\top$ is a $p$-dimensional vector of random covariates, $\beta^0(\cdot) = [\beta^0_1(\cdot), \ldots, \beta^0_p(\cdot)]^\top$ is a $p$-dimensional vector of functional coefficients, $U_t$ is a univariate index variable, and $\varepsilon_t$ is an independent and identically distributed (i.i.d.) error term. The functional-coefficient model is a natural extension of the classic linear regression model: it allows the regression coefficients to vary with a certain index variable and thus captures a flexible dynamic relationship between the response and the covariates. In recent years, there have been extensive studies on estimation and model selection for model (1) and its various generalised versions (see, e.g. Fan and Zhang 1999; Cai, Fan, and Yao 2000; Xia, Zhang, and Tong 2004; Fan and Zhang 2008; Wang and Xia 2009; Kai, Li, and Zou 2011; Park, Mammen, Lee, and Lee 2015).
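As an illustration, data from model (1) with a two-cluster coefficient structure can be simulated as follows (a minimal sketch; the coefficient functions, error scale and design below are illustrative choices, not those used in the paper):

```python
import numpy as np

def simulate_fcm(n, beta_funcs, noise_sd=0.5, seed=None):
    """Draw (U_t, X_t, Y_t) from Y_t = X_t' beta0(U_t) + eps_t with i.i.d. errors."""
    rng = np.random.default_rng(seed)
    p = len(beta_funcs)
    U = rng.uniform(0.0, 1.0, size=n)                # index variable on [0, 1]
    X = rng.normal(size=(n, p))                      # covariates
    B = np.column_stack([f(U) for f in beta_funcs])  # beta0_j(U_t), an n x p matrix
    Y = np.sum(X * B, axis=1) + rng.normal(scale=noise_sd, size=n)
    return U, X, Y

# two clusters: coefficients 1-2 share sin(2*pi*u), coefficients 3-4 share a constant
funcs = [lambda u: np.sin(2 * np.pi * u)] * 2 + [lambda u: np.full_like(u, 1.5)] * 2
U, X, Y = simulate_fcm(200, funcs, seed=1)
```

Here the first two coefficient functions are identical, as are the last two, which is exactly the homogeneity structure pursued below.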
However, when the number of functional coefficients is large or moderately large, it is well known that a direct nonparametric estimation of the potentially $p$ different coefficient functions in model (1) would be unstable. To address this issue, there have been extensive studies in the literature on selecting significant variables in functional-coefficient models (Fan, Ma, and Dai 2014; Liu, Li, and Wu 2014) or exploring certain rank-reduced structures in functional coefficients (Jiang, Wang, Xia, and Jiang 2013; Chen, Li, and Xia 2019), both of which aim to reduce the dimension of the unknown functional coefficients and improve estimation efficiency. In this paper, we consider a different approach: we assume that there is a homogeneity structure on model (1), so that the individual functional coefficients can be grouped into a number of clusters and the coefficients within each cluster have the same functional pattern. Throughout the paper, we assume that the dimension $p$ may depend on the sample size $n$ and can diverge with $n$, but the number of unknown clusters is fixed and much smaller than $p$. The dimension reduction through homogeneity pursuit is more general than the commonly used sparsity assumption in high-dimensional functional-coefficient models (cf. Fan et al. 2014; Liu et al. 2014; Zhang 2015; Lee and Mammen 2016), as the latter can be seen as a special case of the former with a very large group of zero coefficients. Specifically, we assume the following homogeneity structure on model (1): there exists a partition of $\{1, 2, \ldots, p\}$, denoted by $\{\mathcal{C}^0_1, \ldots, \mathcal{C}^0_{K_0}\}$, such that $$\beta^0_i(\cdot) = \alpha^0_k(\cdot) \quad \text{for} \ i \in \mathcal{C}^0_k, \ k = 1, \ldots, K_0, \qquad (2)$$ where the Lebesgue measure of $\{u \in \mathcal{U}: \alpha^0_{k_1}(u) - \alpha^0_{k_2}(u) \neq 0\}$ is positive and bounded away from zero for any $1 \le k_1 \neq k_2 \le K_0$, and $\mathcal{U}$ is the compact support of the index variable $U_t$. Furthermore, some of the functional coefficients $\alpha^0_k(\cdot)$ are allowed to have constant values, in which case model (1) is semiparametric with a combination of constant and functional coefficients.
Our aim is to (i) explore homogeneity structure (2) by estimating the unknown number of clusters K 0 and identifying members of the clusters C 0 1 , . . . , C 0 K 0 ; and (ii) identify the clusters of constant coefficients and those of coefficients varying with U t and estimate their unknown values.
The topic investigated in our paper has two close relatives in the existing literature. On the one hand, the functional-coefficient regression with the homogeneity structure is a natural extension of linear regression with homogeneity structure, which has received increasing attention in recent years. For example, Tibshirani, Saunders, Rosset, Zhu, and Knight (2005) introduce the so-called fused LASSO method to study slope homogeneity; Bondell and Reich (2008) propose the OSCAR penalised method for grouping pursuit; Shen and Huang (2010) use a truncated L 1 penalised method to extract the latent grouping structure; and Ke, Fan, and Wu (2015) propose the CARDS method to identify the homogeneity structure and estimate the parameters simultaneously. On the other hand, this paper is also relevant to some recent literature on longitudinal/panel data model classification. For example, Ke, Li, and Zhang (2016) and Su, Shi, and Phillips (2016) consider identifying the latent group structure for linear longitudinal data models by using the binary segmentation and shrinkage method, respectively; Vogt and Linton (2017) introduce a kernel-based classification of univariate nonparametric regression functions in longitudinal data; and Su, Wang, and Jin (2019) propose a penalised sieve estimation method to identify latent grouping structure for time-varying coefficient longitudinal data models. The methodology of nonparametric homogeneity pursuit developed in this paper will be substantially different from those in the aforementioned literature.
In this paper, we first estimate each functional coefficient in model (1) by using the kernel smoothing method while ignoring homogeneity structure (2), and calculate the $L_1$-distance between the estimated functional coefficients. Then, we combine the classic hierarchical clustering method and a generalised version of the information criterion to explore homogeneity structure (2), i.e. to estimate $K_0$ and the members of $\mathcal{C}^0_k$, $k = 1, \ldots, K_0$. Under some mild conditions, we show that the developed estimators of the number $K_0$ and the index sets $\mathcal{C}^0_k$, $k = 1, \ldots, K_0$, are consistent. After estimating structure (2), we further estimate a semi-varying coefficient modelling framework by determining the zero coefficients, non-zero constant coefficients and functional coefficients varying with the index variable. This is done by using a penalised local least squares method, where the penalty function is the weighted LASSO with the weights defined as derivatives of the well-known SCAD penalty introduced by Fan and Li (2001). With the nonparametric cluster analysis and the penalised approach, we can reduce the number of unknown components in model (1) from $p$ to at most $K_0$ (and to $K_0 - 1$ if a cluster of zero coefficients exists in the model). Furthermore, the choice of the tuning parameters in the proposed estimation approach and the computational algorithm is also discussed. The simulation studies show that the proposed methods have reliable finite-sample numerical performance. We finally apply the model and methodology to analyse the Boston house price data and the plasma beta-carotene level data, and find that the original nonparametric functional-coefficient models can be simplified and the number of unknown components involved can be reduced. In particular, the out-of-sample mean absolute prediction errors of our approach are usually much smaller than those of the naive kernel method which ignores the latent homogeneity structure.
The rest of the paper is organised as follows. Section 2 introduces the clustering method, information criterion and penalised method to determine the unknown clusters and estimate the unknown components. Section 3 establishes the asymptotic theory for the proposed clustering and estimation methods. Section 4 discusses the choice of the tuning parameters and introduces an algorithm for computing the penalised estimates. Section 5 reports Monte-Carlo simulation studies. Section 6 gives the empirical applications to the Boston house price data and the plasma beta-carotene level data. Section 7 concludes the paper. The proofs of the main asymptotic theorems are given in a supplemental document.

Methodology
In this section, we first introduce a clustering method for kernel estimated functional coefficients in Section 2.1, followed by a generalised information criterion for determining the number of clusters in Section 2.2, and finally, propose a penalised local linear estimation approach to identify the semi-varying coefficient modelling structure in Section 2.3.
Clustering of kernel estimated functional coefficients

Let $K(\cdot)$ be a kernel function and $h$ a bandwidth which tends to zero as the sample size $n$ diverges to infinity. The preliminary kernel estimate $\tilde\beta(u_0)$ can be expressed as follows: $$\tilde\beta(u_0) = \bigg[\sum_{t=1}^n X_t X_t^\top K\Big(\frac{U_t - u_0}{h}\Big)\bigg]^{-1} \sum_{t=1}^n X_t Y_t K\Big(\frac{U_t - u_0}{h}\Big), \qquad (3)$$ where $u_0$ is on the support of the index variable. Note that other commonly used nonparametric estimation methods, such as the local polynomial method (Fan and Gijbels 1996) and the B-spline method (Green and Silverman 1994), are also applicable for obtaining the preliminary estimates. Without loss of generality, we let $\mathcal{U} = [0, 1]$ be the compact support of the index variable $U_t$. Define $$\tilde\Delta_{ij} = \frac{1}{n} \sum_{t=1}^n \big|\tilde\beta_i(U_t) - \tilde\beta_j(U_t)\big| I(U_t \in \mathcal{U}_h), \qquad (4)$$ where $\tilde\beta_i(\cdot)$ is the $i$th element of $\tilde\beta(\cdot)$ defined in (3), $I(\cdot)$ is the indicator function and $\mathcal{U}_h = [h, 1 - h]$. The aim of truncating the observations outside $\mathcal{U}_h$ is to overcome the so-called boundary effect in the kernel estimation. Noting that $h \to 0$, the set $\mathcal{U}_h$ can be sufficiently close to $\mathcal{U}$, and thus the information loss is negligible. In fact, $\tilde\Delta_{ij}$ can be viewed as a natural estimate of $$\Delta^0_{ij} = \int_{\mathcal{U}} \big|\beta^0_i(u) - \beta^0_j(u)\big| f_U(u)\, \mathrm{d}u, \qquad (5)$$ where $f_U(\cdot)$ is the density function of $U_t$. Under some smoothness conditions on $\beta^0_i(\cdot)$ and $f_U(\cdot)$, we may show that $\tilde\Delta_{ij}$ converges to $\Delta^0_{ij}$ uniformly over $1 \le i, j \le p$. From (2) and (5), we have $\Delta^0_{ij} = 0$ for $i, j \in \mathcal{C}^0_k$, and $\Delta^0_{ij} \neq 0$ for $i \in \mathcal{C}^0_{k_1}$ and $j \in \mathcal{C}^0_{k_2}$ with $k_1 \neq k_2$. Then, we define a distance matrix among the functional coefficients, denoted by $\Delta^0$, whose $(i, j)$-entry is $\Delta^0_{ij}$. The corresponding estimated distance matrix, denoted by $\tilde\Delta_n$, has entries $\tilde\Delta_{ij}$ defined in (4). It is obvious that both $\Delta^0$ and $\tilde\Delta_n$ are $p \times p$ symmetric matrices whose main diagonal elements are zero.
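A minimal numerical sketch of the preliminary estimate in (3) and the estimated distance matrix with entries (4), using the local-constant form and the Epanechnikov kernel (both assumed here for illustration):

```python
import numpy as np

def kernel_beta(u0, U, X, Y, h):
    """Local-constant kernel estimate of beta0(u0), in the spirit of (3)."""
    w = 0.75 * np.maximum(1.0 - ((U - u0) / h) ** 2, 0.0)  # Epanechnikov weights
    A = (X * w[:, None]).T @ X
    b = X.T @ (w * Y)
    return np.linalg.solve(A, b)

def l1_distance_matrix(U, X, Y, h):
    """Estimated L1 distances between coefficient curves, truncated to [h, 1-h]."""
    pts = U[(U >= h) & (U <= 1.0 - h)]
    Bhat = np.array([kernel_beta(u, U, X, Y, h) for u in pts])  # m x p
    p = X.shape[1]
    D = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            D[i, j] = D[j, i] = np.mean(np.abs(Bhat[:, i] - Bhat[:, j]))
    return D
```

Within-cluster entries of the resulting matrix should be close to zero, while between-cluster entries stay bounded away from zero, which is what the clustering step exploits.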
We next use the well-known agglomerative hierarchical clustering method to explore the homogeneity among the functional coefficients. This clustering method starts with $p$ singleton clusters, corresponding to the $p$ functional coefficients. At each stage, the two clusters with the smallest distance are merged into a new cluster, and this continues until only one full cluster remains. Such a clustering approach has been widely studied in the literature on cluster analysis (cf. Everitt, Landau, Leese, and Stahl 2011; Rencher and Christensen 2012). However, to the best of our knowledge, there is virtually no work combining the agglomerative hierarchical clustering method with the kernel smoothing of functional coefficients in nonparametric homogeneity pursuit; this paper fills that gap. Specifically, the algorithm is described as follows, where the number of clusters $K_0$ is assumed to be known. Section 2.2 will introduce an information criterion to determine the number $K_0$.
(1) Start with $p$ clusters, each containing one functional coefficient, and search for the smallest distance among the off-diagonal elements of $\tilde\Delta_n$.
(2) Merge the two clusters with the smallest distance, then re-calculate the distances between clusters and update the distance matrix. Here the distance between two clusters $A$ and $B$ is defined as the farthest distance between a point in $A$ and a point in $B$, which is called the complete linkage.
(3) Repeat Steps 1 and 2 until the number of clusters reaches $K_0$.
Let $\tilde{\mathcal{C}}_1, \ldots, \tilde{\mathcal{C}}_{K_0}$ be the estimated clusters obtained via the above algorithm when the true number of clusters is known a priori. More generally, if the number of clusters is assumed to be $K$ with $1 \le K \le p$, we stop the above algorithm when the number of clusters reaches $K$, and let $\tilde{\mathcal{C}}_{1|K}, \ldots, \tilde{\mathcal{C}}_{K|K}$ be the estimated clusters.
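Steps 1-3 amount to standard complete-linkage agglomerative clustering on the distance matrix; a plain sketch (written for clarity, not efficiency) is:

```python
import numpy as np

def complete_linkage_clusters(D, K):
    """Agglomerative clustering with complete linkage until K clusters remain.

    D: symmetric (p x p) distance matrix; returns a list of index lists."""
    clusters = [[i] for i in range(D.shape[0])]
    while len(clusters) > K:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete linkage: farthest pair between the two clusters
                d = max(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Running the merge loop to `K = 1` and recording the merge heights would give the full dendrogram, from which the clusters for any `K` can be read off.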

Estimation of the cluster number
In practice, the true number of clusters is usually unknown and needs to be estimated. When the number of clusters is assumed to be $K$, we define the post-clustering kernel estimation of the functional coefficients: $$\hat\alpha_K(u_0) = \bigg[\sum_{t=1}^n \tilde X_{t,K} \tilde X_{t,K}^\top K\Big(\frac{U_t - u_0}{h}\Big)\bigg]^{-1} \sum_{t=1}^n \tilde X_{t,K} Y_t K\Big(\frac{U_t - u_0}{h}\Big), \qquad (6)$$ where $\tilde X_{t,K} = (\tilde X_{t,1|K}, \ldots, \tilde X_{t,K|K})^\top$ with $\tilde X_{t,k|K} = \sum_{j \in \tilde{\mathcal{C}}_{k|K}} X_{tj}$, and $\tilde{\mathcal{C}}_{k|K}$ is defined as in Section 2.1. When the number $K$ is larger than $K_0$, $\hat\alpha_K(\cdot)$ is still a uniformly consistent kernel estimate of the functional coefficients (cf. the proof of Theorem 3.2 in the appendix); but when $K$ is smaller than $K_0$, the clustering approach in Section 2.1 results in a misspecified functional-coefficient model, and $\hat\alpha_K(\cdot)$ constructed in (6) can be viewed as the kernel estimate of the 'quasi' functional coefficients which will be defined in (14). We define the following objective function: $$\mathrm{IC}(K) = \log \hat\sigma^2_n(K) + (nh)^{-\rho} K, \quad \hat\sigma^2_n(K) = \frac{1}{n} \sum_{t=1}^n \big[Y_t - \tilde X_{t,K}^\top \hat\alpha_K(U_t)\big]^2 I(U_t \in \mathcal{U}_h), \qquad (7)$$ with $0 < \rho < 1$, and determine the number of clusters through $$\hat K = \arg\min_{1 \le K \le \bar K} \mathrm{IC}(K), \qquad (8)$$ where $\bar K$ is a pre-specified finite positive integer which is larger than $K_0$. In practical applications, $\bar K$ can be chosen to equal the dimension of the covariates $p$ if the latter is either fixed or moderately large. If we choose $\rho$ close to 1 and treat $nh$ as the 'effective' sample size, the above criterion would be similar to the classic Bayesian information criterion introduced by Schwarz (1978). Su et al. (2016) use a similar information criterion to determine the group number in linear longitudinal data models. The Bayesian information criterion has been extended to the nonparametric framework in recent years (cf. Wang and Xia 2009).
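To fix ideas, the selection rule for $\hat K$ can be sketched as below, assuming a BIC-type per-cluster penalty of the form $(nh)^{-\rho}$ (an assumption for illustration; the paper's exact penalty may differ) and taking the post-clustering residual variances as given:

```python
import numpy as np

def select_num_clusters(sigma2, n, h, rho=0.9):
    """Return the K minimising log(sigma2[K]) + K * (n*h)**(-rho).

    sigma2: dict mapping each candidate K to the residual variance of the
    post-clustering fit; the (n*h)**(-rho) penalty is an assumed BIC-type form."""
    ic = {K: np.log(s2) + K * (n * h) ** (-rho) for K, s2 in sigma2.items()}
    return min(ic, key=ic.get)
```

The residual variance drops sharply until the candidate $K$ reaches the true number of clusters and flattens afterwards, so the penalised criterion bottoms out at the true value.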

Penalised local linear estimation
We next introduce a penalised approach to further identify the clusters with non-zero constant coefficients and the cluster with zero coefficients. For notational simplicity, we let $\tilde X_t = \tilde X_{t,\hat K}$ and let $\hat\alpha(u_0) = [\hat\alpha_1(u_0), \ldots, \hat\alpha_{\hat K}(u_0)]^\top$ be defined similarly to $\hat\alpha_K(u_0)$ in (6) with $K = \hat K$. Throughout the paper, we call $\hat\alpha(\cdot)$ the post-clustering kernel estimator. It is obvious that identifying the constant coefficients is equivalent to identifying the functional coefficients such that either their derivatives are zero or the deviations of the functional coefficients, $D^0_k$, are zero (cf. Li et al. 2015). In practice, we may estimate the deviations of the functional coefficients by $\tilde D_k$, $k = 1, \ldots, \hat K$. Let $A = [a_1, \ldots, a_n]$ with $a_t = (a_{t1}, \ldots, a_{t\hat K})^\top$. As in Li et al. (2015), we define the penalised objective function $Q_n(A, B)$ as in (9), where $\|\cdot\|$ denotes the Euclidean norm, $\lambda_1$ and $\lambda_2$ are two tuning parameters, and $p'_\lambda(\cdot)$ is the derivative of the SCAD penalty function (Fan and Li 2001): $$p'_\lambda(z) = \lambda \Big[I(z \le \lambda) + \frac{(a_* \lambda - z)_+}{(a_* - 1)\lambda} I(z > \lambda)\Big], \quad z > 0.$$ Following Fan and Li's (2001) recommendation, we choose $a_* = 3.7$ in this paper. Let $(\hat A, \hat B)$ be the minimiser of the objective function $Q_n(A, B)$ defined in (9). Through the penalisation, we would expect $\hat A_k = 0$ when $\tilde{\mathcal{C}}_{k|\hat K}$ is the estimated cluster with zero coefficients, and $\hat B_k = 0$ when $\tilde{\mathcal{C}}_{k|\hat K}$ is an estimated cluster with a non-zero constant coefficient (see (20) in Theorem 3.3). Hence, if $\hat A_k = 0$, the corresponding covariates are not significant and should be removed from functional-coefficient model (1); and if $\hat B_k = 0$, the functional coefficient has a constant value, which can then be consistently estimated. Implementation of the proposed methods in Sections 2.1-2.3 is summarised in the following flowchart.
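The SCAD derivative that supplies the weights in the penalty can be coded directly (with $a_* = 3.7$ as recommended):

```python
import numpy as np

def scad_derivative(z, lam, a_star=3.7):
    """Derivative p'_lam(z) of the SCAD penalty (Fan and Li 2001), for z >= 0:
    equal to lam on [0, lam], decays linearly to zero on (lam, a_star*lam],
    and is zero beyond a_star*lam."""
    z = np.asarray(z, dtype=float)
    tail = np.maximum(a_star * lam - z, 0.0) / ((a_star - 1.0) * lam)
    return lam * np.where(z <= lam, 1.0, tail)
```

Because the derivative vanishes for large arguments, coefficients (or deviations) that are clearly non-zero receive an asymptotically negligible penalty, which is the key to the sparsity result in Theorem 3.3.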

Asymptotic theorems
In this section, we give the asymptotic theorems for the proposed clustering and semiparametric penalised methods. We start with some regularity conditions, some of which might be weakened at the expense of more lengthy proofs.

Assumption 3.4: (i) Let the bandwidth h and the dimension p satisfy
Remark 3.1: Assumptions 3.1-3.3 are commonly used conditions in the kernel estimation of functional-coefficient models. The strong moment condition on $\varepsilon_t$ and $X_t$ in Assumption 3.3(ii) is required when applying the uniform asymptotics of some kernel-based quantities. The independence condition between $\varepsilon_t$ and $(U_t, X_t)$ seems restrictive, but may be replaced by the heteroscedastic error structure $\varepsilon_t = \sigma(U_t, X_t)\,\eta_t$, where $\eta_t$ is independent of $(U_t, X_t)$ and $\sigma^2(\cdot, \cdot)$ is a conditional volatility function. By slightly modifying our proofs, the asymptotic properties continue to hold under this relaxed error condition. Assumption 3.4(i) restricts the divergence rate of the regressor dimension and the convergence rate of the bandwidth. In particular, if $\iota_1$ is sufficiently large (i.e. the moment conditions in Assumption 3.3(ii) become stronger), the condition $n^{2\iota_2 - 1} h \to \infty$ could be close to the conventional condition $nh \to \infty$. Assumption 3.4(ii) indicates that the difference between two functional coefficients (in different clusters) may converge to zero at a certain polynomial rate. In particular, when $p$ is fixed, $h = c_h n^{-1/5}$ with $0 < c_h < \infty$, and $\delta_n = n^{-\delta_0}$ with $0 \le \delta_0 < 2/5$, Assumption 3.4(ii) would be automatically satisfied. On the other hand, letting $h = c_h n^{-1/5}$ and $\delta_n = n^{-1/5} (\log n)^{1/4}$, it follows from Assumption 3.4(i)(ii) that $p = o\big(\min\{n^{2/5} (\log n)^{-1/2},\ n^{4/5} \delta_n^2 (\log n)^{-1}\}\big) = o\big(n^{2/5} (\log n)^{-1/2}\big)$, indicating that the dimension $p$ may diverge to infinity at a polynomial rate of $n$.
Theorem 3.1: Suppose that Assumptions 3.1-3.4 are satisfied and $K_0$ is known a priori. Then, when the sample size $n$ is sufficiently large, we have $$P\big(\{\tilde{\mathcal{C}}_1, \ldots, \tilde{\mathcal{C}}_{K_0}\} = \{\mathcal{C}^0_1, \ldots, \mathcal{C}^0_{K_0}\}\big) \to 1, \qquad (12)$$ where $\tilde{\mathcal{C}}_k$ is defined in Section 2.1 and $\mathcal{C}^0_k$ is defined in (2).

Remark 3.2: The above theorem shows the consistency of the agglomerative hierarchical clustering method proposed in Section 2.1 when the number of clusters is known a priori, i.e. with probability approaching one, the $K_0$ clusters can be correctly specified. It is similar to Theorem 3.1 in Vogt and Linton (2017), which gives the consistency of the classification of univariate nonparametric functions in the longitudinal data setting using a nonparametric segmentation method.
We next derive the consistency of the information criterion for estimating the number of clusters, which is usually unknown in practice. Some further notation and assumptions are needed. Define the merged-cluster covariate process $X_{|K}(u)$ for $1 \le K \le K_0 - 1$; similarly, we can define $X_{|K}(u)$ when $K > K_0$ and there are further splits of at least one of $\mathcal{C}^0_k$, $k = 1, \ldots, K_0$. Define the event $$\mathcal{C}_n(K_0) = \big\{\tilde{\mathcal{C}}_k = \mathcal{C}^0_k \ \text{for} \ k = 1, \ldots, K_0\big\}.$$ From (12) in Theorem 3.1, we have $P(\mathcal{C}_n(K_0)) \to 1$ as $n \to \infty$. Conditional on the event $\mathcal{C}_n(K_0)$, when the number of clusters $K$ is smaller than $K_0$, two or more of the clusters $\mathcal{C}^0_k$, $k = 1, \ldots, K_0$, are falsely merged, resulting in $K$ clusters denoted by $\mathcal{C}_{1|K}, \ldots, \mathcal{C}_{K|K}$, $1 \le K \le K_0 - 1$. With such a clustering result, the cluster-specific functional coefficients cannot be consistently estimated by the kernel smoothing method, as the model is misspecified. However, we may define the 'quasi' functional coefficients $\alpha_K(\cdot)$ as in (14), and it is easy to find that, conditional on the event $\mathcal{C}_n(K_0)$, the quasi-functional coefficients become the 'genuine' functional coefficients when $K = K_0$. For $K > K_0$, $\alpha_K(\cdot)$ is defined analogously by appending null vectors, where $\mathbf{0}$ is a null vector whose dimension might change from line to line. A natural nonparametric estimate of $\alpha_K(\cdot)$ would be $\hat\alpha_K(\cdot)$ defined in (6) of Section 2.2, where the order of the elements may need to be re-arranged if necessary. Result (16) and some smoothness conditions on $\alpha_K(\cdot)$ ensure the uniform consistency of this quasi-kernel estimation (see the proof of Theorem 3.2 in the supplemental document). Let $\mathcal{A}(K_0)$ be the set of $K_0$-dimensional twice continuously differentiable functions $\alpha(u) = [\alpha_1(u), \ldots, \alpha_{K_0}(u)]^\top$ such that at least two elements of $\alpha(u)$ are identical functions over $u \in [0, 1]$. The following additional assumptions are needed to prove the consistency of the information criterion proposed in Section 2.2.
Assumption 3.5: There exists a positive constant $c_\alpha$ such that (17) holds.

Assumption 3.6: (i) For any $1 \le K \le \bar K$ and given the clustering result, the quasi-functional coefficient $\alpha_K(\cdot)$ has continuous second-order derivatives.

Assumption 3.7: The bandwidth h and the dimension p satisfy ph
Remark 3.3: Assumptions 3.5 and 3.6 are mainly used when deriving the asymptotic lower bound of $\hat\sigma^2_n(K)$, which is involved in the definition of $\mathrm{IC}(K)$, when $K$ is smaller than $K_0$. Restriction (17) in Assumption 3.5 indicates that the $K_0$ functional elements in $\alpha^0(\cdot)$ need to be 'sufficiently' distinct; we may show that the quantity in (17) is positive for any $k_1 \neq k_2$. Assumption 3.6 is required to prove the uniform consistency of the kernel estimation of the quasi-functional coefficients. Assumption 3.7 gives some further restrictions on $h$ and $p$, and indicates that the dimension of the covariates can diverge to infinity at a slow polynomial rate of the sample size $n$. For example, letting $h = n^{-1/4}$ (i.e. under-smoothing in the kernel estimation), $\rho = 1/3$ and $p = n^{\delta_1}$ with $0 \le \delta_1 < 1/8$, we may verify the conditions in Assumption 3.7.
Theorem 3.2 shows that the estimated number of clusters which minimises the IC objective function defined in (7) is consistent.
A combination of (12) and (18) shows that the latent homogeneity structure can be consistently estimated. Without loss of generality, conditional on $\mathcal{C}_n(K_0)$ and $\hat K = K_0$, we assume that $\tilde{\mathcal{C}}_1 = \mathcal{C}^0_1, \ldots, \tilde{\mathcal{C}}_{K_0} = \mathcal{C}^0_{K_0}$; otherwise we only need to re-arrange the order of the estimated clusters. For notational simplicity, we also assume that the clusters are ordered so that $\alpha^0_k(\cdot)$, $k = 1, \ldots, K_* - 1$, genuinely vary with the index variable, while $\alpha^0_k$, $k = K_*, \ldots, K_0$, are constant coefficients (non-zero constant coefficients do not exist when $K_* = K_0$, and all of the functional coefficients are constant when $K_* = 1$). For simplicity, we next assume that all the observations of the index variable, $U_t$, $t = 1, \ldots, n$, lie in the set $\mathcal{U}_h$, to avoid the boundary effect of the kernel estimation; this assumption can be removed if an appropriate truncation technique, such as those discussed in Sections 2.1 and 2.2, is applied to the penalised local linear estimation. Some additional conditions are needed to derive the sparsity property of the penalised estimation proposed in Section 2.3.

Assumption 3.8: For any
n with probability approaching one.
Remark 3.4: Assumption 3.8 is a key condition to prove that $\tilde A_k / \sqrt{n}$ and $\tilde D_k / \sqrt{n}$ are bounded away from zero with probability approaching one, which, together with the definition of the SCAD derivative and $\lambda_1 + \lambda_2 = o(n^{1/2})$ in Assumption 3.9, indicates that when the functional coefficients or their deviations are significant, the influence of the penalty term in (9) can be asymptotically ignored. For the case when $p$ is fixed and $h = c_h n^{-1/5}$ as discussed in Remark 3.1, if we choose $\lambda_1 = \lambda_2 = n^{\delta_*}$ with $0.1 < \delta_* < 0.5$, (19) in Assumption 3.9 would be satisfied. On the other hand, as discussed in Remarks 3.1 and 3.3, the dimension $p$ is allowed to diverge to infinity.

Theorem 3.3: Suppose that Assumptions 3.1-3.9 are satisfied. Then, we have (20) as $n \to \infty$, where $\hat A_k$ and $\hat B_k$ are defined in (10).
The above sparsity result for the penalised local linear estimation shows that the zero coefficient and non-zero constant coefficients in the model can be identified asymptotically.

Practical issues in the estimation procedure
In this section, we first discuss how to choose the bandwidth in the kernel estimation and the tuning parameters in the penalised local least-squares estimation; and then introduce an easy-to-implement computational algorithm for the penalised approach in Section 2.3.

Choice of tuning parameters
The nonparametric kernel-based estimation may be sensitive to the value of the bandwidth $h$; choosing an appropriate bandwidth is therefore an important issue when applying our kernel-based clustering and estimation methods. A commonly used bandwidth selection method is the so-called cross-validation criterion. Specifically, for the preliminary (or pre-clustering) kernel estimation, the objective function for the leave-one-out cross-validation is defined by $$\mathrm{CV}(h) = \frac{1}{n} \sum_{t=1}^n \big[Y_t - X_t^\top \tilde\beta_{-t}(U_t)\big]^2,$$ where $\tilde\beta_{-t}(\cdot)$ is the preliminary kernel estimator of $\beta^0(\cdot)$ in model (1) using the bandwidth $h$ and all observations except the $t$th observation. We then determine the optimal bandwidth $\hat h_{\mathrm{opt}}$ by minimising $\mathrm{CV}(h)$ with respect to $h$. The cross-validation criterion for bandwidth selection in the post-clustering kernel estimation $\hat\alpha(\cdot)$ can be defined in exactly the same way.
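A direct implementation of the leave-one-out criterion over a grid of candidate bandwidths (using a local-constant preliminary estimator with the Epanechnikov kernel, both assumed for illustration):

```python
import numpy as np

def loocv_bandwidth(U, X, Y, grid):
    """Leave-one-out CV for the kernel bandwidth: CV(h) is the mean squared
    prediction error when the t-th observation is removed from each fit."""
    n = len(Y)

    def beta_hat(u0, h, keep):
        w = 0.75 * np.maximum(1.0 - ((U[keep] - u0) / h) ** 2, 0.0)
        Xk = X[keep]
        A = (Xk * w[:, None]).T @ Xk
        b = Xk.T @ (w * Y[keep])
        return np.linalg.solve(A, b)

    cv = {}
    for h in grid:
        err = 0.0
        for t in range(n):
            keep = np.arange(n) != t
            err += (Y[t] - X[t] @ beta_hat(U[t], h, keep)) ** 2
        cv[h] = err / n
    return min(cv, key=cv.get), cv
```

The grid of candidate bandwidths is the user's choice; a fine grid near the rule-of-thumb rate $n^{-1/5}$ is a sensible starting point.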
For the choice of the tuning parameters $\lambda_1$ and $\lambda_2$ in the penalised local least-squares method, we use the generalised information criterion (GIC) proposed by Fan and Tang (2013), which is briefly described as follows. Let $\lambda = (\lambda_1, \lambda_2)^\top$, and let $M_1(\lambda)$ and $M_2(\lambda)$ denote the index sets of nonparametric functional coefficients and non-zero constant coefficients, respectively (after implementing the kernel-based clustering analysis and penalised estimation with the tuning parameter vector $\lambda$). Since Cheng, Zhang, and Chen (2009) suggest that an unknown functional parameter (varying with the index variable) amounts to $m_0 h^{-1}$ unknown constant parameters, with $m_0 = 1.028571$ when the Epanechnikov kernel is used, we take $m_0 h^{-1} |M_1(\lambda)| + |M_2(\lambda)|$ as the effective number of parameters and construct the GIC objective function accordingly, where $\hat\alpha_{k,\lambda}(\cdot)$ and $\hat\alpha_{k,\lambda}$ are the penalised estimates of Section 2.3 computed with the tuning parameter vector $\lambda$, $|M|$ denotes the cardinality of the set $M$, and the bandwidth $h$ can be determined by the leave-one-out cross-validation. The optimal value of $\lambda$ is found by minimising $\mathrm{GIC}(\lambda)$ with respect to $\lambda$.

Computational algorithm for penalised estimation
Let $\tilde X_t = \tilde X_{t,\hat K} = (\tilde X_{t,1|\hat K}, \ldots, \tilde X_{t,\hat K|\hat K})^\top$. We next introduce an iterative procedure to compute the penalised local least-squares estimates of the functional coefficients proposed in Section 2.3 (cf. Li et al. 2015). It can be viewed as a nonparametric extension of the coordinate descent algorithm, a commonly used optimisation algorithm that finds the minimum of a function by successively minimising along the coordinate directions.
(1) Find initial estimates of $A_k$ and $B_k$, denoted by $\hat A^{(0)}_k$ and $\hat B^{(0)}_k$, respectively. These initial estimates can be obtained by using the conventional (non-penalised) local linear estimation method.
(2) Let $\hat A^{(j)}_k$ and $\hat B^{(j)}_k$ be the estimates after the $j$th iteration. Update the functional coefficients one at a time, starting from $l = 1$: minimise the penalised objective function in (9) with respect to $A_l$, holding all the other components fixed at their current values.
(3) Update the derivatives of the functional coefficients in the same manner, one at a time, starting from $l = 1$.
(4) Repeat Steps 2 and 3 until convergence of the estimates is achieved.
Our numerical studies in Sections 5 and 6 show that the above iterative procedure has reasonably good finite-sample performance.
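As a parametric analogue of Steps 2-4 above, coordinate descent for a LASSO-penalised least-squares problem (a toy example illustrating the successive coordinate-wise minimisation, not the paper's objective (9)) can be written as:

```python
import numpy as np

def soft_threshold(z, gamma):
    """Closed-form minimiser of a one-dimensional quadratic plus L1 term."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1:
    cycle through the coordinates, minimise each with the rest held
    fixed, and repeat until the estimates stabilise."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for l in range(p):
            r = y - X @ b + X[:, l] * b[l]  # partial residual excluding column l
            b[l] = soft_threshold(X[:, l] @ r / n, lam) / col_ss[l]
    return b
```

Each coordinate update has a closed form here; in the nonparametric extension of Section 2.3 the one-coordinate subproblem is a penalised local linear fit instead, but the outer loop structure is the same.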

Monte-Carlo simulation
In this section, we conduct Monte-Carlo simulation studies to evaluate the finite-sample performance of the proposed methods.
The sample size $n$ is set to be 200, 400 or 600, and the number of replications is $N = 500$. We first use the kernel method to obtain preliminary nonparametric estimates of the functional coefficients $\beta^0_j(\cdot)$, $j = 1, \ldots, p$, with the Epanechnikov kernel $K(z) = \frac{3}{4}(1 - z^2)_+$ and the optimal bandwidth selected by the cross-validation method in Section 4.1. The homogeneity and semi-varying coefficient structure in model (21) is ignored in this preliminary estimation. A combination of the kernel-based clustering method in Section 2.1 and the generalised information criterion in Section 2.2 is then used to estimate the homogeneity structure. In order to evaluate the clustering performance, we consider two commonly used measures: Normalised Mutual Information (NMI) and Purity, both of which examine how close the estimated set of clusters is to the true set of clusters. Letting $\mathcal{C}^1 = \{\mathcal{C}^1_1, \ldots, \mathcal{C}^1_{K_1}\}$ and $\mathcal{C}^2 = \{\mathcal{C}^2_1, \ldots, \mathcal{C}^2_{K_2}\}$ be two sets of disjoint clusters of $(1, 2, \ldots, p)$, the NMI between $\mathcal{C}^1$ and $\mathcal{C}^2$ is defined as $$\mathrm{NMI}(\mathcal{C}^1, \mathcal{C}^2) = \frac{I(\mathcal{C}^1, \mathcal{C}^2)}{[H(\mathcal{C}^1) + H(\mathcal{C}^2)]/2},$$ where $H(\mathcal{C}^i) = -\sum_{k=1}^{K_i} \frac{|\mathcal{C}^i_k|}{p} \log \frac{|\mathcal{C}^i_k|}{p}$ is the entropy of $\mathcal{C}^i$, $i = 1, 2$, and $I(\mathcal{C}^1, \mathcal{C}^2)$ is the mutual information between $\mathcal{C}^1$ and $\mathcal{C}^2$ defined as $$I(\mathcal{C}^1, \mathcal{C}^2) = \sum_{k=1}^{K_1} \sum_{l=1}^{K_2} \frac{|\mathcal{C}^1_k \cap \mathcal{C}^2_l|}{p} \log \frac{p\, |\mathcal{C}^1_k \cap \mathcal{C}^2_l|}{|\mathcal{C}^1_k|\, |\mathcal{C}^2_l|}.$$ The NMI measure takes values between 0 and 1, with a larger value indicating that the two sets of clusters are closer. The Purity measure is defined by $$\mathrm{Purity}(\mathcal{C}^1, \mathcal{C}^2) = \frac{1}{p} \sum_{k=1}^{K_1} \max_{1 \le l \le K_2} |\mathcal{C}^1_k \cap \mathcal{C}^2_l|.$$ It is easy to find that the Purity measure also takes values between 0 and 1, and if $\mathcal{C}^1$ and $\mathcal{C}^2$ are equal, then $\mathrm{Purity}(\mathcal{C}^1, \mathcal{C}^2) = 1$. However, the Purity measure does not trade off the quality of the clustering against the number of clusters; for example, a Purity value of 1 is achieved if one set contains only singleton clusters. The NMI, by contrast, allows for this trade-off. We finally identify the clusters with zero coefficients and non-zero constant coefficients using the penalised method introduced in Section 2.3. The tuning parameters in the penalty terms are chosen by the GIC detailed in Section 4.1.
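The two clustering measures compute directly from the cluster memberships (here the NMI is normalised by the average entropy, one common convention; the paper's exact normalisation may differ):

```python
import numpy as np

def purity(C1, C2, p):
    """Purity: for each cluster of C1, take its largest overlap with a
    cluster of C2, sum, and divide by p."""
    return sum(max(len(set(a) & set(b)) for b in C2) for a in C1) / p

def nmi(C1, C2, p):
    """Normalised mutual information between two partitions of p items,
    normalised here by the average of the two entropies."""
    def H(C):
        q = np.array([len(c) / p for c in C])
        return -np.sum(q * np.log(q))
    I = 0.0
    for a in C1:
        for b in C2:
            o = len(set(a) & set(b))
            if o > 0:
                I += (o / p) * np.log(p * o / (len(a) * len(b)))
    denom = 0.5 * (H(C1) + H(C2))
    return I / denom if denom > 0 else 1.0
```

Comparing a partition against all-singleton clusters illustrates the point made above: Purity is still 1, while NMI drops below 1.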
In order to measure the accuracy of the estimates of the coefficients $\beta^0_j(\cdot)$, $1 \le j \le p$, we compute the Mean Absolute Estimation Error (MAEE), which, for the preliminary (pre-clustering) kernel estimates $\tilde\beta_j(\cdot)$, $1 \le j \le p$, is defined as $$\mathrm{MAEE} = \frac{1}{np} \sum_{j=1}^{p} \sum_{t=1}^{n} \big|\tilde\beta_j(U_t) - \beta^0_j(U_t)\big|,$$ and analogously for the post-clustering kernel estimates, with $\hat\beta_j(\cdot) = \hat\alpha_k(\cdot)$ if $j \in \tilde{\mathcal{C}}_{k|\hat K}$, $1 \le k \le \hat K$, where $\hat\alpha_k(\cdot)$, $1 \le k \le \hat K$, are the post-clustering kernel estimates of the cluster-specific functional coefficients defined in (6). Let $\check\beta_j(\cdot) = \check\alpha_k(\cdot)$ if $j \in \tilde{\mathcal{C}}_{k|\hat K}$, $1 \le k \le \hat K$, where $\check\alpha_k(\cdot)$, $1 \le k \le \hat K$, are the penalised estimates of the cluster-specific functional coefficients obtained by minimising (9); the MAEE of the penalised estimates is defined in the same way. The main purpose of considering the MAEE of the post-clustering kernel and penalised estimates for $\beta^0_j(\cdot)$, $1 \le j \le p$, rather than for $\alpha^0_k(\cdot)$, $1 \le k \le K_0$, is to avoid having to order the estimated clusters and match each of them to one of the true clusters (as there is no natural way to do this).
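Under this definition (averaging the absolute errors over the observation points and the coefficients, an assumption about the exact evaluation points used), the MAEE computes as:

```python
import numpy as np

def maee(beta_hat, beta_true, U):
    """(1/(n p)) * sum_j sum_t |beta_hat_j(U_t) - beta0_j(U_t)|,
    for lists of estimated and true coefficient functions."""
    errs = [np.mean(np.abs(bh(U) - b0(U))) for bh, b0 in zip(beta_hat, beta_true)]
    return float(np.mean(errs))
```

A constant vertical shift of 0.1 in every estimated curve, for example, yields an MAEE of exactly 0.1.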
Tables 1-3 give the simulation results for the case where the dimension of $X_t$ is 20. Table 1 presents the frequency (over 500 replications) at which a number between 1 and 10 is selected as the number of clusters by the information criterion detailed in Section 2.2. Table 2 gives the average values and standard deviations (in parentheses) of the NMI and Purity measurements over 500 replications. Table 3 reports the average MAEEs and standard deviations (in parentheses) over 500 replications for the pre-clustering kernel estimation, the post-clustering kernel estimation and the semiparametric penalised estimation. From Table 1, we can see that when δ = 0.4 and the covariates are uncorrelated, the number of clusters can be correctly estimated in about 80% of the replications even when n = 200, and when δ increases to 0.8, this percentage increases to almost 98%. As the sample size increases to 400, the information criterion selects the correct number of clusters in almost all replications. When the correlation coefficient between the covariates is 0.25, the number of clusters is correctly estimated in only 34% of replications when n = 200 and δ = 0.4, and in over 70% of replications when δ = 0.8. As the sample size increases to 400, this percentage rises to over 98%. However, when δ = 0.2, the distances between different coefficient functions become smaller and the number of clusters is often underestimated as 3 or 4, even when the covariates are uncorrelated. When the covariates are correlated, this underestimation becomes worse. In all of the specifications, the estimated number of clusters rarely goes below 3 or above 7. Table 2 shows that when there is no correlation among the covariates and the different coefficient functions are moderately distanced (i.e. δ = 0.4 or 0.8), the NMI and Purity values are close to 1 even when the sample size is as small as 200.
The increase of the covariate correlation coefficient to 0.25, or the decrease of δ to 0.2, causes the clustering to become less accurate. Finally, the results in Table 3 show that, after identifying the homogeneity and semi-varying coefficient structure, the average MAEE values of the semiparametric penalised estimation are smaller than those of the post-clustering kernel estimation, which in turn are much smaller than those of the pre-clustering kernel estimation. In addition, all three estimation methods improve (with decreasing average MAEE values) as the sample size increases, and their performance becomes slightly worse when the correlation between the covariates increases to 0.25. Tables 4-6 give the results for p = 60. Comparing these results with those for p = 20, we can see that as the dimension of the covariates increases, the estimation becomes poorer. However, the overall pattern as δ, the covariate correlation, or n changes is similar: as δ increases, the estimation becomes more accurate due to the clusters being further distanced from each other; as the correlation increases, the results become poorer; and as n increases, the results improve.
Tables 7 and 8 report the results for the estimation of the homogeneity structure, and Table 9 reports the average MAEEs and standard deviations (in parentheses) over 500 replications for the pre-clustering kernel estimation, the post-clustering kernel estimation and the penalised estimation. Comparing the results in Table 7 with those in Table 1, we find that when δ = 0.2, the number of clusters is more likely to be underestimated in Example 5.2, where the cluster sizes are unequal. However, as δ increases, the results for the two examples become increasingly comparable. The NMI and Purity values in Table 8 are similar to those in Table 2, while the MAEE values in Table 9 are smaller than those in Table 3. The latter is mainly because more of the coefficient functions (17 out of 20) are constant in Example 5.2.
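The clustering step itself can be sketched as follows, assuming the preliminary kernel estimates of the coefficient functions are evaluated on a common grid. The approximate L2 distance and average linkage below are illustrative choices, not necessarily the exact ones used in the paper:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# beta_hat: p x m matrix; row j holds the preliminary kernel estimate of the
# jth coefficient function evaluated on a grid of m points in [0, 1]
rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 50)
beta_hat = np.vstack(
    [np.sin(2 * np.pi * grid) + 0.05 * rng.standard_normal(50) for _ in range(3)]
    + [0.05 * rng.standard_normal(50) for _ in range(3)]
)

# pairwise approximate L2 distances between the estimated curves
d = pdist(beta_hat) / np.sqrt(len(grid))
tree = linkage(d, method="average")

# cut the dendrogram into K clusters; in practice K is chosen by the
# generalised information criterion rather than fixed in advance
membership = fcluster(tree, t=2, criterion="maxclust")
# the three noisy sine curves and the three near-zero curves form two clusters
```

In the paper's procedure the cut level K is selected by the information criterion of Section 2.2; here it is fixed at 2 purely for illustration.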

Empirical applications
In this section, we apply the developed model and methodology to two real data sets: the Boston house price data and the plasma beta-carotene level data. These two data sets have been extensively analysed in existing studies, where functional-coefficient models are usually recommended. However, it is not clear whether a homogeneity structure exists among the functional coefficients. This motivates us to further examine the modelling structure of these two data sets via the kernel-based clustering method and the penalised approach introduced in Section 2.
Example 6.1: We first apply the developed model and methodology to the well-known Boston house price data. This data set has been previously analysed in many studies (cf. Fan and Huang 2005; Cai and Xu 2008; Wang and Xia 2009; Leng 2010). The response and explanatory variables are transformed by the Z-score transformation before being fitted: i.e. for any variable x t to be transformed, its Z-score is z t = (x t − x̄)/s(x) (23), where x̄ and s(x) are the sample mean and sample standard deviation of x t . Furthermore, as shown in the left panel of Figure 2, the index variable, LSTAT, exhibits strong skewness. Hence, we first take the square-root transformation of this variable to alleviate skewness and then apply the min-max normalisation U t → (U t − min(U))/(max(U) − min(U)) (24), where min(U) and max(U) denote the minimum and maximum of the observations of U, respectively. After the min-max normalisation, the support of U t becomes [0, 1], consistent with the assumption made on the index variable in the asymptotic theory. A histogram of this transformed variable is shown in the right panel of Figure 2. Figure 3 plots the pre-clustering kernel estimates of the functional coefficients with the optimal bandwidth selected via the leave-one-out cross-validation method. The kernel-based clustering method and the generalised information criterion identify six clusters. The membership of these clusters and the characteristics of their functional coefficients are summarised in Table 10. DIS and TAX are found, by the penalised method, to have constant and similar negative effects on the response, while the variables CHAS, ZN and B are found to be insignificant. All the other explanatory variables have varying effects on the response as the value of LSTAT changes. Plots of the post-clustering kernel estimates of the functional coefficients and their penalised local linear estimates are shown in Figures 4 and 5, where, for each k = 1, . . . , 6, α k (·) denotes the functional coefficient corresponding to the kth cluster listed in Table 10. The optimal tuning parameters in the penalised method are chosen by the GIC as λ 1 = 10 and λ 2 = 2.3.
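The variable transformations used above (the Z-score in (23) and the square-root plus min-max normalisation in (24)) can be sketched as follows; the function names are ours:

```python
import numpy as np

def z_score(x):
    """Z-score transformation (23): centre and scale by the sample std."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def index_transform(u):
    """Square-root transform to reduce skewness, then min-max (24) onto [0, 1]."""
    u = np.sqrt(np.asarray(u, dtype=float))
    return (u - u.min()) / (u.max() - u.min())

# toy illustration with made-up LSTAT-like values
lstat = np.array([4.0, 9.0, 16.0, 25.0])
v = index_transform(lstat)   # sqrt gives [2, 3, 4, 5]; min-max gives [0, 1/3, 2/3, 1]
```

After `index_transform`, the support of the index variable is exactly [0, 1], matching the assumption used in the asymptotic theory.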
We next compare the out-of-sample predictive performance of the pre-clustering (preliminary) kernel method, the post-clustering kernel method and the proposed penalised method. We randomly split the full sample into a training set of size 400 and a testing set of size 106, and repeat this 200 times to reduce the randomness in the results. When calculating out-of-sample predictions for the post-clustering and penalised methods, we use the homogeneity structure (i.e. the clusters and their membership) estimated from the full sample but estimate the values of the functional coefficients (evaluated at the LSTAT values belonging to the testing set) or the constant coefficients from the training sets. The predictive performance is measured by the Mean Absolute Prediction Error (MAPE), defined by MAPE = (1/n) Σ t |Y t − Ŷ t | (25), where n is the size of the testing set (106 in this example), Y t is a true value of the response variable in the testing sample, and Ŷ t is the predicted value of Y t using the model estimated from the training sample. The average MAPE values show that the simplified functional-coefficient models obtained from the developed kernel-based clustering and penalised methods provide more accurate out-of-sample predictions.
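The repeated random-splitting comparison based on the MAPE in (25) can be sketched as follows. The toy data-generating process and the OLS fitter are stand-ins for the actual functional-coefficient estimators; the 400/106 split matches the text:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute prediction error (25) over the testing set."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def repeated_split_mape(X, y, fit, n_train, n_rep=200, seed=0):
    """Average MAPE over n_rep random training/testing splits."""
    rng = np.random.default_rng(seed)
    n, errs = len(y), []
    for _ in range(n_rep):
        idx = rng.permutation(n)
        tr, te = idx[:n_train], idx[n_train:]
        model = fit(X[tr], y[tr])              # estimate on the training set
        errs.append(mape(y[te], model(X[te]))) # predict on the testing set
    return float(np.mean(errs))

# toy illustration: a linear model with 506 observations, as in the Boston data
rng = np.random.default_rng(1)
X = rng.standard_normal((506, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(506)
fit_ols = lambda Xtr, ytr: (
    lambda Xte, b=np.linalg.lstsq(Xtr, ytr, rcond=None)[0]: Xte @ b
)
avg = repeated_split_mape(X, y, fit_ols, n_train=400, n_rep=20)
```

Averaging over many random splits, rather than using a single split, is what keeps the comparison between the three methods stable.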
Example 6.2: In this example, we use the proposed methods to analyse the plasma beta-carotene level data, which have been previously studied by Nierenberg, Stukel, Baron, Dain, and Greenberg (1989), Wang and Li (2009) and Kai et al. (2011). The data were collected from 315 patients and are downloadable from the StatLib database http://lib.stat.cmu.edu/datasets/Plasma_Retinol. The primary interest is to investigate the relationship between personal characteristics and dietary factors, and plasma concentrations of beta-carotene. The response variable is chosen as BETAPLASMA (plasma beta-carotene level, ng/ml) and the candidate explanatory variables include INT (the intercept), AGE (years), QUETELET (Quetelet index, weight/height²), CALORIES (number of calories consumed per day), FAT (grams of fat consumed per day), FIBRE (grams of fibre consumed per day), ALCOHOL (number of alcoholic drinks consumed per week) and CHOLESTEROL (cholesterol consumed per day). The data set also contains categorical variables: SEX (1 = male, 2 = female), SMOKSTAT (smoking status, 1 = never, 2 = former, 3 = current smoker) and VITUSE (vitamin use, 1 = yes, fairly often, 2 = yes, not often, 3 = no). We convert these into dummy variables: FEMALE (= 1 if SEX = 2, 0 otherwise), NONSMOKER (= 1 if SMOKSTAT = 1, 0 otherwise), FORMERSMOKER (= 1 if SMOKSTAT = 2, 0 otherwise), FREQVITUSE (= 1 if VITUSE = 1, 0 otherwise) and OCCAVITUSE (= 1 if VITUSE = 2, 0 otherwise), and also include them as explanatory variables. As in Kai et al. (2011), the index variable is chosen as BETADIET (dietary beta-carotene consumed, mcg per day). We again transform the response and explanatory variables (except the intercept, INT) by the Z-score method defined in (23). As can be seen from the left panel of Figure 6, the index variable BETADIET also exhibits high skewness, so we first transform it by the square-root operator and then the min-max operator in (24).
Histograms of the original and transformed BETADIET data are given in Figure 6.
We again consider a functional-coefficient model. In the preliminary kernel estimation, the Epanechnikov kernel K(z) = (3/4)(1 − z 2 )I(|z| ≤ 1) is used and the optimal bandwidth is determined via the cross-validation method in Section 4.1. We combine the kernel-based clustering method and penalised local linear estimation (with the tuning parameters λ 1 = 6.5 and λ 2 = 3 chosen by the GIC method) to explore the homogeneity structure among the functional coefficients. Three distinct clusters are identified. The membership of each cluster and the characteristics of the corresponding coefficient functions are summarised in Table 12. The pre-clustering estimates of all functional coefficients and the post-clustering and penalised estimates of the cluster-specific functional coefficients are plotted in Figures 7-9.
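A minimal local linear estimator of the functional coefficients with the Epanechnikov kernel, of the kind used in the preliminary estimation step, might look as follows. This is a textbook sketch without the clustering or penalisation steps, and the simulated model is ours:

```python
import numpy as np

def epanechnikov(z):
    """K(z) = (3/4)(1 - z^2) on |z| <= 1, zero elsewhere."""
    return 0.75 * np.maximum(1.0 - z**2, 0.0)

def local_linear_beta(u0, U, X, Y, h):
    """Local linear estimate of beta(u0) in Y_t = X_t' beta(U_t) + eps_t.

    Regress Y on [X, X*(U - u0)] with kernel weights K((U - u0)/h);
    the first p coefficients estimate beta(u0)."""
    w = epanechnikov((U - u0) / h)
    Z = np.hstack([X, X * (U - u0)[:, None]])   # design with local slope terms
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(sw[:, None] * Z, sw * Y, rcond=None)
    return coef[:X.shape[1]]

# sanity check on a simulated model: beta1(u) = sin(2*pi*u), beta2(u) = 1
rng = np.random.default_rng(2)
n = 2000
U = rng.uniform(0, 1, n)
X = rng.standard_normal((n, 2))
Y = X[:, 0] * np.sin(2 * np.pi * U) + X[:, 1] + 0.1 * rng.standard_normal(n)
b = local_linear_beta(0.5, U, X, Y, h=0.1)      # true value: (0, 1) at u0 = 0.5
```

In practice the estimate is computed on a grid of u0 values, with the bandwidth h chosen by cross-validation as described in Section 4.1.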
The kernel clustering and shrinkage estimation results show that FIBRE, NONSMOKER, FORMERSMOKER and FREQVITUSE form a cluster and their effects on the response variable, the beta-carotene level, are positive, which implies that higher fibre intake, not smoking and frequent vitamin use help to increase beta-carotene levels.
Figure 8. Post-clustering estimates of the functional coefficients in Example 6.2 with α k (·), for each k = 1, 2, 3, being the estimated functional coefficient corresponding to the kth cluster listed in Table 12.
Figure 9. Penalised estimates of the functional coefficients in Example 6.2 with α k (·), for each k = 1, 2, 3, being the estimated functional coefficient corresponding to the kth cluster listed in Table 12.
The variables INT (intercept), AGE, CALORIES, ALCOHOL, CHOLESTEROL, FEMALE and OCCAVITUSE are found to be insignificant, while QUETELET and FAT are found to have negative effects on beta-carotene levels. As in Example 6.1, we further compare the out-of-sample predictive performance of the preliminary kernel, post-clustering kernel and penalised methods. We randomly divide the full sample (315 observations) into a training set of size 250 and a testing set of size 65, repeat the random sample splitting 200 times and compute the average MAPE values. The predictions are calculated in the same way as in Example 6.1. The range of bandwidth values considered is between 0.20 and 0.32 with an increment of 0.02. The results are reported in Table 13. From the table, we find that the penalised and post-clustering kernel methods provide more accurate out-of-sample prediction, in terms of the MAPE defined in (25), than the preliminary kernel method, with the penalised method slightly outperforming the post-clustering kernel method when the bandwidth is smaller.

Conclusion
In this paper, we have developed a kernel-based hierarchical clustering method and a generalised information criterion to uncover the latent homogeneity structure in functional-coefficient models. Furthermore, a penalised local linear estimation approach is used to separate out the cluster of zero coefficients, the non-zero constant-coefficient clusters and the functional-coefficient clusters. The asymptotic theory in Section 3 shows that the estimation of the true number of clusters and the true set of clusters is consistent in the large-sample case. In the simulation study, we find that the proposed estimation methodology outperforms direct nonparametric kernel estimation, which ignores the latent structure in the model. In the empirical applications to the Boston house price data and the plasma beta-carotene level data, we show that the nonparametric functional-coefficient model can be substantially simplified, with reduced numbers of unknown parametric and nonparametric components. As a result, the out-of-sample mean absolute prediction errors using the developed approach are significantly smaller than those using the naive kernel method which ignores the latent homogeneity structure among the functional coefficients.