Variable selection in finite mixture of median regression models using skew-normal distribution

A regression model with skew-normal errors provides a useful extension of traditional normal regression models when the data involve asymmetric outcomes. Moreover, data that arise from a heterogeneous population can be efficiently analysed by a finite mixture of regression models. These observations motivate us to propose a novel finite mixture of median regression model based on a mixture of skew-normal distributions to explore asymmetrical data from several subpopulations. With an appropriate choice of the tuning parameters, we establish the theoretical properties of the proposed procedure, including the consistency of the variable selection method and the oracle property of the estimators. A productive nonparametric clustering method is applied to select the number of components, and an efficient EM algorithm for numerical computation is developed. Simulation studies and a real data set are used to illustrate the performance of the proposed methodologies.


Introduction
When the data involve asymmetrical outcomes, inference under a linear regression model with skewed random errors can be viewed as an alternative to classical regression models with symmetric errors, since the use of a skewed distribution for the errors can reduce the influence of outliers and thus make the statistical analysis more robust. Specifically, suppose that a response variable Y given a set of predictors x takes the form

Y = x′β + ε,   (1)

where β represents a vector of unknown regression coefficients and the error term ε given x follows an unknown distribution with probability density function (pdf) g(ε | x). It is known that if g(ε | x) is symmetric about 0, the estimation of β in (1) coincides with the coefficients obtained by conventional mean linear regression. However, if g(ε | x) is skewed, median regression provides a more reliable statistical analysis with adaptive robustness to outliers, since the median of a distribution is less susceptible to outliers, especially when the data involve asymmetrical outcomes. We refer interested readers to Kottas and Gelfand (2001), Zhou and Liu (2016) and Hu et al. (2019) for relevant research on median regression of population distributions.
It is worth mentioning that median regression has been widely used for studying the relationship between a response variable Y and a set of predictors x under symmetric distributions, whereas such a median regression may not be suitable for analysing data exhibiting asymmetrical behaviour or data arising from a heterogeneous population. To tackle this difficulty, mixtures of regression models (known as switching regression models in econometrics), initially introduced by Goldfeld and Quandt (1973), may be employed as a flexible tool for studying skewed data from two or more subpopulations. Since then, finite mixture of regression (FMR) models have been widely used in a variety of fields including but not limited to biology, medicine, economics, environmental science, sampling surveys and engineering. The book by McLachlan and Peel (2004) contains a comprehensive review of FMR models. An FMR model is obtained when a response variable with a finite mixture distribution depends on a set of covariates, and FMR models have been discussed extensively under the assumption of normality for the regression error in each component.
However, it has been shown that the commonly used normal mixture model tends to overfit, since additional components are usually needed to capture the skewness of the data. To overcome the potential inappropriateness of normal mixtures in some contexts, we may consider the use of skew-normal distributions (Azzalini, 1985) as component densities for the errors; see, for example, Wu et al. (2013), Wu (2014), Tang and Tang (2015), and H. Li et al. (2016, 2017), to name just a few. These observations motivate us to develop a novel finite mixture of median regression (FMMeR) model based on a mixture of skew-normal distributions to explore asymmetrical data that arise from several subpopulations. There exist two barriers to the development of the FMMeR model. The first is the computational aspect of parameter estimation when fitting the FMMeR model with skew-normal errors. We tackle this barrier by utilizing the stochastic representation and hierarchical representation (see, for example, Liu & Lin, 2014) of skew-normal mixtures. The second technical barrier is determining the number of components of the FMMeR model under consideration. Popularly, the log-likelihood maximum and two information-based criteria, AIC (Akaike, 1973) and BIC (Schwarz, 1978), can be used to select the number of components. Although some success has been achieved with these model choice criteria, choosing the right number of components for a mixture model is known to be difficult. Thus, we consider a clustering procedure to determine the number of components, which has been shown to be very effective in a real-data example; it is introduced in Subsection 5.3.
To enhance predictability and obtain a concise model, it is reasonable to include only the significant covariates in the model. As a result, variable selection has become increasingly important for FMR models, and a rich literature has accumulated over recent decades. All-subset selection methods, such as AIC and BIC and their modifications, have been widely investigated in the context of FMR models; for instance, P. Wang et al. (1996) studied model selection in a finite mixture of Poisson regression models via AIC and BIC. However, all-subset selection methods for FMR models are computationally intensive. To improve computational efficiency, the least absolute shrinkage and selection operator (LASSO) of Tibshirani (1996) and the smoothly clipped absolute deviation (SCAD) method of Fan and Li (2001) were proposed as new methods for variable selection. The penalized likelihood for FMR models, an extension of penalized least squares methods, was proposed by Khalili and Chen (2007). Recently, Wu et al. (2020) proposed an estimation and variable selection method for mixtures of joint mean and variance models, and variable selection procedures in FMR models using the skew-normal distribution have also been proposed.
The remainder of this paper is organized as follows. In Section 2, we briefly introduce the skew-normal distribution and its median expression. In Section 3, we develop a variable selection method for FMMeR model via the penalized likelihood-based procedure for analysing asymmetrical data from several subpopulations. Section 4 studies asymptotic properties of the resulting estimators. In Section 5, a numerical algorithm, a productive nonparametric clustering method for determining the number of components and a data-adaptive method for choosing tuning parameters are discussed. In Section 6, we carry out simulation studies to investigate the finite sample performance of the proposed methodology. A real-data example is provided in Section 7 for illustrative purposes. Some concluding remarks are given in Section 8. Brief proofs of theorems and some technical derivations are given in Appendices 1 and 2.

Skew-normal distribution
A random variable Y is said to follow a univariate skew-normal distribution with location parameter μ, scale parameter σ ∈ (0, ∞) and skewness parameter λ ∈ ℝ, denoted by Y ∼ SN(μ, σ², λ), if its pdf is given by

f(y | μ, σ², λ) = (2/σ) φ((y − μ)/σ) Φ(λ(y − μ)/σ),   (2)

where φ(·) and Φ(·) denote the pdf and cumulative distribution function (cdf) of the standard normal distribution, respectively. It is worth noting that if λ = 0, the density of Y reduces to the normal density N(μ, σ²), and that the distribution is positively skewed if λ > 0 and negatively skewed if λ < 0. We represent the skew-normal distribution in an incomplete data framework. Specifically, the stochastic representation for a random variable Y ∼ SN(μ, σ², λ) is given by

Y_i = μ + σ[δ(λ)R_i + √(1 − δ²(λ)) V_i],   i = 1, …, n,   (3)

with a sample size of n, where δ(λ) = λ/√(1 + λ²). Here, R_i ∼ TN(0, 1)I{r_i > 0} and V_i ∼ N(0, 1), where R_i and V_i are independent. R ∼ TN(μ, σ²)I{a₁ < r < a₂} denotes a truncated normal distribution with density

f(r) = φ((r − μ)/σ) / {σ[Φ((a₂ − μ)/σ) − Φ((a₁ − μ)/σ)]},   a₁ < r < a₂,

where I{·} represents an indicator function. For notational simplicity, let Y = (Y₁, …, Y_n)′ and R = (R₁, …, R_n)′. Furthermore, the skew-normal distribution can be decomposed into a normal distribution and a truncated normal distribution by the hierarchical representation

Y_i | R_i = r_i ∼ N(μ + σδ(λ)r_i, σ²(1 − δ²(λ))),   R_i ∼ TN(0, 1)I{r_i > 0}.

Azzalini and Capitanio (2013) adopted the moment-generating function to calculate the mean and variance of the skew-normal distribution in (2), which are given by

E(Y) = μ + σμ₀(λ),   Var(Y) = σ²σ₀²(λ),

respectively, where μ₀(λ) = √(2/π) δ(λ) and σ₀²(λ) = 1 − μ₀²(λ). Of particular note, Lin et al. (2007) introduced a simple way of obtaining higher moments of the skew-normal distribution without the use of its moment-generating function. Letting m₀(λ) be the mode of the distribution SN(0, 1, λ), a quite accurate approximation of m₀(λ), evaluated by numerical maximization, is given by

m₀(λ) ≈ μ₀(λ) − γ₁σ₀(λ)/2 − (sign(λ)/2) exp(−2π/|λ|),

where sign(λ) indicates the sign function of λ and γ₁ = [(4 − π)/2] μ₀³(λ)/σ₀³(λ) is the skewness coefficient.
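The density (2) and the stochastic representation (3) can be sketched in a few lines of Python. This is a minimal illustration with function names of our own choosing; it draws from SN(μ, σ², λ) by combining a half-normal and an independent normal variate, and the empirical mean can be checked against E(Y) = μ + σμ₀(λ).

```python
import math
import random

def sn_pdf(y, mu, sigma, lam):
    """Skew-normal density (2): (2/sigma) * phi(z) * Phi(lam * z), z = (y - mu)/sigma."""
    z = (y - mu) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(lam * z / math.sqrt(2.0)))
    return (2.0 / sigma) * phi * Phi

def sn_sample(mu, sigma, lam, rng):
    """Stochastic representation (3): Y = mu + sigma*(delta*R + sqrt(1-delta^2)*V),
    with R half-normal (TN(0,1) truncated to r > 0) and V ~ N(0,1) independent."""
    delta = lam / math.sqrt(1.0 + lam * lam)
    R = abs(rng.gauss(0.0, 1.0))
    V = rng.gauss(0.0, 1.0)
    return mu + sigma * (delta * R + math.sqrt(1.0 - delta * delta) * V)

rng = random.Random(0)
mu, sigma, lam = 1.0, 2.0, 3.0
draws = [sn_sample(mu, sigma, lam, rng) for _ in range(200_000)]
delta = lam / math.sqrt(1.0 + lam * lam)
mu0 = math.sqrt(2.0 / math.pi) * delta   # mu_0(lam)
emp_mean = sum(draws) / len(draws)
# empirical mean should match E(Y) = mu + sigma * mu_0(lam)
```

The same sampler is reused, in spirit, whenever skew-normal component draws are needed below.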
It deserves mentioning that the logarithm of the density of the skew-normal distribution is a concave function and that this property is not altered by a change of location and scale parameters. Thus, m₀(λ) is unique and the mode of the skew-normal distribution in (2) can be re-expressed as Mode(Y) = μ + m₀(λ)σ. When the observations follow a skew-normal distribution, Mean(Y), Mode(Y) and Median(Y) satisfy the approximate quantitative relationship

Median(Y) ≈ [2 Mean(Y) + Mode(Y)]/3 = μ + σ[m₀(λ) + 2μ₀(λ)]/3,   (6)

which facilitates the development of the median regression with skew-normal mixtures discussed below.
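The approximate relation Median(Y) ≈ [2 Mean(Y) + Mode(Y)]/3 can be checked numerically. The sketch below (our own helper names; the mode is found by direct grid maximization rather than the closed-form approximation) compares the approximation with the empirical median of a large skew-normal sample.

```python
import math
import random

def sn_pdf_std(z, lam):
    """Standard skew-normal density SN(0, 1, lam)."""
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(lam * z / math.sqrt(2.0)))
    return 2.0 * phi * Phi

lam = 3.0
delta = lam / math.sqrt(1.0 + lam * lam)
mu0 = math.sqrt(2.0 / math.pi) * delta          # Mean of SN(0,1,lam)
# m_0(lam): mode obtained by numerical maximization on a fine grid
m0 = max((i / 10000.0 for i in range(-20000, 20001)),
         key=lambda z: sn_pdf_std(z, lam))
approx_median = (2.0 * mu0 + m0) / 3.0          # [2*Mean + Mode]/3

rng = random.Random(1)
draws = sorted(
    delta * abs(rng.gauss(0.0, 1.0))
    + math.sqrt(1.0 - delta * delta) * rng.gauss(0.0, 1.0)
    for _ in range(200_001)
)
emp_median = draws[100_000]
# approx_median should sit close to the empirical median
```

For λ = 3 the approximation and the empirical median agree to roughly two decimal places, which is the level of accuracy the relation (6) relies on.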

Median regression for skew-normal mixtures
In this paper, we assume that the response variable Y_i follows a skew-normal distribution with location parameter μ_i, scale parameter σ and skewness parameter λ, denoted by Y_i ∼ SN(μ_i, σ², λ) for i = 1, …, n. A linear median regression model with skew-normal errors can be expressed as

Y_i = x_i′β + ε_i,   i = 1, …, n,

where Median(Y_i | X) = x_i′β = μ_i + [m₀(λ) + 2μ₀(λ)]σ/3, as defined by (6). Here X = (x₁, …, x_n) is a p × n design matrix whose columns x_i = (x_{i1}, …, x_{ip})′ are p-dimensional vectors of predictors, β = (β₁, …, β_p)′ is a p-dimensional vector of unknown regression coefficients, and ε = (ε₁, …, ε_n)′ stands for the n-dimensional vector of random errors, each ε_i following a skew-normal distribution with zero median. We consider the case where the data come from heterogeneous populations. A finite mixture median regression (FMMeR) model with m components of skew-normal distributions is defined through the conditional density

f(y_i | x_i, Θ) = Σ_{j=1}^{m} ν_j SN(y_i; μ_{ij}, σ_j², λ_j),   (8)

where ν = (ν₁, …, ν_m)′ are the mixing proportions, constrained to be non-negative and to sum to unity, β_j = (β_{j1}, …, β_{jp})′ and Θ = (ν₁, …, ν_{m−1}, β₁′, …, β_m′, σ₁, …, σ_m, λ₁, …, λ_m)′. It follows from (6) that

μ_{ij} = x_i′β_j − σ_j[m₀(λ_j) + 2μ₀(λ_j)]/3,   j = 1, …, m,   (9)

which shows that the location in the FMMeR model is altered by a change of the scale and skewness parameters.
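The conditional density (8) with component locations (9) can be sketched directly. In this illustrative implementation (helper names are ours), the standardized mode m₀(λ) is computed by grid search and cached, and x_i′β_j is, by construction, the approximate median of the j-th component.

```python
import math
from functools import lru_cache

def sn_pdf(y, mu, sigma, lam):
    """Skew-normal density SN(y; mu, sigma^2, lam)."""
    z = (y - mu) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(lam * z / math.sqrt(2.0)))
    return (2.0 / sigma) * phi * Phi

@lru_cache(maxsize=None)
def mode_std_sn(lam):
    """Mode m_0(lam) of SN(0, 1, lam), by grid search."""
    return max((i / 1000.0 for i in range(-3000, 3001)),
               key=lambda z: sn_pdf(z, 0.0, 1.0, lam))

def fmmer_density(y, x, nus, betas, sigmas, lams):
    """Conditional density (8): sum_j nu_j * SN(y; mu_ij, sigma_j^2, lam_j),
    with mu_ij = x'beta_j - sigma_j*[m_0(lam_j) + 2*mu_0(lam_j)]/3 as in (9),
    so that x'beta_j is (approximately) the j-th component median."""
    total = 0.0
    for nu, beta, sigma, lam in zip(nus, betas, sigmas, lams):
        mu0 = math.sqrt(2.0 / math.pi) * lam / math.sqrt(1.0 + lam * lam)
        mu = (sum(b * xk for b, xk in zip(beta, x))
              - sigma * (mode_std_sn(lam) + 2.0 * mu0) / 3.0)
        total += nu * sn_pdf(y, mu, sigma, lam)
    return total
```

For any fixed x, the function integrates to one in y, as a mixture density must.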

Identifiability
An important aspect of statistical inference for FMR models is their identifiability. It is well known that mixture models are not identifiable in complete generality. However, in some mixture model settings, it is possible to establish a weaker sense of identifiability. Titterington et al. (1985) concluded that FMR models with continuous component distributions are identifiable in most cases. Otiniano et al. (2015) established the identifiability of finite mixtures of skew-normal distributions and gave a detailed explanation. The cdf of Y is denoted by F_Y, and the skew-normal family can be defined as the set F = {F_Y : Y ∼ SN(μ, σ², λ), μ ∈ ℝ, σ > 0, λ ∈ ℝ}. For two finite mixtures H = Σ_{j=1}^{m} ν_j F_j and H̃ = Σ_{j=1}^{m̃} ν̃_j F̃_j with components in F, identifiability means that the equality H = H̃ implies m = m̃ and that (ν₁, F₁), …, (ν_m, F_m) is a permutation of (ν̃₁, F̃₁), …, (ν̃_{m̃}, F̃_{m̃}). The following theorem, given by Atienza et al. (2006), provides a sufficient condition for the identifiability of finite mixtures of distributions; Ā denotes the accumulation set of a set A.

Theorem 2.1 (Atienza et al., 2006): Let F be a family of distributions. Let M be a linear mapping which transforms any F ∈ F into a real function φ_F with domain S(F), and let S₀(F) = {t ∈ S(F) : φ_F(t) ≠ 0}. Suppose that there exists a total order ≺ on F such that for any F ∈ F there exists a point k(F) ∈ S̄₀(F) verifying lim_{t→k(F)} φ_{F̃}(t)/φ_F(t) = 0 whenever F ≺ F̃. Then the class H of all finite mixture distributions of F is identifiable.

The method for variable selection
Various classical variable selection criteria can be viewed as trade-offs between estimation variance and modelling bias in the penalized likelihood. When x is random, the density f(x) is assumed to be functionally independent of the parameters of the FMMeR model, so variable selection can be based solely on the conditional density specified in (8). Denote by {(x_i, y_i)}_{i=1}^{n} a sample of observations from the FMMeR model in (8). The log-likelihood function of Θ is given by

ℓ(Θ) = Σ_{i=1}^{n} log{Σ_{j=1}^{m} ν_j SN(y_i; μ_{ij}, σ_j², λ_j)}.

A maximum likelihood estimate (MLE) is obtained by maximizing ℓ(Θ). When a component of x is not important, the MLE of its coefficient is often close to, but not strictly equal to, 0; thus this covariate is not excluded from the model. To address this problem, following Khalili and Chen (2007), we define a penalized log-likelihood function

L(Θ) = ℓ(Θ) − p(Θ)   (10)   with penalty   p(Θ) = Σ_{j=1}^{m} ν_j Σ_{t=1}^{p} p_{τ_j}(β_{jt}),

where p_{τ_j}(·) is a given penalty function with tuning parameter τ_j ≥ 0 (j = 1, 2, …, m); the tuning parameters and the penalty functions are not necessarily the same for all parameters. A data-driven criterion for determining the tuning parameters is introduced in Subsection 5.2. By choosing appropriate tuning parameters and maximizing L(Θ) in (10) to obtain the penalized maximum likelihood estimator of Θ, denoted by Θ̂, coefficients in the vicinity of 0 are compressed to 0 and their covariates automatically excluded. Thus, the procedure combines parameter estimation and variable selection and substantially reduces the computational burden. We use the following three penalty functions to illustrate the theory that we develop for the FMMeR model:

LASSO: p_τ(β) = τ|β|;
SCAD: p_τ(β) = τ|β| if |β| ≤ τ; p_τ(β) = −(β² − 2aτ|β| + τ²)/[2(a − 1)] if τ < |β| ≤ aτ; p_τ(β) = (a + 1)τ²/2 if |β| > aτ, for some a > 2;
HARD: p_τ(β) = τ² − (|β| − τ)² I{|β| < τ}.

Following the idea of Fan and Li (2001), we set a = 3.7 for application purposes in this article. The LASSO penalty performs well in numerical computation because of its convexity. The SCAD penalty performs well at selecting important variables. The HARD penalty behaves like SCAD, though less smoothly.
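The three penalty functions are straightforward to implement; the small sketch below encodes them exactly as displayed above (the SCAD branches are continuous at |β| = τ and |β| = aτ, and flat beyond aτ).

```python
def p_lasso(beta, tau):
    """LASSO penalty: tau * |beta|."""
    return tau * abs(beta)

def p_hard(beta, tau):
    """HARD penalty: tau^2 - (|beta| - tau)^2 for |beta| < tau, else tau^2."""
    b = abs(beta)
    return (tau * tau - (b - tau) ** 2) if b < tau else tau * tau

def p_scad(beta, tau, a=3.7):
    """SCAD penalty with a = 3.7 (Fan & Li, 2001): linear up to tau,
    quadratic blend on (tau, a*tau], constant (a+1)*tau^2/2 beyond."""
    b = abs(beta)
    if b <= tau:
        return tau * b
    if b <= a * tau:
        return -(b * b - 2.0 * a * tau * b + tau * tau) / (2.0 * (a - 1.0))
    return (a + 1.0) * tau * tau / 2.0
```

Note that LASSO keeps growing linearly in |β|, while SCAD and HARD level off, which is why the latter two penalize large true coefficients less.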

Asymptotic properties
In this section, we consider the consistency of the variable selection method and the oracle property in estimation. Without loss of generality, write the coefficient vector as β_j = (β_{1j}′, β_{2j}′)′, where β_{1j} and β_{2j} contain the nonzero effects and zero effects of β_j, respectively. Accordingly, we split the parameter as Θ = (Θ₁′, Θ₂′)′ such that Θ₂ contains all zero effects, that is, the β_{2j} in the true model. The vector of true parameters is denoted by Θ₀, and its components are written β⁰_{jt}, where 1 ≤ t ≤ d_j and 1 ≤ j ≤ m. p′_{τ_j}(β⁰_{jt}) and p″_{τ_j}(β⁰_{jt}) denote the first and second derivatives of the penalty function p_{τ_j}(·) evaluated at β⁰_{jt}, respectively. The asymptotic results obtained in this article are based on three conditions on the penalty functions p_{τ_j}(·).
C₀: For all j, p_{τ_j}(0) = 0, and p_{τ_j}(β) is symmetric and non-negative. Furthermore, it is non-decreasing and twice differentiable for all β ∈ (0, ∞), with at most finitely many exceptions.
Condition C 1 is used to explain the asymptotic properties of the estimators of nonzero effects. Conditions C 0 and C 2 are required for sparsity.
Theorem 4.1: Let h_i = (x_i′, y_i)′, i = 1, …, n, be a random sample from a density function f(h, Θ₀) that satisfies the regularity conditions R₁–R₄ in Appendix 1, and assume that the penalty functions p_{τ_j}(·) satisfy conditions C₀ and C₁. Then:
(a) there exists a local maximizer Θ̂ of the penalized log-likelihood function L(Θ) for which ‖Θ̂ − Θ₀‖ = O_p(n^{−1/2}), where ‖·‖ represents the Euclidean norm;
(b) if the penalty functions additionally satisfy condition C₂, then for any Θ̂ such that ‖Θ̂ − Θ₀‖ = O_p(n^{−1/2}), with probability tending to 1, (1) sparsity: Θ̂₂ = 0; and (2) asymptotic normality: √n(Θ̂₁ − Θ₀₁) converges in distribution to N(0, I₁^{−1}(Θ₀₁)), where I₁(Θ₀₁) is the Fisher information computed under the true model with all zero effects removed.
Brief proofs of the theorems are given in Appendix 1. Detailed proofs can be found in the previous literature (see, for example, Fan & Li, 2001; Khalili & Chen, 2007).

Maximization algorithm
In the context of finite mixture models, the maximum likelihood estimator of the mixture distribution is often inconsistent because of the unboundedness of the likelihood function. An alternative is to add a regularization term that prevents the likelihood function from tending to infinity, yielding a consistent maximum penalized likelihood estimator; see, for example, Chen and Tan (2009), Chen (2017), and the recent works of Chen et al. (2020) and He and Chen (2022a, 2022b). McLachlan and Peel (2004) showed that the EM algorithm can compute the maximum likelihood estimate for arbitrary component distributions in finite mixture models. We maximize the regularized log-likelihood function by the EM algorithm. We define latent component-indicators Z_i = (z_{i1}, …, z_{im})′, i = 1, …, n, where the j-th element of Z_i is z_{ij} = 1 if observation i comes from the j-th component and z_{ij} = 0 otherwise. Since an observation cannot simultaneously belong to more than one component, we have Σ_{j=1}^{m} z_{ij} = 1. By assuming the component-indicators Z₁, …, Z_n to be independent, we obtain a multinomial conditional density given the mixing probabilities, denoted Z_i ∼ M(1; ν₁, …, ν_m), which is used in combination with (3) to generate the following hierarchical representation for the skew-normal mixtures:

Y_i | r_i, z_{ij} = 1 ∼ N(μ_{ij} + σ_j δ(λ_j) r_i, σ_j²(1 − δ²(λ_j))),
R_i | z_{ij} = 1 ∼ TN(0, 1)I{r_i > 0},
Z_i ∼ M(1; ν₁, …, ν_m).   (12)

It deserves mentioning that the hierarchical representation of the finite skew-normal mixtures in (12) allows us to address the computational barriers of parameter estimation when fitting the FMMeR model.
Let Y_mis denote the missing data. The complete-data log-likelihood function L_c(Θ) is readily obtained from the hierarchical representation (12). Similar to the approach in Fan and Li (2001), given the current value β^{(0)}, p(Θ) is replaced by the local quadratic approximation

p_{τ_j}(|β_{jt}|) ≈ p_{τ_j}(|β_{jt}^{(0)}|) + ½ {p′_{τ_j}(|β_{jt}^{(0)}|)/|β_{jt}^{(0)}|}(β_{jt}² − β_{jt}^{(0)2}).

This approximation is used in the M-step of the EM algorithm in each iteration, and the complete penalized log-likelihood function corresponding to (10) is obtained accordingly.

• E-step. The E-step computes the conditional expectation of the function L_c(Θ) with respect to z_{ij}. Given the observed data {x_i, y_i}_{i=1}^{n} from the FMMeR model (8), let Θ^{(k)} denote the parameter estimate at the k-th iteration and θ = (β′, σ, λ)′. The surrogate function Q(Θ; Θ^{(k)}) is then constructed from the required conditional expectations, obtained as follows. First, the conditional expectation of the component-indicators is

ω_{ij}^{(k)} = E_{Θ^{(k)}}(z_{ij} | y_i, x_i) = ν_j^{(k)} SN(y_i; μ_{ij}^{(k)}, σ_j^{2(k)}, λ_j^{(k)}) / Σ_{l=1}^{m} ν_l^{(k)} SN(y_i; μ_{il}^{(k)}, σ_l^{2(k)}, λ_l^{(k)});   (16)

the conditional expectations involving the truncated normal variables R_i can then easily be shown to have closed forms.

• M-step. The M-step calculates the parameter vector Θ^{(k+1)} by maximizing Q(Θ; Θ^{(k)}) with respect to Θ. Thus, on the (k + 1)-th iteration of the EM algorithm, the mixing proportions are updated by

ν_j^{(k+1)} = (1/n) Σ_{i=1}^{n} ω_{ij}^{(k)}.   (17)

It is worth noting that modelling of the mixing proportions should be considered in mixture-of-experts regression models; see Yin, Wu, Lu, et al. (2020). To improve the efficiency of selecting the number of components in the real-data analysis of this article, we first apply a clustering method to determine the optimal number of components in Subsection 5.3. By maximizing Q(Θ; Θ^{(k)}) with respect to Θ without ν_j, namely maximizing Q₂(θ; Θ^{(k)}), we compute θ_j^{(k+1)}. To obtain the parameter estimates of the FMMeR model without penalty, we start from an initial value θ^{(0)} and, with k denoting the current iteration, use the Newton–Raphson update

θ^{(k+1)} = θ^{(k)} + [H(θ^{(k)})]^{−1} s(θ^{(k)}),

where s(θ^{(k)}) is referred to as the score function without penalty and H(θ^{(k)}) is the observed information matrix. Detailed derivations can be found in Appendix 2.
We iterate between the E-step and the M-step until the algorithm converges, and the estimators β̂_j are obtained. In order to find the non-significant variables and simplify the FMMeR model, we shrink the coefficients by the penalty function: β_j^{(0)} is taken as the initial value of the iteration and, with k denoting the current iteration, the coefficients are updated analogously with the local quadratic approximation of the penalty included.
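The two EM building blocks above can be sketched compactly. This illustration (helper names are ours) implements the component responsibilities (16) and the mixing proportion update (17); the full E-step would additionally require the conditional moments of the truncated normal variables R_i, which are omitted here.

```python
import math

def sn_pdf(y, mu, sigma, lam):
    """Skew-normal density SN(y; mu, sigma^2, lam)."""
    z = (y - mu) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(lam * z / math.sqrt(2.0)))
    return (2.0 / sigma) * phi * Phi

def e_step_weights(ys, mus, nus, sigmas, lams):
    """omega_ij = nu_j * f_j(y_i) / sum_l nu_l * f_l(y_i), as in (16).
    mus[i][j] is the location mu_ij of component j for observation i."""
    weights = []
    for y, mu_i in zip(ys, mus):
        comp = [nu * sn_pdf(y, mu, s, l)
                for nu, mu, s, l in zip(nus, mu_i, sigmas, lams)]
        tot = sum(comp)
        weights.append([c / tot for c in comp])
    return weights

def update_mixing(weights):
    """M-step update (17): nu_j = (1/n) * sum_i omega_ij."""
    n, m = len(weights), len(weights[0])
    return [sum(w[j] for w in weights) / n for j in range(m)]
```

Each row of responsibilities sums to one, and the updated mixing proportions therefore remain a valid probability vector at every iteration.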

Choice of the tuning parameters
The degree of penalization is controlled by the tuning parameters, which must be chosen when using the method introduced in this article. Various selection criteria, including cross-validation (CV), generalized cross-validation (GCV), the Akaike information criterion (AIC) (Akaike, 1973) and the Bayesian information criterion (BIC) (Schwarz, 1978), are often used for choosing tuning parameters. GCV has a non-negligible overfitting effect in the final model selection. H. Wang et al. (2007) suggested using BIC for the SCAD estimator in linear models and partially linear models and proved the consistency of the selection method; that is, the optimal parameter chosen by BIC identifies the true model with probability tending to 1. Considering the maximizer Θ̃_n of the log-likelihood function (13), we use Θ̃_n to calculate the mixing proportions in (17); the mixing proportions remain fixed throughout the tuning parameter selection process. For a given value of the tuning parameter τ_j, let (β̂_j, σ̂_j, λ̂_j) be the maximum regularized likelihood estimates of the parameters in the j-th component of the FMMeR model, with the remaining elements of Θ fixed at Θ̃_n. Denote by D_j(τ_j) the likelihood-based deviance statistic, evaluated at (β̂_j, σ̂_j, λ̂_j), corresponding to the j-th component of the FMMeR model, where μ̂_{ij} = x_i′β̂_j − (σ̂_j/3)[m₀(λ̂_j) + 2√(2/π) δ(λ̂_j)] and the weights ω_{ij} are given in (16). Then we define

BIC(τ_j) = D_j(τ_j) + log(n_j) N(τ_j),

where N(τ_j) is the number of nonzero elements of the vector β̂_j and n_j = Σ_{i=1}^{n} ω_{ij}. The choice of τ_{jt} should be such that the tuning parameter for a zero coefficient is larger than that for a nonzero coefficient, so that we can simultaneously estimate the larger coefficients with little bias and shrink the small coefficients towards zero. Hence, similar to Wu et al. (2013), we suggest taking τ_{jt} inversely proportional to |β̃_{jt}|, where β̃_{jt} is the MLE of β_{jt}, and β_{jt}, τ_{jt} are the t-th components of β_j and τ_j, respectively. The tuning parameters are then obtained by computing τ̂_j = arg min_{τ_j} BIC(τ_j).
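The selection step reduces to minimizing a deviance-plus-complexity score over a grid of candidate tuning parameters. The sketch below encodes the generic BIC-type form deviance + log(n_j) × (nonzero count); the function names and the toy candidate grid are illustrative, and in practice the deviance and nonzero counts come from refitting the penalized model at each candidate value.

```python
import math

def bic_component(deviance, n_nonzero, n_j):
    """BIC-type score for the j-th component:
    D_j(tau_j) + log(n_j) * N(tau_j), where N is the nonzero count of beta_j."""
    return deviance + math.log(n_j) * n_nonzero

def select_tau(candidates):
    """candidates: list of (tau, deviance, n_nonzero, n_j) tuples.
    Returns the tau minimizing the BIC-type score."""
    return min(candidates, key=lambda c: bic_component(c[1], c[2], c[3]))[0]
```

A small grid illustrates the trade-off: a larger τ increases the deviance but removes coefficients, and the score balances the two.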

Determining the number of components
Determining the number of components of an FMR model is challenging. In the above discussion, we assumed that m is known, determined either from prior information or from a pre-analysis of the data. One feasible method implements reversible jump Markov chain Monte Carlo (RJMCMC) (Richardson & Green, 1997); however, since adding skewness complicates matters further, we did not pursue RJMCMC. Moreover, the component posterior probabilities evaluated in mixture modelling for Bayesian inference can readily be used as a soft clustering scheme. Alternatively, the log-likelihood maximum and the two information-based criteria, AIC and BIC, can be used to select the number of components. Although some success has been achieved with these model choice criteria, choosing the right number of components for a mixture model is known to be difficult.
To improve the efficiency of selecting the number of components in this article, a productive nonparametric clustering method via mode identification is applied; see J. Li et al. (2007). It deserves mentioning that this approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Specifically, a cluster is formed by those sample points that ascend to the same local maximum of the density function, and a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. In this process, the Modal EM (MEM) algorithm and Ridgeline EM (REM) algorithm are used. Numerical results in Section 7 illustrate that this clustering procedure works well for determining the number of components in the FMMeR model.

Numerical experiments
In this section, we carry out simulation studies to investigate the finite sample performance of the proposed methodology. To be more specific, in Subsection 6.1, we conduct simulations to study the impact of the sample size on the estimation quality, and in Subsection 6.2, we investigate the quality of the performance for variable selection over different values of the skewness, and we compare the performance of the proposed FMMeR model and NMR model used in Khalili and Chen (2007) in Subsection 6.3.

Experiment 1
This experiment examines the impact of the sample size on the estimation quality. In addition, we compare the performance of different variable selection methods from several angles. We generated independent samples of size n from the following two-component FMMeR model:

f(y_i | x_i, Θ) = ν₁ SN(y_i; μ_{i1}, σ₁², λ₁) + (1 − ν₁) SN(y_i; μ_{i2}, σ₂², λ₂),   (19)

where μ_{i1} and μ_{i2} are defined by (9) and Θ = (ν₁, β₁′, β₂′, σ₁, σ₂, λ₁, λ₂)′. The components of the covariate x in the simulation are generated from a uniform distribution U(−1, 1). The true parameter values are set to β_{1(0)} = (1, 0, 0, −1.5, 0)′, β_{2(0)} = (−1, 0, 1, 0, 1.2)′ and σ_{1(0)} = σ_{2(0)} = 2. To test the sensitivity of the FMMeR model to positively or negatively skewed data, we set λ_{1(0)} = 3 and λ_{2(0)} = −3. Mixing proportions ν₁ = 0.5 and 0.35 are considered, and y is generated according to model (19). Following Karlis and Xekalaki (2003), a faster convergence rate can be achieved by setting the true parameter values as the initial values of the iteration. The performance of the estimators β̂, σ̂, λ̂ and ν̂ is assessed using the mean squared error (MSE). The average numbers of correctly (C) and incorrectly (IC) estimated zero coefficients and their standard deviations (SD), based on 500 repetitions, are presented in Table 1 in terms of mixture components 1 and 2; in the table, the first column indicates the penalty function used for the variable selection method and the second column indicates the component. In addition, we report the MSEs and SDs of the scale, skewness and mixing proportion estimates for ν₁ = 0.5 across the repetitions in Table 2. As expected, when the sample size n increases, the methods improve for a given penalty: the MSEs of β̂, σ̂, λ̂ and ν̂ tend to decrease with increasing sample size, which illustrates the convergence property of the maximum penalized likelihood estimator of the FMMeR model.
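Generating data from model (19) with the Experiment 1 parameter values is straightforward via the stochastic representation. The sketch below (our own function names) draws the component label, shifts the location as in (9) so that x′β_j is the approximate component median, and then samples the skew-normal response.

```python
import math
import random
from functools import lru_cache

def sn_pdf(y, mu, sigma, lam):
    z = (y - mu) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(lam * z / math.sqrt(2.0)))
    return (2.0 / sigma) * phi * Phi

@lru_cache(maxsize=None)
def mode_std_sn(lam):
    """Mode m_0(lam) of SN(0, 1, lam), by grid search."""
    return max((i / 1000.0 for i in range(-3000, 3001)),
               key=lambda z: sn_pdf(z, 0.0, 1.0, lam))

def gen_experiment1(n, nu1, rng):
    """Generate (x_i, y_i, component) triples from the two-component
    FMMeR model (19) with the true parameter values of Experiment 1."""
    beta = ([1.0, 0.0, 0.0, -1.5, 0.0], [-1.0, 0.0, 1.0, 0.0, 1.2])
    sigma, lam = (2.0, 2.0), (3.0, -3.0)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in range(5)]
        j = 0 if rng.random() < nu1 else 1
        delta = lam[j] / math.sqrt(1.0 + lam[j] ** 2)
        mu0 = math.sqrt(2.0 / math.pi) * delta
        # location chosen so x'beta_j is the (approximate) component median, as in (9)
        mu = (sum(b * xk for b, xk in zip(beta[j], x))
              - sigma[j] * (mode_std_sn(lam[j]) + 2.0 * mu0) / 3.0)
        y = mu + sigma[j] * (delta * abs(rng.gauss(0.0, 1.0))
                             + math.sqrt(1.0 - delta * delta) * rng.gauss(0.0, 1.0))
        data.append((x, y, j))
    return data
```

By the median parameterization, roughly half of the responses in each component fall below their component median x′β_j, which is a quick check that the generator is consistent with (9).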
For a given n, the performance of the SCAD and HARD methods is similar in terms of model complexity and better than that of the LASSO method. When the mixing proportion ν₁ decreases, so that the sample size for component 1 decreases, all procedures become less satisfactory for component 1 of the FMMeR model. Furthermore, the performances for components 1 and 2 are similar when ν₁ = 0.5, which indicates that the FMMeR model is insensitive to whether the data are positively or negatively skewed.

Experiment 2
To investigate how the estimation quality changes with the skewness, in this experiment we set the mixing proportion ν₁ = 0.5 and the number of observations n = 400. Observations are generated in the same way as in Experiment 1. Table 3 shows C, IC, MSE(β̂) and their SDs for the different penalty functions with λ = −3, −1.5, −0.5, 0.5, 1.5, 3, based on 500 repetitions. Notice that the variable selection procedures perform similarly in all cases for a given penalty function, but a larger SD is obtained by LASSO. Combined with the relevant conclusions of Experiment 1, the results indicate that the performance of the variable selection method is not affected by the skewness of the data.

Experiment 3
To demonstrate the ability of the proposed variable selection method to select important variables, we compare the performance of the proposed FMMeR model with that of the NMR model used in Khalili and Chen (2007) for varying sample sizes n = 200, 400 and ν₁ = 0.5. The data are generated exactly as in Experiment 1, and each of the two models is fitted. The simulated results, based on 500 repetitions, are reported in Table 4. From Table 4, it is clear that the variable selection procedure based on the FMMeR model performs better than that based on the NMR model in some settings. This confirms that the FMMeR model clearly outperforms the NMR model at identifying important variables when there is skewness in the data. As expected, the MSEs indicate the convergence property of the maximum penalized likelihood estimators of the FMMeR and NMR models.

A real-data example
FMR models have been widely used in biomedicine. To further demonstrate the ability of the proposed FMMeR model and variable selection method to identify significant variables, we use a real-data example to illustrate the practical application of the proposed method in this section. The data set, analysed by Cook and Weisberg (1994), concerns the body mass index (BMI) of 102 male and 100 female athletes collected at the Australian Institute of Sport. We are interested in the relationship between BMI and the 10 measured covariates: red cell count (x₁), white cell count (x₂), haematocrit (x₃), haemoglobin (x₄), plasma ferritin concentration (x₅), sum of skin folds (x₆), body fat percentage (x₇), lean body mass (x₈), height (x₉) and weight (x₁₀). The histogram of BMI in Figure 1 shows that the response is right-skewed, indicating a preference for a model with skew-normal random errors. We determine the number of components via the method in Subsection 5.3; the clustering results are shown in Figure 2. At level 3, four clusters are formed, shown by different symbols in Figure 2(a). The four modes identified at level 3 merge into two modes at level 4, as shown in Figure 2(b,d). Compared with level 4, two influential observations are excluded from cluster 1 and cluster 2 at level 3. Thus, it seems reasonable to fit the BMI data with a two-component FMMeR model of the form (19), where μ_{i1} and μ_{i2} are defined by (9) and Θ = (ν₁, β₁′, β₂′, σ₁, σ₂, λ₁, λ₂)′; here x_i is a 10 × 1 vector consisting of all 10 potential covariates. Three penalty functions are used to select significant variables. We compare the variable selection results of three models: the proposed FMMeR model, a finite mixture of modal linear regression (MODLR) models, and the NMR model, where the MODLR model was proposed by Yao and Li (2014).
The variable selection results for the three models are given in Tables 5-7. In this data example, the three variable selection procedures for a given model perform very similarly in terms of selecting significant variables. For the FMMeR model and the finite mixture of MODLR models, the same variables are removed for a given penalty function. The NMR model, however, retains more variables and thus fails to select the significant variables, so the true structure of the model is not identified. With skewed data, HARD and SCAD perform better than LASSO at identifying the authentic structure of the model. In the FMMeR model, seven significant variables, x₁, x₄, x₅, x₇, x₈, x₉ and x₁₀, are identified in component 1; likewise, seven variables, x₄, x₅, x₆, x₇, x₈, x₉ and x₁₀, are contained in component 2. This indicates that these variables have a significant effect on the BMI of athletes. We also find that some variables have different effects in the two components. For instance, red cell count (x₁) and sum of skin folds (x₆) are additional factors affecting athletes' BMI in component 1 and component 2, respectively. Furthermore, x₄, x₅, x₇ and x₈ are helpful for achieving a high BMI in both components. In addition, the performance of the variable selection procedure via the FMMeR model differs from that of the variable selection procedure via the NMR model.

Concluding remarks and future works
In this paper, by utilizing the skew-normal distribution as a component density to overcome the potential inappropriateness of normal mixtures in some contexts, we have developed a novel finite mixture of median regression (FMMeR) model to explore asymmetrical data that arise from several subpopulations. Thanks to the stochastic representation of the skew-normal distribution, we have constructed a hierarchical representation of the finite skew-normal mixtures to address the computational barriers of parameter estimation and variable selection when fitting the FMMeR model. In addition, to determine the number of components, we applied the clustering method via mode identification proposed by J. Li et al. (2007), which showed good performance. Numerical results from simulation studies and a real-data example illustrate that the proposed FMMeR methodology performs well in general, even when the data exhibit symmetrical behaviour. It is worth noting that we have only considered parameter estimation and variable selection for the FMMeR model based on a mixture of skew-normal distributions; moreover, the scenario of p > n has not been considered in this paper. A natural extension of the proposed methodology is to consider other skewed distributions, such as the skew-t and skew-Laplace distributions, as well as high-dimensional settings. Another research direction is to model the mixing proportions ν, which extends the proposed model to the framework of mixture-of-experts models. Finally, it will also be of interest to consider Bayesian variable selection and semi-parametric and nonparametric methods for the FMMeR model, which are currently under investigation and will be reported elsewhere.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Appendix 1: Brief proofs of theorems

nξ_n p′_{τ_j}(β⁰_{jt})‖u_I‖ + nξ_n² p″_{τ_j}(β⁰_{jt})‖u_I‖²{1 + o(1)}
In a shrinking neighbourhood of 0, |β_{jt}| O_p(n^{1/2}) < nν_j p′_{τ_j}(|β_{jt}|) in probability by condition C₂. This completes the proof of part (a). To prove sparsity in part (b(1)), consider the partition Θ = (Θ₁′, Θ₂′)′ and the function L(Θ₁, Θ₂). Let (Θ̂₁, 0) be the maximizer of the penalized log-likelihood function L(Θ₁, 0), considered as a function of Θ₁. It suffices to show that, in the neighbourhood ‖Θ − Θ₀‖ = O_p(n^{−1/2}), L(Θ₁, Θ₂) − L(Θ₁, 0) < 0 with probability tending to 1 as n → ∞, which follows from the result in part (a). To prove asymptotic normality in part (b(2)), consider L(Θ₁, 0) as a function of Θ₁. By the same argument as in Theorem 4.1, there exists a √n-consistent local maximizer of this function, denoted by Θ̂₁, that satisfies the corresponding penalized score equation. Substituting the first-order Taylor expansions of ∂ℓ(Θ)/∂Θ₁ and ∂p(Θ)/∂Θ₁ into this expression yields the stated asymptotic normality.