Optimal model averaging estimator for multinomial logit models

In this paper, we study optimal model averaging estimators of regression coefficients in a multinomial logit model, which is commonly used in many scientific fields. A Kullback–Leibler (KL) loss-based weight choice criterion is developed to determine the averaging weights. Under some regularity conditions, we prove that the resulting model averaging estimators are asymptotically optimal. When the true model is one of the candidate models, the averaged estimators are consistent. Simulation studies suggest the superiority of the proposed method over commonly used model selection criteria, model averaging methods, and some other related methods in terms of the KL loss and mean squared forecast error. Finally, the website phishing data are used to illustrate the proposed method.


Introduction
Model selection is a traditional data analysis methodology. By minimizing a model selection criterion, such as the Akaike information criterion (AIC) (Akaike, 1973), the Bayesian information criterion (BIC) (Schwarz, 1978) or Mallows' Cp (Mallows, 1973), one model is chosen from a number of candidate models, and statistical inference then proceeds under the selected model. In this process, the additional uncertainty, or even bias, introduced by the model selection procedure is ignored, and the variance of inferences is therefore often underreported, as discussed in H. Wang et al. (2009). Instead of focusing on one model, the model averaging approach considers a series of candidate models and gives higher weights to the better models. It is an integrated process that accounts for the uncertainty introduced by the model selection procedure and reduces the risk in regression estimation.
Model averaging can be classified into Bayesian model averaging (BMA) and frequentist model averaging (FMA). Compared with the FMA approach, there is an enormous literature on the BMA method; see Hoeting et al. (1999) for a comprehensive review. Unlike the BMA approach, which accounts for model uncertainty by assigning a prior probability to each candidate model, the FMA approach does not require priors, and the corresponding estimators are determined entirely by the data. Therefore, current studies in statistics and econometrics pay increasing attention to the FMA approach.
In recent years, optimal model averaging methods have received a substantial amount of interest. Hansen (2007) proposed the Mallows model averaging (MMA) method for linear regression models with independent and homoscedastic errors, and established its asymptotic optimality for a class of nested models by constraining the model weights to a special discrete set. A. T. Wan et al. (2010) provided a more flexible theoretical framework for MMA that preserves its asymptotic optimality for continuous weights and non-nested models. Hansen and Racine (2012) and Liu and Okui (2013) developed the jackknife model averaging (JMA) method and heteroscedasticity-robust Cp (HRCp) model averaging for linear regression with independent and heteroscedastic errors, respectively. Zhang et al. (2013) extended JMA to linear regression with dependent errors. Cheng et al. (2015) provided a feasible autocovariance-corrected MMA method to select weights across generalized least squares estimators for linear regression with time series errors. Zhu et al. (2018) proposed MMA for multivariate multiple regression models.
Hansen's approach and the subsequent extensions listed above mainly focus on linear models. Recently, optimal model averaging methods for nonlinear models have also been developed, including optimal model averaging criteria for partially linear models (Zhang & Wang, 2019), quantile regression (Lu & Su, 2015), generalized linear models and generalized linear mixed-effects models (Zhang et al., 2016), varying coefficient models (Li et al., 2018), varying-coefficient partially linear models (Zhu et al., 2019), spatial autoregressive models (Zhang & Yu, 2018), and others. All of these methods are asymptotically optimal in the sense of achieving the lowest loss in large samples. To the best of our knowledge, there are few optimal averaging estimation results for the multinomial logit model that allow all candidate models to be possibly misspecified; the main contribution of this paper is to fill this gap. The multinomial logit model is widely used in marketing research (Guadagni & Little, 1983), risk analysis (Bayaga, 2010), credit ratings (Ederington, 1985) and other fields involving categorical data. A. T. Wan et al. (2014) developed the 'approximately optimal' (Aopt) method for the multinomial logit model under a local misspecification assumption but did not establish asymptotic optimality of the resulting model averaging estimator; moreover, there have been many debates concerning the realism of the local misspecification assumption, e.g., Raftery and Zheng (2003). Subsequently, S. Zhao et al. (2019) proposed an M-fold cross-validation (MCV) criterion for the multinomial logit model that yields forecasts superior to the strategy of A. T. Wan et al. (2014), and proved its asymptotic optimality when the dimension of covariates is fixed.
The two papers on multinomial logit models listed above both concern a squared estimation error-based risk. Unlike squared errors, the KL loss measures the closeness between the model and the true data generating process. Many criteria have been developed from the KL loss, such as the generalised information criterion (GIC) (Konishi & Kitagawa, 1996), the Kullback-Leibler information criterion (KIC) (Cavanaugh, 1999) and an improved version of the Akaike information criterion (AICc) (Hurvich et al., 1998). In addition, Zhang et al. (2015) showed that, for linear regression, model averaging methods based on the KL loss yield better forecasts than model averaging approaches based on squared errors. Motivated by these facts, proposing a novel model averaging method based on the KL loss is potentially interesting. Our simulation study demonstrates that, for the multinomial logit model, the model averaging method based on the KL loss has a strong competitive advantage over the model averaging strategy based on the squared estimation error.
To develop an optimal model averaging method for the multinomial logit model, the weights are obtained by minimizing the KL loss. That is, we use a plug-in estimator of the KL loss plus a penalty term as the weight choice criterion, which is equivalent to penalizing the negative log-likelihood. It is interesting to note that this criterion reduces to the Mahalanobis Mallows criterion of Zhu et al. (2018), who assume that the distribution of the multiple responses is multivariate normal. The asymptotic optimality, based on the KL loss, of the proposed method is built on the consistency of estimators in misspecified models, which is more flexible than the above-mentioned local misspecification assumption. Moreover, the asymptotic optimality is established for the dimension of covariates being either fixed or diverging.
This article is the first study to propose optimal model averaging estimation for multinomial logit models based on the KL loss. When the number of candidate models is small, the corresponding numerical solutions are obtained nearly instantaneously. If the number of candidate models is large, the computational burden of our model averaging procedure becomes heavy; in this case, a model screening step prior to model averaging is desirable. That is, we use penalized regression with the LASSO (Friedman et al., 2010) to prepare candidate models: different tuning parameters may result in different models, which are included in the resulting set of candidate models. Using the website phishing data, we demonstrate the superiority of our proposed method.
Our work is related to Zhang et al. (2016), who developed a model averaging method for univariate generalized linear models. We differ from that study by establishing the asymptotic optimality from some primitive conditions, whereas they prove asymptotic optimality by assuming the validity of certain conclusions. Moreover, we discuss the case when the true model is one of the candidate models and prove that the model averaging estimators based on our weight choice criterion are consistent.
The remainder of this article is organized as follows. In Section 2, we first describe the multinomial logit model. Then, we introduce the model averaging estimation for the multinomial logit model and propose a weight choice criterion by considering the KL loss. The asymptotic optimality of the proposed method and the estimation consistency are discussed in Sections 3 and 4, respectively. Sections 5 and 6 present the numerical results through various simulations and a real data example, respectively. Technical proofs of the main results are presented in the Appendix.

Multinomial logit model
Consider a general discrete choice model with n independent individuals and d nominal alternatives, where y_i = j means that individual i selects alternative j. We use a multinomial logit regression to describe the discrete choice model (A. T. Wan et al., 2014). The corresponding assumption is that the log odds of category j relative to the reference category (without loss of generality, we regard alternative d as the reference) are determined by a linear combination of regressors. Thus, the choice probabilities for the ith individual can be expressed as

P(y_i = j) = exp(X_i β_j) / {1 + Σ_{l=1}^{d−1} exp(X_i β_l)}, j = 1, . . . , d − 1, and P(y_i = d) = 1 / {1 + Σ_{l=1}^{d−1} exp(X_i β_l)},  (1)

where X is an n × k covariate matrix with full column rank, X_i is constructed from the ith row of X, and β_j is an unknown parameter vector. We first assume k to be fixed and discuss the diverging situation in Section 3. Formula (1) can be rewritten in an exponential family form,
where θ_i = (θ_i1, . . . , θ_i(d−1))^T is a parameter vector, with the canonical parameter θ_ij connecting the parameters β_j and the k-dimensional covariate vector in the form θ_ij = X_i β_j.
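As a concrete illustration of the choice probabilities in formula (1), the multinomial logit is a softmax over the linear predictors X_i β_j with the reference category's predictor fixed at zero. The following Python sketch (the function name and array layout are ours, not the paper's) shows the computation:

```python
import numpy as np

def choice_probabilities(X_i, B):
    """Choice probabilities for one individual in a multinomial logit model.

    X_i : (k,) covariate vector for individual i.
    B   : (d-1, k) matrix whose jth row is beta_j; alternative d is the
          reference category, so its linear predictor is fixed at 0.
    Returns a length-d probability vector summing to 1.
    """
    eta = B @ X_i                 # linear predictors X_i beta_j, j = 1..d-1
    eta = np.append(eta, 0.0)     # reference category d
    eta -= eta.max()              # stabilise the exponentials
    p = np.exp(eta)
    return p / p.sum()
```

With B = 0 this returns equal probabilities 1/d, as formula (1) implies.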

Model averaging estimator
We denote a set of S candidate models M_1, . . . , M_S, where S is fixed, and X_(s) is an n × k_s matrix containing k_s columns of X with full column rank, whose rows are X_(s),1, . . . , X_(s),n. Under the sth candidate model, the maximum-likelihood estimator of the regression coefficients is β̂_s ∈ R^{(d−1)×k_s}. Let β̂_(s) ∈ R^{(d−1)×k} be the full-dimensional vector whose subvector corresponding to X_(s) equals β̂_s; the remaining components of β̂_(s) are restricted to be zeros. Let θ_0i be the true value of θ_i. We do not require that there exists a β_0 such that θ_0i = (I_{d−1} ⊗ X_i)β_0; thus, each of the candidate models can be misspecified. After the maximum-likelihood estimators are obtained, we need to determine the weight of each candidate model. Let ω = (w_1, . . . , w_S)^T be a weight vector in the unit simplex of R^S: H = {ω ∈ [0, 1]^S : Σ_{s=1}^S w_s = 1}. Then the model averaging estimator of β is β̂(ω) = Σ_{s=1}^S w_s β̂_(s). Let Y = (T(y_1), . . . , T(y_n))^T, let Θ = (θ_1, . . . , θ_n)^T be the parameter matrix, and let Θ_0 = (θ_01, . . . , θ_0n)^T be its true value. We put the model estimator in vector form using the vectorizing operation Vec(·), which creates a column vector by stacking the columns of a matrix one below another. Then, the model averaging estimator can be expressed as Vec(Θ^T{β̂(ω)}) = Zβ̂(ω).
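The averaging step itself is a weighted sum of zero-padded submodel estimates. A minimal Python sketch (the function and argument names are ours) of β̂(ω) = Σ_s w_s β̂_(s):

```python
import numpy as np

def averaged_coefficients(submodel_estimates, supports, k, weights):
    """Model averaging estimator beta_hat(omega) = sum_s w_s * beta_hat_(s).

    submodel_estimates : list of (d-1, k_s) arrays, the ML estimates beta_hat_s.
    supports : list of index arrays; supports[s] lists the columns of X used
               by candidate model s.
    k : total number of columns of X.
    weights : length-S weight vector on the unit simplex.
    Returns the (d-1, k) averaged coefficient matrix.
    """
    d_minus_1 = submodel_estimates[0].shape[0]
    avg = np.zeros((d_minus_1, k))
    for w, est, cols in zip(weights, submodel_estimates, supports):
        full = np.zeros((d_minus_1, k))   # beta_hat_(s): zeros off the support
        full[:, cols] = est
        avg += w * full
    return avg
```

Each β̂_s is embedded into the full dimension with zeros before weighting, exactly as the zero restriction on β̂_(s) above prescribes.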

KL loss-based weight choice criterion
For linear models, the weight choice criterion is based on squared prediction error. In this paper, we instead use the KL divergence to establish asymptotic optimality. Let KL(ω) denote the KL loss of β̂(ω). Because of the relationship between J(ω) and KL(ω), we can obtain ω by minimizing J(ω) instead of KL(ω). In practice, minimizing J(ω) is infeasible because the value of U is unknown. An intuitive idea is to plug Y into J(ω) in place of U; that is, we can obtain ω by minimizing J*(ω) = B{β̂(ω)} − Vec(Y^T)^T Vec(Θ^T{β̂(ω)}). However, this process leads to overfitting. Motivated by the fact that J*(ω) equals the corresponding negative log-likelihood of Vec(Θ^T{β̂(ω)}) = Zβ̂(ω), we add a penalty term λ_n(d − 1)ω^T K to 2J*(ω), where K = (k_1, . . . , k_S)^T and k_s is the number of columns of X used in the sth candidate model. The weight choice criterion is then ℘(ω) = 2J*(ω) + λ_n(d − 1)ω^T K, and the resultant weight vector is defined as ω̂ = arg min_{ω∈H} ℘(ω). Because ℘(ω) is convex in ω, the global optimization can be performed efficiently through constrained optimization programming; for example, the fmincon function of MATLAB can be applied for this purpose. Note that when one element of ω is restricted to 1 and the others to 0, ℘(ω) is equivalent to AIC and BIC, in the sense of choosing weights, when λ_n = 2 and λ_n = log(n), respectively. In addition, when λ_n = 2, the criterion ℘(ω) reduces to the Mahalanobis Mallows criterion of Zhu et al. (2018), who assume that the distribution of the multiple responses is multivariate normal.
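The paper performs this constrained minimization with MATLAB's fmincon. A Python analogue, sketched here under the assumption that the criterion ℘(ω) is supplied as a callable, uses scipy's SLSQP solver with the simplex constraints of H:

```python
import numpy as np
from scipy.optimize import minimize

def select_weights(criterion, S):
    """Minimise a weight choice criterion over the unit simplex H.

    criterion : callable mapping a length-S weight vector to the scalar
                criterion value (e.g. 2*J_star(w) + lam*(d-1)*(w @ K)).
    S : number of candidate models.
    """
    w0 = np.full(S, 1.0 / S)                       # start from equal weights
    res = minimize(
        criterion, w0, method="SLSQP",
        bounds=[(0.0, 1.0)] * S,                   # each w_s in [0, 1]
        constraints=[{"type": "eq",
                      "fun": lambda w: w.sum() - 1.0}],  # weights sum to 1
    )
    return res.x

# toy convex criterion whose simplex minimiser is (0.2, 0.8)
w_hat = select_weights(lambda w: (w[0] - 0.2)**2 + (w[1] - 0.8)**2, 2)
```

Because ℘(ω) is convex and H is a convex set, any local solution returned by such a solver is the global minimiser ω̂.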

Asymptotic optimality
This section presents the main theoretical results of this paper, demonstrating the asymptotic optimality of the model averaging estimator β̂(ω̂). We define the pseudo-true regression parameter vector β*_(s) as the minimizer of the KL divergence between the true model and the sth candidate model. From Theorem 3.2 of White (1982), when the dimension of β*_(s), k, is fixed, under some regularity conditions, we have

β̂_(s) − β*_(s) = o_p(1).  (5)

Before we provide the relevant theorems, we first list the relevant notation of this paper.
Here Cov(·) denotes the covariance, and λ_max{·} and λ_min{·} denote the maximum and minimum eigenvalues of a matrix, respectively. Note that all limiting properties here and throughout the text hold as n → ∞.
The following conditions will be made:
R.1 There exist constants c and C such that 0 < c < λ_min{X^T X/n} < λ_max{X^T X/n} < C.
R.2 max_{i∈{1,...,n}} ||X_i||^2 / n → 0, and there exist constants C_1 and C_2 such that 0 < C_1 ≤ λ_min{Cov(ε_i)} ≤ λ_max{Cov(ε_i)} ≤ C_2.
Remark 3.1: Conditions R.1 and R.2 are regular. Condition R.1 is the same as condition C.2 and the second part of condition C.3 of Zhang et al. (2020). The first part of Condition R.2 is the same as the first part of condition C.3 of Zhang et al. (2020). The second part of Condition R.2 assumes that the covariance matrix of ε_i is positive definite. Condition R.3 requires ξ_n to grow at a rate no slower than n^{1/2}. Both λ_n = 2 and λ_n = log(n) satisfy Condition R.4, which means that if one prefers AIC or BIC, this can be achieved by choosing λ_n = 2 or λ_n = log(n). Condition R.4 is also used in Theorem 2 of Zhang et al. (2020).
The following theorem establishes the asymptotic optimality of the model averaging estimators for the fixed-k situation.
Theorem 3.1: For fixed k, suppose that Equation (5) and the regularity Conditions R.1-R.4 hold. Then ω̂ is asymptotically optimal in the sense that KL(ω̂) / inf_{ω∈H} KL(ω) → 1, where the convergence is in probability.
Remark 3.2: Theorem 1 of S. Zhao et al. (2019) is based on the squared loss, which concerns only pointwise distances. In contrast, the KL loss measures the closeness between the model and the true data generating process, and therefore concerns the full distribution. Moreover, the KL loss pays more attention to points with high probability, whereas the squared loss treats all points as equally important.
We list the following conditions required for the case with diverging k.
R.5 There exists a constant C_0 > 0 such that, for any fixed δ > 0, any β_s ∈ B_n(β*_s | δ) and every s = 1, . . . , S, the minimum eigenvalue of the corresponding scaled matrix is bounded below by C_0; a similar condition is used by Lu and Su (2015).
Remark 3.3: Condition R.6 guarantees a convergence needed for the diverging-k case and is an extension of the first part of condition C.4 of Zhang et al. (2016). Condition R.7 is an extension of Condition R.3 to the diverging-k situation: it allows k to increase with n but restricts its rate. Obviously, as k increases, ξ_n decreases, so a smaller k makes Condition R.7 easier to satisfy. In practice, we can exclude redundant variables from the candidate set prior to model averaging to control k.
Remark 3.4: Note that both λ_n = 2 and λ_n = log(n) satisfy Condition R.4. In Section 5, the numerical analysis shows that they outperform the model selection methods (AIC and BIC) and the model averaging methods (smoothed AIC and smoothed BIC), respectively. Moreover, when the sample size is small, the optimal value of λ_n increases as the level of model misspecification increases.

Estimation consistency
Here we comment on the case when the true model is included among the candidate models. That is, θ_0i = (I_{d−1} ⊗ X_i)β_0, where β_0 ∈ R^{(d−1)×k} is the true value of β and the number of non-zero coefficients of β_0 is k_true. Let ω_true be a weight vector in which the element corresponding to the true model is 1 and all others are 0. When k is fixed, from Chapter 3.4.1 of Fahrmeir and Tutz (2013), under some regularity conditions, the maximum-likelihood estimator under the true model is consistent for β_0; for diverging k, the same conclusion follows from Theorem 2.1 of Portnoy (1988) under some regularity conditions. In order to prove the estimation consistency, we further impose the following condition.
We now describe the performance of the weighted estimator when the true model is among the candidate models.

Monte Carlo simulations
In this section, we evaluate the empirical performance of our proposed model averaging strategy for the multinomial logit model. We use two versions of the proposed method, named OPT1-KL with λ_n = 2 and OPT2-KL with λ_n = log(n), and compare them with alternative FMA methods and model selection methods: SAIC assigns the weight exp(−AIC_s/2)/Σ_{t=1}^S exp(−AIC_t/2) to the sth model, and SBIC allocates weights similarly with BIC in place of AIC. We use the KL loss for assessment and generate 1000 simulated data sets. For convenient comparison, we normalize all KL losses by dividing by the KL loss of the best method. The sample size is set to 100 and 200. Note that MCV and A-OPT are model averaging methods that average estimates of the probability that y_i = j, which makes computing the KL loss infeasible. Therefore, we compare our methods with MCV and A-OPT in terms of the mean squared forecast error (MSFE) instead of the KL loss. Without loss of generality, we also normalize all MSFEs by dividing by the MSFE of the best method.
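The smoothed-criterion weights used by the SAIC and SBIC benchmarks can be computed as follows; this sketch assumes the standard exp(−IC_s/2) construction (the function name is ours):

```python
import numpy as np

def smoothed_weights(ic_scores):
    """Smoothed information-criterion weights, w_s proportional to exp(-IC_s/2).

    ic_scores : length-S array of AIC values (for SAIC) or BIC values (for SBIC).
    Returns a length-S weight vector on the unit simplex.
    """
    ic = np.asarray(ic_scores, dtype=float)
    u = np.exp(-(ic - ic.min()) / 2.0)   # subtract the minimum for stability
    return u / u.sum()
```

Subtracting the minimum score before exponentiating leaves the weights unchanged but avoids numerical underflow when the criterion values are large.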
Two simulation situations are considered. In the first, the candidate models do not contain the true model, and we examine the effects of changing the magnitude of the coefficients and the level of model misspecification; we also consider the case when the number of covariates diverges with the sample size. Note that all candidate models are misspecified in this situation, so the full model does not exist, which implies that the A-OPT method is not feasible here. In the second situation, the candidate models include the true model; we compare our methods with other competitive methods and validate the estimation consistency.
Setting 1. We generate y_i from the multinomial logit model (2) with the following specifications: d = 3, X_i1 = 1, X_ij, j = 2, . . . , 6, follow normal distributions with mean zero and variance one, and the correlation between different components of X_i is ρ = 0.75. In order to make all candidate models misspecified, we pretend that the last covariate is missing. X_i1 is contained in all candidate models, so there are 2^4 − 1 = 15 candidate models to combine. The parameter γ_1 is used to control the magnitude of the coefficients, and we let it vary in the set {0.5, 1, 2}.
Simulation results are shown in Table 1. One remarkable aspect of the results is that OPT1-KL and OPT2-KL yield smaller mean and standard deviation (SD) values than the other four competitors (SAIC, SBIC, AIC and BIC) across different magnitudes of coefficients. In the majority of cases, FMA approaches are superior to model selection methods, and this pattern is more obvious when γ_1 is small than when it is large. For example, when γ_1 = 0.5, all model averaging methods deliver smaller mean values than the model selection strategies; when γ_1 increases to 2, for n = 200, AIC has a marginal advantage over SBIC in terms of mean values. This result is reasonable: when γ_1 is small, the non-zero coefficients in the true model are all close to zero, which makes it difficult to distinguish them from the zeros of a false model, so model selection criterion scores can be quite similar across candidate models and the choice of models becomes unstable. On the other hand, when the absolute values of the non-zero coefficients are large, a model selection criterion can identify a non-zero coefficient more readily. In addition, we calculate the means of the ratio KL(ω̂)/inf_{ω∈H} KL(ω) for OPT1-KL and OPT2-KL with γ_1 = 1. The mean values, presented in Figure 1, decrease and approach 1 as the sample size n increases, which verifies the asymptotic optimality numerically. We then compare our strategy with MCV in terms of MSFE: we generate 100 observations as the training sample and 10 observations as the test sample under Setting 1 with γ_1 = 1, and the simulation results are based on 1000 replications.
where p̂^[r]_vj is the forecast of p^[r]_vj, the probability that the vth test observation selects alternative j in the rth replication. Table 2 shows that our proposed approaches outperform the other strategies, and that SAIC and SBIC perform better than MCV. Note that MCV is a model averaging method based on the squared loss, whereas our strategy is based on the KL loss; this implies that, for a multinomial logit model, the approach based on the KL loss has a strong competitive advantage over the approach based on the squared loss.

Setting 2. In order to examine the effects of changing the level of model misspecification, we modify the data generating process, where U_i = (1, X_i2, . . . , X_i5), X_i2, . . . , X_i6 have the same specification as in the previous design, and γ_2 controls the level of the model misspecification; we study it in the set {0.25, 0.5, 0.75}. We take the multinomial logit model (3) to fit the data. We still omit the last covariate X_i6 and consider S = 2^4 − 1 candidate models. The simulation results under different levels of model misspecification are shown in Table 3. OPT1-KL and OPT2-KL always deliver better performance than their competitors SAIC/AIC and SBIC/BIC in terms of mean values, respectively. Focusing on SD values, OPT1-KL always performs much better than SAIC and AIC, and OPT2-KL outperforms SBIC and BIC in most cases. This demonstrates the superiority of our methods.
In addition, we explore our strategies with values of λ_n other than 2 and log(n); that is, we vary λ_n from 0.5 to n^{0.4}. The simulation results are presented in Figure 2. For γ_2 = 0.25, 0.5 and 0.75, when n = 100, the means of the KL loss are minimized at λ_n = 2.5, λ_n = 2.75 and λ_n = 3.25, respectively. This indicates that, for small sample sizes, the optimal value of λ_n increases slightly as the level of model misspecification increases. For the larger sample size n = 200, the optimal value of λ_n is the same for all cases, λ_n = 2.
Setting 3. This setting considers the case when the number of covariates diverges. The data generating process is the same as in Setting 1, except that we adapt the covariance matrix to R = (r_ij) with r_ij = 0.40^{|i−j|}, because the model screening method is not suitable when the covariates are strongly dependent, as implied by the first part of Lemma 3.2 in Ando and Li (2014). The coefficients are chosen as β_1 = (1, 1.2, 0, 0, 1.5, 0, 0, 1.1, 0, 0, 0.1, . . . , 0.1, 0.9)^T of dimension [3n^{1/3}] × 1 and β_2 = (1, 1.3, 0, 0, 2, 0, 0, 1.2, 0, 0, 0.1, . . . , 0.1, 0.8)^T. Similar to Setting 1, we pretend that the last covariate is missing. There are then 2^{[3n^{1/3}]−2} − 1 candidate models, and the computational burden would be heavy; therefore, a method to screen candidate models is desirable. That is, we use penalized regression with the LASSO (Friedman et al., 2010) to prepare candidate models, where different tuning parameters may result in different models, all of which are included in the resulting set of candidate models. Obviously, the resulting candidate model contains many redundant variables when the tuning parameter is very small; to avoid generating candidate models that include many redundant variables, we use tuning parameters larger than ζ to prepare candidate models. Simulation results are provided in Table 4. Focusing on the mean values, Table 4 shows that OPT1-KL always performs better than SAIC and AIC, and OPT2-KL still outperforms SBIC and BIC, respectively. In addition, OPT2-KL always outperforms LASSO. This demonstrates the advantages of our proposed method compared with the other competitive methods.
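The candidate-preparation step can be sketched as follows. The paper cites Friedman et al. (2010) (glmnet); here we use scikit-learn's l1-penalized multinomial logistic regression as a stand-in, collecting the distinct supports produced along a grid of tuning parameters (the function name and grid are ours):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lasso_candidate_supports(X, y, C_grid):
    """Collect candidate models from the LASSO path of a multinomial logit.

    Each inverse-regularization value C may select a different set of
    covariates; the distinct non-empty supports become the candidate
    models for averaging.
    """
    supports = []
    for C in C_grid:
        fit = LogisticRegression(penalty="l1", solver="saga", C=C,
                                 max_iter=5000)
        fit.fit(X, y)
        # columns with a non-zero coefficient in at least one category
        cols = np.where(np.any(fit.coef_ != 0.0, axis=0))[0]
        if cols.size and not any(np.array_equal(cols, s) for s in supports):
            supports.append(cols)
    return supports
```

Restricting the grid to tuning parameters above a threshold (ζ in the text) keeps heavily under-penalized, redundant-variable models out of the candidate set.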
Setting 4. All candidate models include X_i1; thus, we consider a total of 2^4 − 1 = 15 candidate models. Note that the true model is included among the candidate models. We compare our proposed method with the other competitive methods in terms of the KL loss and MSFE; the results are presented in Tables 5 and 6, respectively. These results show that OPT2-KL always obtains the smallest KL loss and MSFE among these methods, which validates the superiority of our method. Also, we calculate the mean squared error (MSE) for OPT1-KL and OPT2-KL.
where β̂^(r)(ω̂) represents the estimator of β for the rth replication and β = (β_1^T, β_2^T)^T. The MSE curves, shown in Figure 3, decrease and approach zero as the sample size n increases. This feature confirms the estimation consistency numerically.

An empirical application
In this section, we apply the proposed method to the website phishing data, previously used by Abdelhamid et al. (2014). This data set contains three types of websites (702 phishing websites, 548 legitimate websites and 103 suspicious websites). The explanatory variables consist of Server Form Handler, Using Pop-Up Window, Fake HTTPs protocol, Request URL, URL of Anchor, Website Traffic, URL Length, Age of Domain and Having IP address. These variables are categorical (or binary), and we transform this information into indicator variables, after which the total number of predictors is 16. After the screening step, we analyse this data set using candidate multinomial logit models. We randomly select 677 observations as the training sample and predict the remaining instances, repeating this process 500 times. We use the following KL-type prediction loss L_KL to measure the prediction performance,
where {y_test,1, . . . , y_test,n_0} are the testing observations and p(y_test,v = j) is the predicted probability of the vth test observation taking the value j. Figure 4 shows the box plot of the KL-type prediction losses of the seven methods. Our proposed methods OPT1-KL and OPT2-KL produce better performance than their competitors SAIC/AIC and SBIC/BIC, respectively, and OPT2-KL outperforms LASSO in terms of the KL loss. In addition, we evaluate the prediction performance based on the hit-rate, computed by dividing the number of correct predictions by the size of the test sample; the predicted value of an observation is j (1 ≤ j ≤ 3) if the predicted probability of this observation taking the value j is the largest among the three predicted probabilities. We also calculate the optimal rate (OPR) and the worst rate (WOR) of each method, which are the proportions of replications in which the method attains the largest and the smallest hit-rate, respectively. Table 7 presents the mean hit-rate values (HRV), OPR and WOR of these methods, showing that OPT1-KL and OPT2-KL obtain larger HRV and OPR and smaller WOR than the other competitors SAIC, SBIC, AIC, BIC and LASSO, which demonstrates the superiority of our proposed strategies. Table 8 reports the Diebold-Mariano test (Diebold & Mariano, 2002) results for the differences in hit-rate. Note that a positive Diebold-Mariano statistic indicates that the estimator in the numerator produces a larger hit-rate than the estimator in the denominator. The test statistics and p-values show that the differences in hit-rate between our methods and the other strategies are all statistically significant.
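The two out-of-sample metrics can be computed directly from a matrix of predicted probabilities. A sketch, assuming the KL-type loss takes the standard negative average log-probability form and coding the alternatives 0..d−1 (function names are ours):

```python
import numpy as np

def hit_rate(prob_matrix, y_test):
    """Fraction of test observations whose predicted class (the alternative
    with the largest predicted probability) matches the observed one.

    prob_matrix : (n0, d) predicted probabilities for the test sample.
    y_test      : (n0,) observed alternatives coded 0..d-1.
    """
    predicted = np.argmax(prob_matrix, axis=1)
    return np.mean(predicted == y_test)

def kl_prediction_loss(prob_matrix, y_test):
    """KL-type prediction loss: minus the average log predicted probability
    assigned to the realised alternative of each test observation."""
    n0 = len(y_test)
    return -np.mean(np.log(prob_matrix[np.arange(n0), y_test]))
```

A smaller KL-type loss rewards forecasts that place high probability on the realised alternative, whereas the hit-rate only records whether that alternative ranked first.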

Discussion
In the context of the multinomial logit model, we proposed a model averaging estimator and a weight choice criterion based on the KL loss with a penalty term, and proved the asymptotic optimality of the resulting model averaging estimator under some regularity conditions. When the true model is one of the candidate models, the averaged estimators are consistent. Also, to reduce the computational burden, we applied a model screening step before averaging. Numerical experiments demonstrated the superiority of the proposed methods over other commonly used model selection strategies, model averaging methods, MCV and LASSO in terms of the KL loss and MSFE. While we consider the multinomial logit model, the extension to other models, such as the ordered logit model, warrants further investigation, and the data structure of the regressors further complicates this issue. Another interesting question is how to choose an optimal λ_n. Shen et al. (2004) proposed an adaptive method to choose λ_n for model selection criteria in generalized linear models; building a similar method for our proposed model averaging procedure warrants future research.

Disclosure statement
No potential conflict of interest was reported by the author(s).