Sparse modeling of chemical bonding in binary compounds

ABSTRACT A sparse model for quantifying energy difference between zinc-blende and rock-salt crystal structures in octet elemental and binary materials is constructed by using the linearly independent descriptor-generation method and exhaustive search, following the previous work by Ghiringhelli et al. [Phys Rev Lett. 2015;114:105503]. The obtained simplest model includes only atomic radius information of constituent atoms and its physical meaning is interpreted in relation to van Arkel-Ketelaar’s triangle for classifying chemical bonding in binary compounds.


Introduction
Recently, data-intensive scientific discovery and design have been the focus of great attention for the acceleration of research and development in materials science, being widely called materials informatics (MI). The major aims of MI are the exploration of new materials with desired properties, the optimization of existing materials for particular performances, and the understanding of underlying physical mechanisms for further development. Generally, if one demands high predictability for a model constructed by data-science machine-learning techniques, complicated methods using non-linear models such as kernel ridge regression [1], neural network [2], and random forest [3] are appropriate, though their interpretation becomes troublesome because of the non-linearity involved in the modeling. On the other hand, simple modeling such as linear regression with interpretable descriptors is suitable for extracting intuitive understanding from materials data at the sacrifice of its predictability to a certain degree. Sparse modeling [4] is the statistical learning technique to realize such a simple model by the selection and reduction of the descriptors assumed.
A pioneering work with the use of the sparse modeling for materials properties was reported by Ghiringhelli et al. [5]. Total energy differences between zinc-blende and rock-salt crystal structures obtained by density-functional-theory (DFT) calculations for 82 elementary and binary semiconductors AB are modeled with the least absolute shrinkage and selection operation (LASSO) [6] and exhaustive search techniques within the linear regression modeling. They have succeeded to construct simple models with a small number of descriptors at relatively high predictability. The key to success can be found in the construction of the descriptors. They first assumed several basic descriptors such as ionization potential, electron affinity, and some DFT atomic data for constituent atoms and then operated them to get higher-order descriptors with multiplication, division, and functionalization up to the order of thousands. The LASSO technique is utilized to reduce the number of descriptors to tens by statistical procedures and error evaluations. Finally, an exhaustive search is used to extract the most important descriptors for a given number of descriptors among them. Nevertheless, the obtained model is still far understandable with physical intuitiveness because of complicated functions of several basic descriptors.
In this study, we aim to construct a simpler and interpretable model for the same problem as that Ghiringhelli and coworkers attacked. Our idea is two folds: one is the symmetrization of basic descriptors for the permutation of constituent elements A $ B and the second is the high-order operation of basic descriptors without using complicated functions like exponential. Also, regression trials with a single basic descriptor will be carried out. During the high-order descriptor operations, collinearity problems (including multicollinearity and near multicollinearity) often take place because of strong dependency between the generated higher-order descriptors by products of descriptors. The linear independent descriptor generation (LIDG) method recently proposed by us [7] is adopted to remove those collinearity problems if they happen. Our models are found to be as simple as the previous models, without utilizing complicated descriptors and able to quantitatively classify the chemical bonding in binary compound systems.

Target variables
The target variables prepared by Ghiringhelli et al. [5] are used for the construction of modeling in this study. Namely, total energy differences between zinc-blende (ZB) and rock-salt (RS) type structures calculated for 82 octet elementary and binary compounds AB with main-group elements as ΔE ¼ EðRSÞ À EðZBÞ: (1) The data used for the present regression are listed in Appendix A. To confirm the precision of the target data, total energies of the 82 systems with ZB and RS structures are recalculated by using the all-electron full-potential linearized augmentation plane-wave method implemented in our HiLAPW code [8] and the root-mean-square errors are 7meV/atom in the total-energy difference and 0.009Å in the equilibrium lattice constant.

Descriptors
Ghiringhelli et al. [5] distinguished the constituent elements A and B according to the size of electronegativity. However, permutation of A and B leads to no physical change in the system at the equiatomic condition and models constructed should be symmetric by the permutation. In the present study, we generate descriptors as follows: (1) Prepare basic descriptors x A and x B for constituent atoms A and B on the basis of our intuition. (2) Symmetrize them by permutation A $ B and add their inversion, being called first-order descriptors. (3) Generate high-order descriptors by multiplication of the first-order descriptors. (4) Remove multicollinearity and near multicollinearity by erasing the irrelevant descriptors (5) Iterate to generate the high-order descriptor generation and to reduce collinearity problems, if needed.
Concerning the basic descriptors, easily obtainable physical quantities could appeal our intuition to construct interpretable models. Atomic radius, ionization potential, electron affinity, and electronegativity are adopted in this study and tabulated in Appendix A.
As for the symmetrization and inversion, we consider the following operations as The high-order descriptors are generated by multiplication of the first-order descriptors. From m first-order descriptors, m H n ¼ mþnÀ1 C n n th-order descriptors can be constructed. As mentioned above, every time highorder descriptors are generated, multicollinearity and near multicollinearity are removed by the LIDG method [7]. Here, multicollinearity is a linear dependency between descriptor vectors. Such a linear dependency often occurs when higher-order descriptor generations are performed. The existence of the linear dependency means that Xc ¼ 0 has non-trivial solutions c, where X ¼ ½x 1 ; x 2 ; :::; x p is a descriptor matrix (design matrix) with descriptor vector x i as the columns. p is the number of descriptors. Thus, to find multicollinearity, all non-trivial solutions of Xc ¼ 0 should be found. Fortunately, the non-trivial solutions can be easily found by computing the reduced row echelon form (RREF) [9] of X. In the LIDG method, X is linearly independentized by appropriately removing the detected descriptors having a multicollinearity relationship. Since the constant term is originally included in the regression model, constant terms additionally arising by multiplication are removed tacitly.

Model selection
In sparse modeling [4], the best model that has the highest predictivity is usually selected by the crossvalidation procedure [10,11]. For the purpose, we employ the leave-one-out scheme in this study, where 81 sets of data (target and descriptors) are used for the construction of model and the remaining one set of data called j is adopted for estimating the predictivity error [12] as where y i ,ŷ i , and y are true, predicted, and averaged target valuables, respectively. Then, the measure of predictivity for the model selection by the crossvalidation is calculated by average as with the total number of data set N (82 in this case). To obtain models as simple as possible, the exhaustive search method [13] for a given number of descriptors is employed.

Results
Using the procedures described in the previous section with the four kinds of basic descriptors, 86 descriptors are generated up to the second order, called descriptor space 1 (DS1) as listed in Appendix B. Figure 1 shows Q 2 for the best models by the exhaustive search as a function of the number of descriptors in DS1. That is, when M descriptors are used, linear regressions (ordinary least-squares method) are performed for all the combinations of p C M descriptors, and Q 2 of Equation (4) is calculated for each, and then the maximum value of Q 2 is plotted. Here, p ¼ 86 is the total number of descriptors in DS1. Detailed results of model selection are summarized in Appendix C. It is seen in Figure 1 that as the number of descriptors M is increased, the predictivity Q 2 with DS1 is also increased through M ¼ 3 and then almost saturated afterward. Therefore, the model with M ¼ 3 is appropriately simple with relatively high predictability. This model called Model 1 is given as where EN and r are electronegativity and atomic radius, respectively. The regression performance of Model 1 is shown in Figure 2. It is quite interesting that only the electronegativity and atomic radius are included in Model 1 with a simple form, but its physical meaning is not readily understandable. Electronegativity and atomic radius are known to be empirically correlated as [14,15] EN / 1 r though they are not so highly collinear that our nearcollinearity criteria judge. Note that Pearson's correlation coefficient is ρ ¼ À0:88. Therefore, atomic radius only and electronegativity only in the basic descriptor set are used on trial to generate 24 high-order descriptors up to fourth order for sparse modeling, called descriptor space 2 (DS2) and 3 (DS3), respectively, as listed in Appendix B. As results, it is found that DS2 gives much better Q 2 than DS3. For example, Q 2 for M ¼ 5 in DS2 and DS3 is 0.892 and 0.714, respectively. In Figure 1, Q 2 with DS2 becomes almost constant beyond M ¼ 2 and the model with M ¼ 2 might be a good one from the viewpoints of predictivity and interpretable sparse modeling, being called Model 2 expressed as 3 À5:02jr A À r B j þ 6:87 f g À 0:18 (7) and its regression performance is given in Figure 3. It should be emphasized that Model 2 is a really simple model including atomic-radius descriptors only at high predictivity (Q 2 ¼ 0:866). Regression performance of the present models (Model 1 and Model 2) and the previous ones (Model A, Model B, and Model C) is summarized in Table 1 in terms of decision coefficient  Table 1 are the best models with descriptors selecting one, two, and three, respectively, from left in the following descriptor list: jr sA À r pB j expðr sA Þ ; jr pB À r sB j expðr dA Þ : Note that the values of ionization potential, electron affinity, and electronegativity used in the present study are slightly different from those in the previous work [5]. Because of that, MAE and MaxAE do not perfectly coincide with those listed previously.

Discussion
Let us consider the possible consequences of Model 2 that is the simplest one among the models constructed in the preceding session. In the cases of elemental materials (r A ¼ r B ), ΔE becomes positive for r < 1:68 Å, preferring zinc-blende (properly diamond) structure. Actually, no elementary materials nor compounds with the same atomic radii greater than 1.68 Å are included in the present octet compounds. For compounds with largely different atomic radii, rock-salt structure with higher coordination than zinc blende is realized. From Equation (7), the borderline between ZB and RS structures, namely ΔE ¼ 0, is given as jr A À r B j ¼ À0:0359ðr A þ r B Þ 3 þ 1:37, providing a quantitative guideline to classify ZB and RS structures in the present systems. The borderline and the structural classification will be discussed further below in relation to van Arkel-Ketelaar's triangle of chemical bonding. Approximately, Model 2 shown in Equation (7) tells that the energy difference between ZB and RS structures is linearly scaled to the absolute difference in the atomic radius of the  Table A1.   Table A1. constituent atoms ( / jr A À r B j) and inversely proportional to the cell volume ( / ðr A þ r B Þ À3 ). In the octet compounds, the cohesion mechanism is dominated by covalent bonds with additive ionic electrostatic interactions. Covalent bonds originate from the formation of bonding and antibonding states between neighboring orbitals and are roughly proportional to the size of the corresponding hopping integrals. According to the scaling rules in the tight-binding theory [19][20][21], the hopping integral for neighboring p orbitals is proportional to R À3 , where R is the interatomic distance. Therefore, it is reasonable to see the inverse proportionality of the cell volume in the energy difference. Chemical trends in the stable structure directly derived from Model 2 are listed in Table 2.
Empirically, electronegativity is well known to be related to chemical bonding in compounds [22] and has an inverse relation to the atomic radius, as shown in Equation (6). With this relation, the trends with respect to the atomic radius listed in Table 2 can be converted to trends with respect to electronegativity given in Table 3. This result is consistent with our knowledge of the stable structure in semiconductors such that covalent (ionic) compounds tend to possess zinc-blende (rock-salt) crystal structure [23]. Nevertheless, it is quite interesting to able to model the energy difference ΔE quantitatively better with atomic radius than with electronegativity, as mentioned in the preceding section.
van Arkel-Ketelaar's triable is a map for displaying chemical bonding of compounds [24][25][26]. Metallic, ionic, and covalent bonding are represented in a twodimensional (2D) map with the axes of mean and difference of electronegativity of the constituent atoms in the latest version [27,28]. Following van Arkel-Ketelaar's triangle, the total energy difference given by Equation (7) is plotted in a 2D map of the sum and difference of atomic radius as shown in Figure 4. Figure 4 precisely reproduces the stable crystal structure, either zinc-blende or rock-salt and covalent or ionic bonding via relation between structure and chemical bonding. Note that the models constructed by regression include no information about chemical bonding characteristics beyond the training data. As a matter of fact, Model 2 can not represent metallic systems, that may correspond to an empty region in the present triangle shown in Figure 4.

Conclusions
A simple model quantifying energy difference between zinc-blende and rock-salt structure in octet elemental and binary semiconductors is obtained with only the information of atomic radius of constituent atoms, leading to a 2D map of chemical bonding represented in terms of the sum and difference of atomic radius. It is found that our descriptor-generation method including symmetrization for permutation, multiplication operation to higher order, and removal of collinearity problems is crucial to construct such a sparse model in addition to the exhaustive search. That is, since we use only symmetrized descriptors as initial descriptors, it is guaranteed that a correct model can be obtained at least in terms of symmetry. In addition, since inappropriate descriptors that do not satisfy symmetry are not included, the number of descriptor candidates can be reduced. The above two are the effects of descriptor symmetrization. On the other hand, the model obtained in the previous study does not satisfy the symmetry due to the permutation of A and B elements. Therefore, no matter how high the prediction accuracy, it can be said that this is a physically inappropriate model at least in symmetry. One would also like to mention the effect of removing multicollinearity. For example, if there is multicollinearity, such as x i ¼ ax j þ bx k , in descriptor matrix, the estimation and prediction accuracies do not change regardless of which one of x i , x j and x k is deleted from the descriptor matrix. Therefore, it cannot be decided from statistics whether x i , x j , or x k should be removed. In our LIDG method, however, since the multicollinearity has been detected prior to regression, one can introduce the simplicity of descriptors in the descriptor selection Table 3. Relations between electronegativity, stable structure, and chemical bond, derived from  process and employ two descriptors with a simpler form between x i , x j , and x k . Therefore, the obtained model is the simplest model among the models that give the same prediction accuracy. This is the advantage of the LIDG method in the detection and removal of multicollinearity. Table A1. Target data for regression taken from Ghiringhelli et al. [5]. ΔE is total energy difference (in eV/atom) between zincblende and rock-salt structures ΔE ¼ EðRSÞ À EðZBÞ of 82 elementary and binary systems AB.  [14]:, b Ref. [29]:, c Ref. [30]:, d Ref. [15]:, e Ref. [5].

Appendix A. Regression data
Target and descriptor data used in the present study are listed in Tables A1 and A2, respectively.