Data integration for multiple alkali metals in predicting coordination energies based on Bayesian inference

ABSTRACT Building machine learning models on datasets obtained from first-principles calculations is an important approach to exploring next-generation batteries. In previous studies, a rechargeable secondary battery dataset was constructed for the Li ion by density functional theory (DFT) calculations and later extended to five alkali metal ions. This dataset can be regarded as consisting of five alkali-metal-ion groups, and it is of natural interest whether it is preferable to build individual models specialized for each alkali metal ion or to build a single model by integrating the datasets. We quantitatively evaluate the possibility of data integration in the framework of Bayesian model selection and show that integrating the datasets is the more suitable approach. In addition, extracting new knowledge using feature selection is also important in exploring next-generation batteries. To further advance such knowledge extraction, the reliability of the selected features should be considered to avoid misinterpretation. We evaluate the confidence level of feature selection by calculating the posterior probabilities of features using Bayesian model averaging (BMA). We find that the confidence level of feature selection increases as the number of data increases.


Introduction
Li-ion batteries are widely used in electric vehicles and stationary batteries. The electrolyte is an important component of these batteries, but its development is more difficult than that of the electrodes because of the disordered structure of the liquid. The development of next-generation batteries using ions other than Li is progressing, but Li-ion batteries still predominate [1,2].
A data-driven scientific approach to materials exploration using machine learning has been investigated for next-generation batteries [3][4][5][6]. Sodeyama et al. [3] used machine learning to predict the coordination energy of the Li ion to solvents, an important measure of ion transport performance. They also extracted important features by feature selection. Subsequently, Ishikawa et al. [4] calculated the coordination energies for five alkali metal ions, including Li, to construct a broader dataset, and compared several machine learning methods.
The rechargeable secondary battery dataset constructed by Ishikawa et al. consists of combinations of five alkali metal ions and 66 electrolyte solvents and can be regarded as consisting of ion-grouped datasets. It is of natural interest to compare an approach that builds individual models specialized for each alkali metal ion with an approach that builds a single model for the entire dataset by integrating the datasets. Ishikawa et al. built only an integrated model for the dataset, and no comparison was made with the approach of building models individually. Therefore, we quantitatively evaluate the possibility of integrating the dataset using the Bayesian model selection framework [7,8].
In exploring next-generation batteries by data-driven science, it is also expected that machine learning can extract new knowledge. One important technique for this is feature selection. Machine learning models consisting of a small number of selected features are highly interpretable and support experts in considering the relationship between features and the target properties. Extracting knowledge using feature selection is widely studied in the field of materials informatics [3][4][5][6][7][8][9][10][11][12][13][14]. Since the results of feature selection become the basis of experts' considerations, the reliability of the results is also important to avoid misinterpretation. In previous battery research [3,4], LASSO and an exhaustive search method have been used, and the reliability of feature selection results has been evaluated qualitatively based on the weight diagram of the exhaustive search, but it has not been evaluated quantitatively. In this study, we evaluate the confidence level of feature selection by the marginal posterior probability of each feature based on Bayesian inference [15][16][17]. We calculate the marginal posterior probability within the BMA framework [18]. The reliability of machine learning models is widely studied, including in materials exploration and materials informatics, and such studies help researchers use machine learning properly and effectively while being aware of its risks [19][20][21][22][23][24][25][26][27][28][29][30][31].
Our main contributions are as follows.
(1) We evaluate the possibility of data integration based on Bayesian model selection and show that the integrated model is more suitable than the individual models.
(2) We evaluate the confidence level of feature selection by Bayesian model averaging. We show that features considered important in previous research have a high confidence level, and that the confidence level increases as the number of data increases.
The code used in this paper is available at https://github.com/okada-lab/exhbma.

Bayesian linear regression and exhaustive search method
We formulate the linear regression model on the basis of Bayesian inference [32]. Each data point is represented by a p-dimensional feature vector $\mathbf{x} = (x^{(1)}, \ldots, x^{(p)})^{\top} \in \mathbb{R}^p$, and a target value is denoted as $y \in \mathbb{R}$. When we observe N samples, the whole dataset is denoted as $X = (\mathbf{x}_1, \ldots, \mathbf{x}_N)^{\top} \in \mathbb{R}^{N \times p}$ and $\mathbf{y} = (y_1, \ldots, y_N)^{\top} \in \mathbb{R}^N$. For simplicity, we assume that the observed feature vectors and target values are centered and, in particular, that the feature vectors are standardized; thus, the following relationships hold for the target value y and every i-th feature $x^{(i)}$:
$$\sum_{n=1}^{N} y_n = 0, \qquad \sum_{n=1}^{N} x_n^{(i)} = 0, \quad (1)$$
$$\frac{1}{N} \sum_{n=1}^{N} \left( x_n^{(i)} \right)^2 = 1. \quad (2)$$
We do not assume the standardization of the target values, in view of the model selection discussed in Section 2.3. Suppose that the true relationship is modeled as a linear function of the feature $\mathbf{x}$. When we observe the target value, an observation noise is always included, so we model the target value y as the sum of the linear model and the observation noise ε. The noise variable ε follows a normal distribution with mean 0 and variance $\sigma_\varepsilon^2$. This observation process is formulated as
$$y = w_0 + \mathbf{w}^{\top}\mathbf{x} + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2). \quad (3)$$
When the N observed data are independent, the likelihood of the entire observation process is expressed as
$$p(\mathbf{y} \mid X, w_0, \mathbf{w}, \sigma_\varepsilon) = \mathcal{N}\!\left(\mathbf{y} \mid w_0 \mathbf{1}_N + X\mathbf{w},\ \sigma_\varepsilon^2 I_N\right), \quad (4)$$
where $\mathbf{1}_N$ denotes the N-dimensional vector with all elements equal to one and $I_N$ denotes the $N \times N$ identity matrix.
For the prior distribution of the coefficients $\mathbf{w}$, we set a normal distribution with mean 0 and variance $\sigma_w^2$ for each coefficient independently:
$$p(\mathbf{w} \mid \sigma_w) = \prod_{i=1}^{p} p(w_i \mid \sigma_w), \quad (5)$$
$$p(w_i \mid \sigma_w) = \mathcal{N}(w_i \mid 0, \sigma_w^2). \quad (6)$$
This prior distribution expresses the prior knowledge that the magnitude of the coefficients will be around $\sigma_w$. We set the following prior distribution for the intercept $w_0$:
$$p(w_0) = \mathcal{N}(w_0 \mid 0, \sigma_y^2), \quad (7)$$
where $\sigma_y^2$ is the variance of the target value y, not a hyperparameter like $\sigma_\varepsilon$ and $\sigma_w$. Since we apply the centering preprocessing described in Equations (1) and (2), the mean value of the intercept can be considered to be 0 [33]. We use the standard deviation of the target value $\sigma_y$, which is a data-dependent value, as a proxy for the scale of the prior uncertainty of the intercept $w_0$ [18]. We omit $\sigma_y$ from the notation $p(w_0)$ to avoid the confusion that $\sigma_y$ is a hyperparameter. From Bayes' theorem,
$$p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}, \quad (8)$$
the joint posterior distribution of the intercept $w_0$ and the coefficients $\mathbf{w}$ given the data X and $\mathbf{y}$ is calculated as
$$p(w_0, \mathbf{w} \mid X, \mathbf{y}, \sigma_\varepsilon, \sigma_w) = \frac{p(\mathbf{y} \mid X, w_0, \mathbf{w}, \sigma_\varepsilon)\, p(\mathbf{w} \mid \sigma_w)\, p(w_0)}{p(\mathbf{y} \mid X, \sigma_\varepsilon, \sigma_w)}, \quad (9)$$
where the denominator is the likelihood of the prediction model with fixed values of $\sigma_\varepsilon$ and $\sigma_w$,
$$p(\mathbf{y} \mid X, \sigma_\varepsilon, \sigma_w) = \int p(\mathbf{y} \mid X, w_0, \mathbf{w}, \sigma_\varepsilon)\, p(\mathbf{w} \mid \sigma_w)\, p(w_0)\, d\mathbf{w}\, dw_0 \quad (10)$$
$$= \mathcal{N}\!\left(\mathbf{y} \,\middle|\, \mathbf{0},\ \sigma_\varepsilon^2 I_N + \sigma_w^2 X X^{\top} + \sigma_y^2 \mathbf{1}_N \mathbf{1}_N^{\top}\right). \quad (11)$$
Following the maximum a posteriori (MAP) estimation, the intercept $w_0$ and coefficients $\mathbf{w}$ are estimated by maximizing the posterior distribution expressed as Equation (9); in this case, the estimation method is equivalent to Ridge regression [33]. By using Equations (5)-(7), the posterior distribution of the intercept $w_0$ and coefficients $\mathbf{w}$ is calculated as
$$p(w_0, \mathbf{w} \mid X, \mathbf{y}, \sigma_\varepsilon, \sigma_w) = \mathcal{N}\!\left(w_0 \,\middle|\, 0,\ \left(\frac{N}{\sigma_\varepsilon^2} + \frac{1}{\sigma_y^2}\right)^{-1}\right) \mathcal{N}\!\left(\mathbf{w} \mid \boldsymbol{\mu}, \Lambda^{-1}\right), \quad (12)$$
where $\Lambda$ is the precision matrix and $\boldsymbol{\mu}$ is the mean vector:
$$\Lambda = \frac{1}{\sigma_\varepsilon^2} X^{\top} X + \frac{1}{\sigma_w^2} I_p, \quad (13)$$
$$\boldsymbol{\mu} = \frac{1}{\sigma_\varepsilon^2} \Lambda^{-1} X^{\top} \mathbf{y}. \quad (14)$$
From this result, the MAP solution for the intercept $w_0$ is 0, and that for the coefficients $\mathbf{w}$ is $\boldsymbol{\mu}$.
We consider a prior distribution for the hyperparameters $\sigma_\varepsilon$ and $\sigma_w$, and the final likelihood of the model is calculated as
$$p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid X, \sigma_\varepsilon, \sigma_w)\, p(\sigma_\varepsilon, \sigma_w)\, d\sigma_\varepsilon\, d\sigma_w, \quad (15)$$
where $p(\sigma_\varepsilon, \sigma_w)$ is properly defined on the interval $(0, \infty)$. This marginalization over $\sigma_\varepsilon$ and $\sigma_w$ cannot be performed analytically, so numerical integration is performed. Note that $\sigma_y$ in Equation (7) is not a hyperparameter, so it is not marginalized.
So far we have used all p features, but we can also consider submodels by choosing which features to include in the model and compare their performance. To specify a submodel, we introduce an indicator vector
$$\mathbf{c} = (c_1, \ldots, c_p)^{\top} \in \{0, 1\}^p. \quad (16)$$
The i-th feature $x^{(i)}$ is included in the model when $c_i = 1$ and excluded when $c_i = 0$. By introducing a submatrix $X_I$ that extracts the columns i of X for which $c_i = 1$, and substituting $X_I$ for X in Equation (15), we obtain the marginal likelihood $p(\mathbf{y} \mid \mathbf{c}, X)$. An exhaustive search, which compares all $2^p$ models, is often carried out to select the model with the best evaluation metric. In the previous study by Ishikawa et al. [4], for example, the cross-validation error [33] is used as the evaluation metric. A null model, i.e. a constant prediction model in which no feature is considered, is sometimes excluded from the search set depending on the analysis. However, since we have no strong reason to exclude the null model, we include it in the following formulation.
When we apply the exhaustive search method to the Bayesian linear regression model formulated above, we can compare models by the posterior probability of the model,
$$p(\mathbf{c} \mid X, \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{c}, X)\, p(\mathbf{c})}{p(\mathbf{y} \mid X)}, \quad (17)$$
where the denominator,
$$p(\mathbf{y} \mid X) = \sum_{\mathbf{c}} p(\mathbf{y} \mid \mathbf{c}, X)\, p(\mathbf{c}), \quad (18)$$
is the likelihood for the entire model set. We use $\sum_{\mathbf{c}}(\cdot)$ to express the summation over all combinations of the indicator vector. For the prior distribution of the model specified by the indicator vector, $p(\mathbf{c})$, we set a Bernoulli distribution for each element of the indicator vector:
$$p(\mathbf{c} \mid \alpha) = \prod_{i=1}^{p} \alpha^{c_i} (1 - \alpha)^{1 - c_i}. \quad (19)$$
The hyperparameter α represents the prior probability that each feature is included in the model, and the expected number of features included in the model is $E\left[\sum_i c_i\right] = \alpha p$. In this paper, we fix α = 0.5 and omit α from the notation of Equation (19). With this setting, all models have equal prior probabilities.
In the exhaustive search method, it is also possible to evaluate the importance of a feature using a weight diagram, which plots the values of the coefficients arranged in the order of model performance [3][4][5].

BMA and feature selection
In this study, we consider the uncertainty of each submodel specified by the indicator vector c and introduce a method for quantitatively evaluating the confidence level of feature selection using an average of the submodels weighted by their posterior probabilities, which is called BMA [18]. An overview of this method is shown in Figure 1. This method makes it possible to evaluate the confidence level of feature selection and to quantify the importance of features, which was previously evaluated only qualitatively via the weight diagram of the exhaustive search method.
For a random variable τ that one wants to predict, such as the target value y, the coefficient of a feature $w_i$, or an element of the indicator vector $c_i$, the prediction by BMA is formulated as
$$p(\tau \mid X, \mathbf{y}) = \sum_{\mathbf{c}} p(\tau \mid \mathbf{c}, X, \mathbf{y})\, p(\mathbf{c} \mid X, \mathbf{y}). \quad (20)$$
The summation over all combinations of indicator vectors can be calculated using the result of the exhaustive search; therefore, we refer to prediction in this form as ES-BMA. When applied to the prediction of the target value $\hat{y}$ for new data $\mathbf{x}$, Equation (20) is expressed as
$$p(\hat{y} \mid \mathbf{x}, X, \mathbf{y}) = \sum_{\mathbf{c}} p(\hat{y} \mid \mathbf{x}, \mathbf{c}, X, \mathbf{y})\, p(\mathbf{c} \mid X, \mathbf{y}). \quad (21)$$
This BMA prediction integrates the predictions of the submodels, taking their uncertainty $p(\mathbf{c} \mid X, \mathbf{y})$ into account.
The prediction by BMA is superior to the prediction by any single model specified by $\mathbf{c}$ when predicting new data [18]. When applied to feature selection, we estimate the probability that the i-th feature is included in the model by marginalizing over the models with $c_i = 1$:
$$p(c_i = 1 \mid X, \mathbf{y}) = \sum_{\mathbf{c}_{\setminus i}} p(c_i = 1, \mathbf{c}_{\setminus i} \mid X, \mathbf{y}), \quad (23)$$
where $\mathbf{c}_{\setminus i}$ is the indicator vector excluding the element $c_i$. This posterior probability is well calibrated to the precision of feature selection [15], and we use this probability value as the confidence level of feature selection.
Here, we briefly explain the confidence evaluation for feature selection in contrast to regression and classification. In the case of regression, we can evaluate the confidence of the target variable prediction by estimating the confidence interval [19][20][21]. In the case of classification, when using a discriminant function such as Fisher's linear discriminant, confidence evaluation is not performed. However, we can evaluate the confidence level of the estimated class by the model's output probability value using a discriminative model that models the class posterior probability, such as logistic regression [32]. The case of feature selection is similar to classification. When features are selected by parameter optimization, such as in LASSO, confidence evaluation is not performed. However, we can evaluate the confidence level of the selection result by the posterior probability value based on Bayesian inference [15][16][17].
In addition to model interpretability, feature selection has the advantage of reducing the computational cost. Prediction by BMA using Equation (21) requires all the features and all the submodels in the prediction phase, which is computationally expensive. To make a good prediction with a single model using fewer features, one should select the features whose posterior probability calculated by Equation (23) is above 0.5, rather than the features in the model with the highest posterior probability $p(\mathbf{c} \mid X, \mathbf{y})$ [37].
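Equation (23) and the above-0.5 selection rule follow directly from the exhaustive-search output. In the sketch below, `models` and `post` denote the enumerated indicator vectors and their posterior probabilities (assumed names, matching the earlier sketch rather than the paper's code):

```python
import numpy as np

def feature_inclusion_probabilities(models, post):
    """Eq. (23): p(c_i = 1 | X, y) is the total posterior probability
    of all models in which feature i is included."""
    M = np.asarray(models)   # shape (2^p, p), 0/1 entries
    q = np.asarray(post)     # shape (2^p,), sums to 1
    return M.T @ q           # shape (p,)

def select_features(models, post, threshold=0.5):
    """Select features whose inclusion probability exceeds the threshold
    (the 'median probability model' rule), not the MAP model's features."""
    probs = feature_inclusion_probabilities(models, post)
    return [i for i, v in enumerate(probs) if v > threshold]
```

The matrix-vector product makes the marginalization over all 2^p models a single O(2^p · p) operation once the model posteriors are in hand.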

Model selection for different model building approaches
When we have multiple approaches in building a model and want to choose the best one, model selection is performed. In Bayesian inference, model selection is performed based on the negative log likelihood of the model, also known as the Bayesian free energy [7,8]. For simplicity, we use the log likelihood for model comparison and choose the approach with the larger log likelihood.
In this paper, we consider a case in which a dataset is regarded as consisting of subgroups, and we want to compare an approach that builds individual models for the subgroups with an approach that builds a single model for the entire dataset by integrating the subgroups. In general, as the number of data increases, the model accuracy and the stability of the learning process increase; therefore, integrating datasets can improve model accuracy. On the other hand, integrating datasets with significantly different underlying mechanisms may negatively affect model learning. Therefore, we need to compare these two approaches.
Let k be a subscript representing a subgroup. The dataset for subgroup k is denoted as $(X^{(k)}, \mathbf{y}^{(k)})$, and the set of parameters in the corresponding model is $\theta^{(k)} = (w_0^{(k)}, \mathbf{w}^{(k)}, \sigma_\varepsilon^{(k)}, \sigma_w^{(k)})$, with likelihood
$$p(\mathbf{y}^{(k)} \mid X^{(k)}) = \sum_{\mathbf{c}^{(k)}} \int d\theta^{(k)}\, p(\mathbf{y}^{(k)}, \theta^{(k)} \mid \mathbf{c}^{(k)}, X^{(k)})\, p(\mathbf{c}^{(k)}). \quad (24)$$
When considering individual models for all subgroups, the likelihood is
$$p(\{\mathbf{y}^{(k)}\} \mid \{X^{(k)}\}) = \prod_k p(\mathbf{y}^{(k)} \mid X^{(k)}). \quad (25)$$
Here, we assume that the models for the subgroups are independent.
On the other hand, if all datasets are integrated and treated as a single dataset, the likelihood is calculated by replacing the k-dependent parameters in Equation (24) with shared ones:
$$p(\mathbf{y} \mid X) = \sum_{\mathbf{c}_{\mathrm{all}}} \int d\theta_{\mathrm{all}}\, p(\mathbf{y}, \theta_{\mathrm{all}} \mid \mathbf{c}_{\mathrm{all}}, X)\, p(\mathbf{c}_{\mathrm{all}}). \quad (26)$$
Model selection between the individual-models approach and the integrated-model approach is performed by comparing Equations (25) and (26). Since Equations (25) and (26) are probability distributions of the target variable $\mathbf{y}$, they are affected by scaling operations on $\mathbf{y}$: for example, when y is multiplied by a, $p(\mathbf{y} \mid X)$ changes by a factor of 1/a. If we standardized $\mathbf{y}$, the individual and integrated models would undergo different variable transformations, so the comparison of probability distributions would not be correct. To prevent this effect, only the centering operation is performed on the target variable $\mathbf{y}$, and the standardizing operation is not.
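In practice, the comparison of Equations (25) and (26) reduces to comparing a sum of per-group log evidences against the log evidence of the pooled data. A sketch under stated assumptions: `log_evidence_fn` is any function computing log p(y|X) for one dataset (an assumed interface), and each y is already centered per the text's requirement that no standardization is applied:

```python
import numpy as np

def compare_integration(datasets, log_evidence_fn):
    """Model selection between individual and integrated models.

    datasets: list of (X, y) pairs, one per subgroup.
    Returns (log evidence of individual models, Eq. (25) in log form;
             log evidence of the integrated model, Eq. (26)).
    The approach with the larger value is preferred.
    """
    # Eq. (25): independence across subgroups -> log evidences add
    log_individual = sum(log_evidence_fn(X, y) for X, y in datasets)
    # Eq. (26): pool all subgroups into a single dataset
    X_all = np.vstack([X for X, _ in datasets])
    y_all = np.concatenate([y for _, y in datasets])
    return log_individual, log_evidence_fn(X_all, y_all)
```

This is exactly the comparison reported in Table 1 for the five alkali-metal-ion subgroups.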

Dataset
We use the rechargeable secondary battery dataset of Ishikawa et al. [4] to show the effectiveness of ES-BMA. In this dataset, the coordination energy is the target property, since it can be considered a good indicator of battery performance. The coordination energy is obtained by DFT calculation. The dataset consists of 330 data points, which are combinations of five alkali metal ions (Li, Na, K, Rb, and Cs) and 66 electrolyte solvents. Seventy electrolyte solvents were used in the original study, but four were duplicates, so we exclude them. The feature vector consists of 13 elements, three derived from the alkali metal ions and 10 from the electrolyte solvents. Of the 13 features, six are also obtained by DFT calculation.

Experimental setting
As mentioned in Sect. 2.2, we use a Bernoulli distribution, Equation (19), with the hyperparameter α = 0.5 for the prior distribution of the model specified by the indicator vector.
To integrate over the hyperparameters σ_ε and σ_w in the likelihood, Equation (15), numerical integration is performed using values calculated on a grid within a finite range.
For the prior distribution of the hyperparameters σ_ε and σ_w, we use a distribution $p(x) \propto x^{-1}$ defined on a finite range, which is the noninformative prior for a scale-invariant parameter [32], reflecting the case in which we have little prior knowledge. The prior $p(x) \propto x^{-1}$ itself is an improper prior, which cannot be normalized on the interval $(0, \infty)$. However, it can be expressed asymptotically by a proper distribution, for example, the gamma distribution $g(x) = \frac{1}{\Gamma(k)\,\theta^k}\, x^{k-1} e^{-x/\theta}$ in the limit of shape parameter $k \to 0$ and scale parameter $\theta \to \infty$.

Model selection between individual models and an integrated model
The rechargeable secondary battery dataset consists of combinations of five alkali metal ions and 66 electrolyte solvents and can be regarded as consisting of ion-grouped datasets. Therefore, it is of natural interest to compare an approach that builds individual models specialized for each alkali metal ion with an approach that builds a single model for the entire dataset by integrating the datasets. By applying the model selection framework described in Sect. 2.3, we compare the two approaches. In this case, the subgroup index k introduced in Sect. 2.3 corresponds to an alkali metal ion (k = Li, Na, K, Rb, Cs). Table 1 shows the likelihoods of the individual models and the integrated model; for the individual models, the likelihood of each alkali metal ion model is also listed. Since the integrated model attains the larger likelihood, we can integrate all the datasets and regard them as following an identical distribution.

Application of ES-BMA
From the result of model selection, the integrated model is the main target of analysis. However, to elucidate the effect of data integration, we compare the results of both approaches.

Figure 2 shows the posterior probabilities of features for the individual models. The posterior probabilities of the natural bond orbital (NBO) charge of the O atom and the total dipole are close to 1, and those of the highest occupied molecular orbital (HOMO) energy and the melting point are close to 0, so the selection results for these features can be trusted with high confidence. This result is consistent with the previous study, in which the NBO charge of the O atom and the total dipole are strongly related to the coordination energy [4]. On the other hand, the posterior probabilities of the total energy and the density are approximately 0.5, showing that the confidence levels of the selection results for these features are low.

Figure 3 shows the posterior probabilities of features calculated for the integrated model. The posterior probabilities are close to 0 or 1, indicating that the results of feature selection are clearer than those of the individual models. It can be considered that the confidence level of feature selection increases with the number of data compared with the individual models. The features with a posterior probability greater than 0.5 were also selected in the previous study by Ishikawa et al. [4], who used an exhaustive search method.

Figure 4 shows the parity plot between the coordination energy predicted by ES-BMA (Equation (21)) and that obtained by DFT calculation. The data assigned to the test set in each split of the 10-fold cross-validation are plotted. The accuracy obtained by the 10-fold cross-validation is 0.131 eV, comparable to that obtained by Ishikawa et al. [4], who directly optimized the cross-validation error.

Figure 5 shows how the posterior probability of each feature changes across all the data integration patterns.
The horizontal axis is the number of integrated alkali metal ion datasets, and the vertical axis is the posterior probability. The posterior probabilities of the NBO charge of the O atom and the total dipole approach 1 early in the data integration process. In previous studies, these two features were also considered particularly important [4]. The evaluation of posterior probabilities may thus contribute to the early identification of essential features during the data collection process.
As we integrate the datasets, the posterior probabilities move closer to 0 or 1, and the confidence level becomes clearer. In particular, the lowest unoccupied molecular orbital (LUMO) energy, which has low importance in the individual models, becomes more important as the datasets are integrated. Based on the model selection result, we can regard the integrated datasets as following an identical distribution, so we consider this improvement in the confidence level to be due to the increase in the number of data.

Posterior distribution of hyperparameters
To show that the finite range of the prior distribution for the hyperparameters σ_ε and σ_w is appropriate for the numerical integration, we examine the posterior distribution of these two parameters. From Bayes' theorem, the posterior distribution is obtained by normalizing the product of the likelihood of the two parameters, calculated by Equation (11), and the prior distribution p(σ_ε, σ_w). The result is shown in Figure 6, which reveals that the peak of the posterior distribution is well contained in the region on which the prior distribution is defined. By comparing the posterior distribution of the individual model with that of the integrated model, we find that the peak is sharper for the integrated model, consistent with its larger number of data.

Conclusions
In this paper, we introduced a framework for verifying data integration and a feature selection method with confidence-level evaluation, using the rechargeable secondary battery dataset. This dataset can be regarded as consisting of five alkali-metal-ion groups; hence, it is of natural interest whether it is preferable to build individual models specialized for each alkali metal ion or a single model that integrates the datasets. We showed the validity of building a single model by performing model selection in the framework of Bayesian inference. The results provide quantitative guidance on whether to build a single integrated model or individual models when multiple datasets are available. Note that in this paper, we performed data integration only within simulation datasets. Recently, the integration of experimental and simulation datasets has been studied [38]; as future work, we plan to validate the integration of datasets from both simulations and experiments.
In exploring next-generation batteries, knowledge extraction through feature selection is also important. In advancing discussions based on feature selection, it is necessary to consider the reliability of the feature selection results. We evaluated the confidence level of feature selection by the marginal posterior probability of each feature. In ES-BMA, we use Bayesian model averaging and calculate the posterior probability of features while considering the uncertainty of the models searched by the exhaustive search. By applying ES-BMA, the selection results of the previous study were confirmed to be highly confident. It was also shown that the confidence level becomes higher by integrating the datasets. In particular, the LUMO energy, which has a low posterior probability in the individual models, proves to be important when the datasets are integrated.
In materials informatics, the number of data is often small. Therefore, the data integration framework and the uncertainty evaluation method for feature selection become important: the former improves model performance by increasing the amount of data, and the latter helps experts extract new knowledge while accounting for the data fluctuations caused by small datasets.
One of the challenges of the ES-BMA is that it is computationally expensive for both learning and inference since it involves an exhaustive search and numerical integration. In particular, the computational cost of an exhaustive search increases exponentially with respect to the number of features; thus, it is often computationally prohibitive for real data. To overcome this problem, a sampling approach such as a Monte Carlo method can be used to estimate the result of ES-BMA in a statistically unbiased manner [18,35].

Disclosure statement
No potential conflict of interest was reported by the author(s).