Adaptive sampling methods via machine learning for materials screening

ABSTRACT High-throughput virtual screening using a combination of first-principles calculations and Bayesian optimization (BO) has attracted much attention as a method for efficient materials exploration. The purpose of virtual screening is often to search for materials whose properties meet a certain target criterion, whereas conventional BO aims to find the global extremum. Some recent works apply conventional BO to this purpose by converting the target properties. On the other hand, an adaptive sampling method in which the acquisition function is the probability that a data point achieves a target property within a specific range was suggested previously [Kishio et al., Chemom. Intell. Lab. Syst. 127, 70 (2013)]. In this paper, we demonstrate that such adaptive sampling is effective for the exploration of materials whose properties meet target criteria. We conducted simulations of materials exploration using an in-house database constructed by first-principles calculations and compared the performance of the adaptive sampling and conventional BO approaches. Furthermore, we evaluate and discuss the performance of acquisition functions extended to multi-objective problems for materials exploration, considering multiple target properties simultaneously.


Introduction
High-throughput virtual screening using first-principles calculations is a powerful tool for exploring materials with desired properties [1][2][3][4]. Recent improvements in computers and algorithms enable us to cover a large search space in the screening and to construct computational materials databases possessing several hundred thousand or even a million entries [5][6][7]. However, when the number of candidate materials is extremely large and/or computationally demanding approximations such as hybrid functionals [8] and the GW approximation [9] are required for reliable prediction of target material properties, it is still unfeasible to perform brute-force first-principles calculations for all candidate materials because of the high computational cost.
Machine learning techniques are expected to accelerate high-throughput virtual screening by constructing surrogate models and making fast predictions, thereby overcoming this situation [10][11][12][13][14][15][16][17]. Recently, adaptive sampling, or black-box optimization, which suggests data points to be added by repeating the construction of a surrogate model and the acquisition of promising data points through demanding evaluations such as first-principles calculations, has attracted much attention as a method for rapidly discovering optimal data points [18,19]. Bayesian optimization (BO), a global optimization method based on adaptive sampling, is one of the most commonly used methods in materials exploration [18][19][20][21][22][23][24][25][26][27]. Moreover, a multi-objective method based on BO has been developed to search for the Pareto front and has been applied to the exploration of materials in terms of multiple properties [27][28][29][30].
BO is designed for global optimization, the purpose of which is to obtain the data point whose objective value is the maximum or minimum. However, the purpose of high-throughput screening is often to collect materials whose target properties meet particular criteria, rather than to discover a few materials whose target properties are extremely large or small. Even if the obtained materials do not have the maximum/minimum target properties, they might still deserve further consideration, such as detailed simulations and synthetic experiments. Examples include the exploration of semiconductors possessing band gaps in a target energy range. In other cases, it would be sufficient if a material property is above/below a specific criterion. For example, we often consider materials to be stable if their formation energies are negative. To meet these requirements, some researchers have transformed the target property by a conversion function to search for materials holding band gaps within a desirable range [24,26].
On the other hand, a method for efficiently acquiring a data point holding a target property within a specific range has been developed to determine appropriate experimental conditions quickly [31,32]. Another work has reported a method for searching for data points whose target properties are above/below a specific threshold, specifically to collect sets of lattice mismatches and interface energies in a 'low error region', in which the error between the experimental and computed aspect ratios of the precipitate is lower than a specific threshold [33]. Recently, such a method has been extended to multi-objective problems, i.e. the exploration of data points achieving multiple target properties in specific ranges. It has been applied to the optimization of the equipment and operating conditions of ethylene oxide production plants to control multiple properties such as reactor inlet concentration, conversion ratio and product limitations [34]. These methods are expected to be effective in materials search under the aforementioned conditions.
In this paper, we report on the development of an adaptive sampling method for high-throughput virtual screening, which enables us to efficiently find materials having a target property within a specific range, based on the previous works [31,33]. This method uses the probability that objective values fall within a specific range as an acquisition function, which determines the priority of data point selection. We assessed the performance of both the conventional BO and our method by conducting simulations of material exploration using an in-house computational database, which contains the formation energies and band gaps of 2,018 nitrides and pnictides obtained by high-throughput first-principles calculations. Moreover, we evaluate the performance of two acquisition functions extended to multiple target properties: one is a simple product of single-objective acquisition functions for respective target properties, and the other is the acquisition function proposed by Iwama and Kaneko [34]. These methods can be applied to the exploration of materials that simultaneously satisfy multiple criteria; for example, the band gap is in a target range and the formation energy is sufficiently low. We also discuss the performance of these adaptive sampling methods for multiple conditions using our computational database.

Gaussian process regression
In this work, we employed Gaussian process (GP) regression, as implemented in the PHYSBO code [27], to construct a surrogate model and estimate the mean and standard deviation of the prediction for adaptive sampling.
To perform GP regression, we first need to determine the covariance (or kernel) function, which defines the similarity of two points in the descriptor space. In this study, we employed the Gaussian (radial basis function) kernel, given by

k(x, x') = exp(−‖x − x'‖² / (2η²)),

where x and x' are arbitrary points in the descriptor space and η is a hyperparameter. Let us assume that we know data points D = {(x_i, y_i)}_{i=1..N}, where x_i and y_i are the descriptor and objective variable of the ith data point, respectively. The mean and standard deviation of the prediction at an arbitrary descriptor x, μ(x) and σ(x), respectively, are estimated as

μ(x) = k(x)^T (K + sI)^{-1} y,
σ(x)² = k(x, x) − k(x)^T (K + sI)^{-1} k(x),

where s is a hyperparameter, I is the identity matrix, k(x) = (k(x, x_1), ..., k(x, x_N))^T, y = (y_1, ..., y_N)^T and K is the kernel matrix with elements K_ij = k(x_i, x_j). The hyperparameters are optimized by maximizing the type-II likelihood [35].
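As an illustration, the GP posterior mean and standard deviation can be computed directly with NumPy. This is a minimal sketch, not the PHYSBO implementation: the hyperparameters η and s are fixed here rather than optimized by the type-II likelihood.

```python
import numpy as np

def gaussian_kernel(X1, X2, eta):
    """Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 eta^2))."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * eta ** 2))

def gp_posterior(X_train, y_train, X_query, eta=1.0, s=1e-6):
    """Posterior mean mu(x) and standard deviation sigma(x) of the GP,
    with s playing the role of the noise hyperparameter."""
    K = gaussian_kernel(X_train, X_train, eta)
    K_inv = np.linalg.inv(K + s * np.eye(len(X_train)))
    k_star = gaussian_kernel(X_query, X_train, eta)      # rows are k(x)^T
    mu = k_star @ K_inv @ y_train
    # k(x, x) = 1 for this kernel, so the variance simplifies:
    var = 1.0 - np.sum((k_star @ K_inv) * k_star, axis=1)
    return mu, np.sqrt(np.clip(var, 0.0, None))
```

At observed points the predictive mean interpolates the data and the standard deviation is nearly zero, while far from the data it approaches the prior standard deviation of one.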

Conventional Bayesian optimization
As mentioned above, BO is a method for global optimization of a black-box function that alternates between surrogate model construction by machine learning and data point selection based on the machine learning prediction. In BO, the acquisition function determines the priority of data point selection. In this study, we employed two typical acquisition functions of the conventional BO: the probability of improvement (PI) and the expected improvement (EI).
Based on the GP model, the probability density function of the target value y at the data point with descriptor x is represented by the normal distribution, namely,

p(y) = 1/√(2πσ(x)²) exp(−(y − μ(x))² / (2σ(x)²)).

PI is the probability that the objective value at descriptor x is higher than that of any known data point:

PI(x) = Φ((μ(x) − y_max) / σ(x)),

where y_max and Φ are the highest objective value of the known data points and the cumulative distribution function of the standard normal distribution, respectively. On the other hand, EI is the expected value of the improvement, namely,

EI(x) = (μ(x) − y_max) Φ(z) + σ(x) φ(z), with z = (μ(x) − y_max) / σ(x),

where φ is the probability density function of the standard normal distribution. Although using y_max in the acquisition function is reasonable and straightforward for global optimization, it is inappropriate for collecting data points that meet a criterion, for the following two reasons. (1) If μ(x) is smaller than y_max, these functions give priority to data points with large σ(x) rather than data points with preferable μ(x). Therefore, once y_max becomes much higher than the criterion, the conventional BO prioritizes data points with large σ(x) over data points with sufficiently and moderately large μ(x) and small σ(x). (2) If the maximum of the target property or conversion function is finite, then once a data point attaining that maximum is found, the 'probability/expected value of exceeding y_max' becomes theoretically meaningless because the target value of no other data point can exceed y_max.
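For reference, PI and EI can be evaluated from the GP mean and standard deviation with a few lines of SciPy; this sketch is vectorized over candidate points.

```python
import numpy as np
from scipy.stats import norm

def probability_improvement(mu, sigma, y_max):
    """PI(x) = Phi((mu(x) - y_max) / sigma(x))."""
    return norm.cdf((mu - y_max) / sigma)

def expected_improvement(mu, sigma, y_max):
    """EI(x) = (mu(x) - y_max) Phi(z) + sigma(x) phi(z),
    with z = (mu(x) - y_max) / sigma(x)."""
    z = (mu - y_max) / sigma
    return (mu - y_max) * norm.cdf(z) + sigma * norm.pdf(z)
```

Note that both functions stay positive even when mu is well below y_max, with the value then driven almost entirely by sigma, which is exactly the behaviour criticized in the text.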

Acquisition function for a target range and its extension to the consideration of multiple objectives
For our purpose, it is straightforward to use the probability that a value falls within a target range, as proposed by Kishio et al. [31], given as

p(y_lower ≤ y ≤ y_upper) = Φ((y_upper − μ(x)) / σ(x)) − Φ((y_lower − μ(x)) / σ(x)),

where y_lower and y_upper are the lower and upper limits of the target range, respectively. If we consider only one of y_lower or y_upper, i.e. y_upper = +∞ or y_lower = −∞, this function reduces to

Φ((μ(x) − y_lower) / σ(x)) or Φ((y_upper − μ(x)) / σ(x)),

respectively, as suggested by Tsukada et al. [33]. In this paper, we call these acquisition functions the 'Probability that a value falls within a Target Range (PTR)'. The functional form of the PTR is similar to that of the PI; the main difference is whether the thresholds are fixed or updated at each exploration step. Fixed thresholds enable us to properly prioritize promising data points whose target properties are likely to be moderate yet sufficiently large/small, even after we acquire data with much larger/smaller target properties than the thresholds. As another advantage of the PTR, we can set the target thresholds without any conversion functions.
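Given μ(x) and σ(x), the PTR is essentially a one-liner; the sketch below (an illustration, not the authors' code) covers both the two-sided and one-sided cases by using infinite default bounds.

```python
import numpy as np
from scipy.stats import norm

def ptr(mu, sigma, y_lower=-np.inf, y_upper=np.inf):
    """Probability that the prediction falls within [y_lower, y_upper].
    Passing only one bound recovers the one-sided variants of
    Tsukada et al., since norm.cdf(+inf) = 1 and norm.cdf(-inf) = 0."""
    return norm.cdf((y_upper - mu) / sigma) - norm.cdf((y_lower - mu) / sigma)
```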
Furthermore, such methods can be extended to multiple objectives, aiming to explore materials that meet multiple conditions simultaneously (e.g. materials holding both band gaps within a target range and energies less than a certain criterion). Let us assume that the purpose is to collect the materials whose n target properties y_1, y_2, ..., y_n simultaneously satisfy the following inequalities:

y_lower,i ≤ y_i ≤ y_upper,i   (i = 1, ..., n),   (11)

where y_lower,i and y_upper,i are the lower and upper limits of the target range for property y_i. Then, the PTR of each inequality can be calculated by performing GP regression for each of the target properties.
Although the covariance of the predictions of the y_i could be captured by a multi-output GP [36], we evaluate the probability of satisfying all inequalities by constructing an independent single-output GP model for each y_i and assuming that the predicted values are independent. This not only simplifies the method but also enables us to use a different descriptor set for each GP model (e.g. we can select descriptors according to their importance for each target property, as discussed in Section 2.3). We denote the descriptor set for the ith GP model as x_i, and the mean and standard deviation of the posterior distribution of that GP model as μ_i(x_i) and σ_i(x_i), respectively. Then, the probability of satisfying Eq. (11) can be expressed as

∏_{i=1..n} p(y_lower,i ≤ y_i(x_i) ≤ y_upper,i).   (12)

Furthermore, Iwama and Kaneko have proposed a standardized multi-objective acquisition function in a recent work [34]. The differences between Eq. (12) and their method are twofold: (i) the logarithm is used to prevent the acquisition function from becoming extremely small, and (ii) min-max normalization is applied to balance the scale of each p(y_lower,i ≤ y_i ≤ y_upper,i). Their standardized acquisition function is denoted as

Σ_{i=1..n} [ln p(y_lower,i ≤ y_i(x_i) ≤ y_upper,i) − ln p_min,i] / [ln p_max,i − ln p_min,i],   (13)

where p_max,i and p_min,i are the maximum and minimum values of p(y_lower,i ≤ y_i(x_i) ≤ y_upper,i) over the known data set, respectively. In this paper, the acquisition functions expressed by Eqs. (12) and (13) are named the 'unstandardized multi-objective probability that values fall within target ranges (PTR)' and the 'standardized multi-objective PTR', respectively.
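Both multi-objective variants can be sketched as follows, assuming each single-objective PTR has already been evaluated per candidate. The standardized form follows the log plus min-max-normalization idea of Iwama and Kaneko [34], though the exact normalization written here is our reading of their scheme.

```python
import numpy as np

def unstandardized_mo_ptr(p_list):
    """Product of the single-objective PTR values of each property."""
    return np.prod(np.stack(p_list), axis=0)

def standardized_mo_ptr(p_list, p_known_list, eps=1e-300):
    """Sum over properties of min-max-normalized log PTR values, where
    the bounds p_min,i and p_max,i come from the known data set.
    eps guards against log(0) for vanishing probabilities."""
    score = np.zeros_like(np.asarray(p_list[0], dtype=float))
    for p, p_known in zip(p_list, p_known_list):
        log_min = np.log(np.min(p_known) + eps)
        log_max = np.log(np.max(p_known) + eps)
        score += (np.log(np.asarray(p) + eps) - log_min) / (log_max - log_min)
    return score
```

Both scores rank a candidate higher when every per-property PTR is large; the standardized version additionally rescales each property's contribution to a comparable range.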

Descriptors
To construct the GP regression model, we considered 185 compositional descriptors implemented in the matminer code [37]. Details of these descriptors can be found in Table S1 in the Supplemental Materials.
In general, the relevant descriptors depend greatly on the material property (e.g. the electronic dielectric constants of stable oxides can be accurately predicted using only compositional descriptors, while the prediction of ionic dielectric constants is greatly improved by the addition of structural descriptors [38]). It is therefore important to use different descriptors depending on the material property in order to realize better predictions. The number of descriptors often reaches several tens or one hundred [12,14,15,17,27]. In some cases, especially when crystal structures are available in advance, even several hundred descriptors can be considered [38][39][40][41]. However, the use of such large numbers of descriptors, which increases the dimension of the search space, may cause overfitting of the surrogate model and complicate the search. Therefore, in each iteration step of the adaptive sampling, we select 10 descriptors based on the permutation importance in random forest regression [42,43]. This selection is conducted using the scikit-learn library [44]. The effect of this pruning method is evaluated using the dataset described below and discussed in Section S2 in the Supplemental Materials. An overview of our adaptive sampling method is illustrated in Figure 1.
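The descriptor pruning step can be sketched with scikit-learn as follows. This is an illustration of the procedure only; the forest size and number of permutation repeats are assumptions, not the values used in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

def select_descriptors(X, y, n_keep=10, random_state=0):
    """Return the column indices of the n_keep descriptors with the
    highest permutation importance under a random forest regressor."""
    rf = RandomForestRegressor(n_estimators=100, random_state=random_state)
    rf.fit(X, y)
    imp = permutation_importance(rf, X, y, n_repeats=5,
                                 random_state=random_state).importances_mean
    # Indices of the n_keep largest mean importances, in ascending order.
    return np.sort(np.argsort(imp)[::-1][:n_keep])
```

In the adaptive sampling loop, this selection would be re-run on the currently observed data at every iteration, so the retained descriptor set can change as more materials are acquired.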

Data set
To assess the performance of our methods compared with the conventional BO, we used our in-house database. The database is constructed from first-principles calculations of the formation energies (E_f) and band gaps (E_g) of nitrides, pnictides and related substances that are selected using the following criteria (hereafter, we call these materials nitrides and pnictides for simplicity, including those with uncommon stoichiometries). First, we collected candidate materials from the Materials Project database [5]. We selected the candidate substances satisfying the following conditions: (1) substances include at least one of N, P, As, Sb and Bi, whose oxidation number is not necessarily −3; (2) substances include none of H, O, halogens, noble gases, lanthanoids (except La and Ce) or actinides; (3) substances are not simple (elementary) substances; and (4) substances with space group P1 and/or containing more than 40 atoms in the primitive unit cell are excluded due to their high computational costs and/or ambiguities in the crystal structures. The total number of candidates for the first-principles calculations was 5838.
To obtain the formation energies and band gaps of these candidate materials, we used the workflow shown in Figure 2. The computational details of the rough structure optimization can be found in the Supplemental Materials, while those of the strict optimization of structure and k-mesh density are the same as in our previous work on the construction of an oxide database [38], except for the force convergence criterion. We employed the projector augmented-wave (PAW) method [45] as implemented in the VASP code [46,47] for the first-principles calculations. The structure optimizations were performed using 1.3 times the largest default plane-wave cutoff energies for the PAW datasets (ENMAX). The formation energies were estimated for the optimized structures with a cutoff energy of 520 eV. The residual atomic forces and stresses of the rough structure optimization are less than 0.2 eV/Å and 6 GPa, while those of the strict optimization are less than 5 meV/Å and 0.1 GPa, respectively. The formation energies and the relaxed structures were obtained using the Perdew−Burke−Ernzerhof functional tuned for solids (PBEsol) [48] with Hubbard U corrections [49,50], while the band gaps were obtained using the Heyd−Scuseria−Ernzerhof (HSE06) hybrid functional [51,52], since it substantially improves the band gaps of semiconductors including nitrides and pnictides over local and semilocal functionals [3,8,53]. The employed effective U parameters are given in Table S2 in the Supplemental Materials. The spglib code [54] was used to symmetrize the structures and identify their space groups, while the seekpath code [55] was used to generate band paths for the evaluation of the minimum band gaps along the high-symmetry lines in reciprocal space. The VASP input files were generated by the vise code [56]. To automate and control the computational workflow, we used in-house programs, which rely on the Pymatgen [57], Fireworks [58], Custodian [57] and Atomate [59] codes.
After structure optimization, we removed 183 substances whose crystal structures are duplicates of others. The duplication was judged in terms of the structure matching implemented in the Pymatgen code [57] and the space group. We also discarded 1115 substances that are not stable within the same composition and 2345 substances that show no band gap when adopting PBEsol(+U). The calculations of 70, 27 and 80 substances failed because of errors in the structure optimization, the PBEsol(+U) band structure calculations and the HSE06 calculations, respectively. The number of surviving substances was 2018, and we used them to evaluate the performance of the adaptive sampling methods. The distribution of the data entries can be seen in Figure 3.

Target materials and conversion functions for the conventional BO
To evaluate the performance of our method for a single target property compared with the conventional BO, we investigate the following three cases: (i) the target materials hold E_g within the range of 1.06-1.51 eV, which yields photovoltaic energy conversion efficiencies higher than 32% at the Shockley-Queisser limit [60,61] and is therefore relevant to the search for photoabsorber materials; (ii) the target materials have E_g ranging from 2.18 to 2.50 eV, which is desired to solve the 'green gap problem' in light-emitting diode technology [62]; and (iii) the target materials have E_f not higher than −1 eV/atom, as a simple criterion of stability. We consider two cases with multiple conditions: one is the search for substances that satisfy both conditions (i) and (iii), and the other for those satisfying both (ii) and (iii). The number of target materials in our database for each case is shown in Table 1.
To apply the conventional BO to the above purpose, we used several conversion functions following previous works [23,24,26]. In the conventional BO, the converted value is used as the objective y for the GP, and the conventional PI and EI are adopted as acquisition functions. In the case of E_g, we use the 'absolute of difference', 'Gaussian' and 'quadratic' conversion functions. Writing E_g,c = (E_g,lower + E_g,upper)/2 for the centre of the target range, they take the forms

y = −|E_g − E_g,c|,
y = exp(−(E_g − E_g,c)² / (2w²)), with the width w set by the target range, and
y = −(E_g − E_g,lower)(E_g − E_g,upper),

respectively, where E_g,lower and E_g,upper are the lower and upper limits of the target E_g range. For E_f, we used the 'negative' function, namely y = −E_f, to conform to the maximization problem. Moreover, we used the 'flatten' conversion function, denoted as

y = −max(E_f, E_f,upper),

where E_f,upper is the upper limit of the target E_f range. These conversion functions are shown in Figure 4.
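The conversion functions can be sketched as below. The exact widths and scalings used in the cited works may differ; this is only an illustration of the qualitative shapes shown in Figure 4, with the Gaussian width assumed to be the half-width of the target window.

```python
import numpy as np

def absolute_of_difference(e_g, lower, upper):
    """Peaks at the centre of the target band-gap window."""
    return -np.abs(e_g - 0.5 * (lower + upper))

def gaussian(e_g, lower, upper):
    """Bell curve centred on the window (width w is an assumption)."""
    c = 0.5 * (lower + upper)
    w = 0.5 * (upper - lower)
    return np.exp(-((e_g - c) ** 2) / (2.0 * w ** 2))

def quadratic(e_g, lower, upper):
    """Positive inside the window, negative outside, maximal at the centre."""
    return -(e_g - lower) * (e_g - upper)

def negative(e_f):
    """Turns minimization of E_f into maximization."""
    return -e_f

def flatten(e_f, upper):
    """Constant inside the target region: its width is 'squashed to zero'."""
    return -np.maximum(e_f, upper)
```

All three band-gap conversions are maximized at the centre of the target range, which is what lets the maximization-oriented PI and EI target an interior window at all.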

Comparison of single-objective acquisition functions
First, we compared the performance of the single-objective PTR and the conventional acquisition functions, namely PI and EI with conversion functions. To conduct the materials exploration simulations, we selected 10 data points randomly as the initial dataset, and ten substances were then selected according to the acquisition function in each step. Figure 5 shows the number of identified target substances against the number of observed substances. As can be seen, the PTR acquisition function identifies target materials more efficiently than the conventional BO in all three cases. The improvement from using the PTR is least noticeable for the exploration of data points holding E_g within the range of 1.06-1.51 eV (Figure 5(a)) and most noticeable in the case of E_f ≤ −1 eV/atom (Figure 5(c)). As mentioned above, if the maximum value of the acquired data greatly exceeds the threshold, the conventional BO prioritizes data points with large prediction uncertainty over data points whose target values are likely to be moderate but satisfy the conditions. Thus, the decrease in performance of the conventional BO can be partly ascribed to the existence of data points whose converted values greatly exceed the threshold, which is related to the width of the target range of the conversion function. Compared with the target ranges of the conversion functions for E_g (Figure 4(a, b)), the 'negative' conversion function for E_f (Figure 4(c)) spans the whole range, and the 'flatten' conversion function, whose target-range width is squashed to zero, improves the performance.
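The simulation procedure described above can be sketched as a batch adaptive-sampling loop. This is a self-contained toy version with a NumPy GP surrogate and the PTR acquisition function; descriptor pruning, hyperparameter optimization and the real database are all omitted, and the batch size and hyperparameters below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def gp_posterior(X_tr, y_tr, X_q, eta=0.1, s=1e-6):
    """Gaussian-kernel GP posterior mean and standard deviation."""
    k = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
                            / (2.0 * eta ** 2))
    K_inv = np.linalg.inv(k(X_tr, X_tr) + s * np.eye(len(X_tr)))
    ks = k(X_q, X_tr)
    mu = ks @ K_inv @ y_tr
    var = np.clip(1.0 - np.sum((ks @ K_inv) * ks, axis=1), 1e-12, None)
    return mu, np.sqrt(var)

def run_screening(X, y, y_lower, y_upper, n_init=10, batch=10, n_steps=3,
                  seed=0):
    """Start from n_init random observations, then repeatedly observe the
    `batch` unobserved candidates with the highest PTR."""
    rng = np.random.default_rng(seed)
    observed = list(rng.choice(len(X), size=n_init, replace=False))
    for _ in range(n_steps):
        mu, sd = gp_posterior(X[observed], y[observed], X)
        acq = norm.cdf((y_upper - mu) / sd) - norm.cdf((y_lower - mu) / sd)
        acq[observed] = -np.inf              # never re-select known points
        observed.extend(np.argsort(acq)[::-1][:batch].tolist())
    return observed
```

On a toy one-dimensional problem with y = x, the loop quickly concentrates its acquisitions on the candidates inside the target window.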
However, even the performance of the 'flatten' conversion function is still worse than that of the PTR. This would be because the 'flatten' function loses the information on how much lower the formation energy is than the criterion, so the surrogate model learns less. As a similar example, for the optimization of a composite function, exploration methods using information on the composite structure of the objective function improve the performance compared with the conventional BO using only the target value [63,64]. While we could improve the performance of the conventional BO by refining the conversion function, the PTR acquisition function works well without such a design of the conversion function. Figure 6 shows the E_f ranges of the substances acquired in the exploration simulation with E_f,upper = −1 eV/atom. Using the PTR, the target substances of each energy range (i.e. E_f ≤ −2 eV/atom, −2 ≤ E_f ≤ −1.75 eV/atom, −1.75 ≤ E_f ≤ −1.5 eV/atom, −1.5 ≤ E_f ≤ −1.25 eV/atom and −1.25 ≤ E_f ≤ −1 eV/atom) are identified at a similar pace (Figure 6(a)). On the contrary, substances with higher formation energies tend to be identified more slowly with the 'negative' conversion function (Figure 6(b)). In this case, the substances having E_f ≤ −2 eV/atom were found at the early stage [see Figure 6(b)], and then the target substances whose E_f is close to the threshold (e.g. −1.5 ≤ E_f ≤ −1.25 eV/atom and −1.25 ≤ E_f ≤ −1 eV/atom) were acquired less efficiently. This result confirms that the conventional BO is appropriate for global optimization but not for our motivation. The 'flatten' conversion function shows an intermediate trend between the other strategies (Figure 6(c)).
Regarding substances with E_f higher than E_f,upper = −1 eV/atom, the substances with lower E_f tend to be obtained early by all strategies, including the PTR. Figure 7 shows the performance of the exploration with various E_f,upper. When E_f,upper is −0.25 eV/atom, the performance difference between any strategy and the random search is relatively small, since most substances meet the criterion and many exploration steps are required to complete the exploration (Figure 7(a)). Setting E_f,upper to −0.5, −0.75 or −1 eV/atom, the improvement of the performance by the PTR is relatively noticeable (Figure 7(b-d)). Lowering E_f,upper to −1.25 or −1.5 eV/atom, the performance improvement becomes smaller (Figure 7(e, f)), and when E_f,upper is −1.75 or −2 eV/atom, some strategies of the conventional BO are even more efficient than the PTR (Figure 7(g, h)). This would be because, when E_f,upper is near the minimum E_f, the discrepancy between E_f,upper and y_max in the conventional BO is alleviated. Additionally, a conventional BO in which the value of y_max is gradually changed should be suitable in such cases. Therefore, the advantage of the PTR is especially great when the threshold is not extreme.

Performance of multi-objective PTR
In this section, we discuss the performance of the exploration for multiple target objectives using the unstandardized and standardized multi-objective PTRs. First, we note that even a single-objective PTR would explore target materials more efficiently than the random search. Therefore, we also show the number of target materials identified by the single-objective PTR as a baseline. The results of these exploration simulations are shown in Figure 8. There is no significant difference between the performance of the unstandardized and standardized multi-objective PTRs. This would be because (1) since the number of our target properties is only two, the acquisition function does not become extremely small even without taking logarithms, and (2) after a few target substances are acquired, the min-max normalization does not change the values greatly, since the minimum and maximum values of the single-objective PTRs are nearly 0 and 1, as σ_i(x_i) is generally small for the known data set.
At the early stage, both multi-objective methods improve the efficiency. The performance of both the unstandardized and standardized multi-objective PTRs becomes worse than that of the single-objective PTR of only E_f in the case of 1.06 eV ≤ E_g ≤ 1.51 eV and E_f ≤ −1 eV/atom after most of the target data points are identified, while that for 2.18 eV ≤ E_g ≤ 2.50 eV and E_f ≤ −1 eV/atom is still better at the last stage than the single-objective PTRs of both E_g and E_f. The deterioration caused by using E_g in the former case can be ascribed to the following two reasons. (1) Accurate prediction of E_g would be more difficult than that of E_f. Indeed, when performing ten-fold cross-validation over the whole dataset, pruning descriptors and constructing a GP regression model in the same way as in the exploration simulations, the coefficients of determination (R²) for E_g and E_f are 0.65 and 0.81, respectively. (2) The ratio of the remaining target substances to the remaining substances satisfying the E_g condition is small at the late stage. This means that most of the remaining substances satisfying the E_g condition are not targets because of E_f, and thereby adaptive sampling focusing on E_g would not improve, or would even impede, the exploration performance. As shown in Figure 9(a), this ratio for E_g becomes extremely small at the late stage of the exploration compared with that for E_f in the case of 1.06 eV ≤ E_g ≤ 1.51 eV and E_f ≤ −1 eV/atom. On the other hand, for 2.18 eV ≤ E_g ≤ 2.50 eV and E_f ≤ −1 eV/atom, the ratios for E_g and E_f [Figure 9(b)] are about the same at the late stage. These problems could be remedied if (1) the prediction accuracy of E_g were improved by developing better descriptors and/or increasing the initial dataset size, and (2) each single-objective PTR were weighted based on the estimated ratio of target substances in the remaining data.

Conclusion
We have developed an exploration method for substances whose target properties meet specific criteria, based on the PTR acquisition function. First, we compared the performance of the single-objective PTR and the conventional BO by conducting exploration simulations using our in-house computational database and found that the PTR identifies target materials more efficiently when the criterion of the target property is not extreme. Furthermore, we conducted multi-objective exploration simulations using the unstandardized and standardized multi-objective PTRs. Both multi-objective PTRs improve the performance unless the ratio of the target substances to the substances satisfying one of the given conditions is extremely small.