Hybrid algorithm of Bayesian optimization and evolutionary algorithm in crystal structure prediction

ABSTRACT We propose a highly efficient searching algorithm in crystal structure prediction. The searching algorithm is a hybrid of the evolutionary algorithm and Bayesian optimization. The evolutionary algorithm is widely used in crystal structure prediction, and the Bayesian optimization is one of the selection-type algorithms we have developed. We have performed simulations of crystal structure prediction to compare the success rates of the random search, evolutionary algorithm, Bayesian optimization, and hybrid algorithm for up to ternary systems such as Si, Y2Co17, Al2O3, and CuGaS2, using the CrySPY code. These results demonstrate that the evolutionary algorithm can generate structures more efficiently than random structure generation, and the Bayesian optimization can efficiently select potential candidates from a large number of structures. Moreover, the hybrid algorithm, which has the advantages of both, is proved to be the most efficient searching algorithm among them. GRAPHICAL ABSTRACT


Introduction
The search for new materials is the key to the advancement of science technology. With the development of computational capabilities, it has become possible to predict the stable crystal structure for given chemical composition. Crystal structure prediction (CSP) has been actively conducted in conjunction with firstprinciples calculations for the last decades. The history of CSP methodology and new material discovery is well summarized in the review paper [1]. To date, a lot of effort has been put into developing searching algorithms, such as random search (RS) [2][3][4][5], simulated annealing [6,7], minima hopping [8,9], evolutionary algorithm (EA) [10][11][12][13], and particle swarm optimization (PSO) [14,15]. RS is a quite simple algorithm and widely used, and EA as implemented in the USPEX code [11][12][13] is one of the most popular algorithms today. In EA, genes of low-energy structures of the parent generation are inherited to the structures of the offspring generation, which allows for more efficient structure generation.
Although significant advances in terms of structure generation have been made over the years, CSP simulations are computationally expensive and quite difficult. Local structure optimization and energy evaluation by first-principles calculations are time-consuming in CSP.
The development of more efficient searching algorithms to reduce computational costs is eagerly demanded. It is interesting to note that recent attempts have been made to reduce the computational cost of structure optimization by using machine learning potential instead of first-principles calculations [16][17][18][19][20]. In our previous study, we proposed a unique method for efficiency improvement using selection-type algorithms [21][22][23][24].
In the conventional methods such as RS and EA, all generated structures have to be locally optimized one by one in order. In the selection-type algorithms, however, we can efficiently select good candidates for priority optimization using machine learning. Our selectiontype algorithms can search for the most stable structure among a large number of structures with a small number of trials. We have proposed a searching algorithm accelerated by Bayesian optimization (BO), which is a well-established technique for black-box optimization [25,26], as one of the selection-type algorithms. Our CSP simulations with BO demonstrated a reduction of the number of searching trials required to find the global minimum structure by 30-40% in comparison with RS [21].
This BO scheme is highly scalable and was expected to be further developed. Although BO currently depends on random structure generation, structure generation by EA can be integrated into our BO scheme. We have developed an open-source CSP software called CrySPY and implemented RS, EA, and BO [24]. In this study, we propose a highly efficient searching algorithm which is a hybrid algorithm of BO and EA (BO-EA). We perform the CSP simulations by RS, EA, BO, and BO-EA in a variety of materials such as semiconductors, intermetallic compounds, and ionic crystals using CrySPY. Specifically, we investigate the success rates in finding the most stable structure for the systems of Si of which the unit cell is composed of 16 atoms (hereinafter notated as Si 16 ), Si 32 , Y 2 Co 17 , Al 8 O 12 , and Cu 4 Ga 4 S 8 , and discuss the efficiency of these searching algorithms.

Random search
RS is a quite simple searching algorithm. In our method, initial structures with a specific chemical composition are randomly generated at the beginning, and local structure optimization is performed one by one using first-principles or classical interatomic potential codes. CrySPY can generate random structures with specific space groups using PyXtal [27] internally. In the random structure generation, a space group is randomly selected and the lattice parameters are automatically set to appropriate values depending on the number of atoms.
For the systems of Si 16 and Si 32 , we used the soiap code [28] with Stillinger-Weber potential [29] to evaluate the energy and to carry out the local structure optimization. For the systems of Y 2 Co 17 , and Cu 4 Ga 4 S 8 , the VASP code [30][31][32][33] was employed, and the QUANTUM ESPRESSO [34][35][36] with pseudopotentials from PSlibrary [37] was also used for the Al 8 O 12 system. Except for the two Si systems, the total-energy calculations and the local structure optimization were performed using the density functional theory (DFT) with the projector-augmented wave (PAW) method [38,39]. The generalized gradient approximation (GGA) by Perdew, Burke, and Ernzerhof [40] was used for exchange-correlation functional. Cutoff energies of 335, 639, and 369 eV for the plane-wave expansion of the wave function were adopted for Y 2 Co 17 , Al 8 O 12 , and Cu 4 Ga 4 S 8 , respectively. k-point meshes were automatically generated in CrySPY with pymatgen [41]. The k-point mesh densities of 100, 80, and 80 Å −3 were used for Y 2 Co 17 , Al 8 O 12 , and Cu 4 Ga 4 S 8 , respectively. In the DFT calculations, Y 2 Co 17 was treated as a ferromagnet. For other algorithms described in later sections, the energy calculations and the local structural optimization were performed in the same way as RS.
In the CSP simulations, we searched 100 structures by RS, except for the Si 32 system. Because of the large number of atoms in the Si 32 system, we generated 300 random structures.

Evolutionary algorithm
EA can generate potentially stable structures more efficiently than RS. In EA a population of structures is evolved, driven by survival of the fittest. Some lower-energy structures can become parents of a new generation of structures, and their offspring are generated by evolutionary operations such as crossover, permutation, and strain. The new generation is composed not only of structures produced by the evolutionary operations but also of random structures and structures taken over by the elite selection. We have implemented EA in CrySPY and reported details of the implemented operations and parent selection methods [24].
Except for the Si 32 system, the number of the population was set to 20, and the structure search was conducted up to five generations. In other words, a 100 structure search was also performed in EA. For Si 32 , we used the population of 30 and 10 generations, i.e. 300 structures. The tournament selection was used to select the candidates for parents. The parameters required for the EA used in the CSP simulations are summarized in Table 1. Here, the crossover, permutation, and strain represent the number of structures generated by these evolutionary operations, and their sum including the number of random structures equals the population. The permutation is useless for single element crystals such as Si. The fittest means the number of surviving structures that can be candidates to produce offspring. See our previous study [24] for details of the parameters.

Bayesian optimization
BO is used in combination with the random structure generation in CrySPY. A large number of initial structures are randomly generated at the beginning, and some structures are randomly selected and locally optimized to prepare the first training data. Then, the potential candidates can be efficiently selected by BO. We can efficiently search for the most stable structure among the initial structures by repeating this procedure. The details of the BO procedures are described elsewhere [21,22,24,42].
In the CSP simulations, we employed Thompson sampling [43] as an acquisition function. F-fingerprint of Oganov and Valle [44], which is based on radial distribution function, was used as the structure descriptor. Here, minimum and maximum cutoffs of 0.5 and 5.0 Å, and the smearing parameter of 1.0 Å for radial distribution function were used. Except for the Si 32 system, 300 structures were randomly generated at the beginning, and the search was conducted with learning and selection in every five trials until 100 structures were reached. For Si 32 , 600 initial structures were generated, and the learning and selection were performed in every 10 trials until 300 structures. The advantage of BO is that many structures can be treated as candidates.

Hybrid algorithm of BO and EA
So far, BO has been used in combination with random structure generation. Now that EA can be used with CrySPY, we can use a hybrid method, BO-EA where the structures are generated by EA and selected by BO. In BO-EA, the simulations proceed basically in the same way as BO, however, in the middle of the simulation, EA is used to increase the number of candidates.
Except for the Si 32 system, only 100 structures were randomly generated at the beginning. When the search reached 50 structures, 100 structures were added by EA as the second generation. When the search progressed further and reached 75 structures, EA added 100 structures again (third generation), making a total of 300 candidates. This means that we have selected 100 structures out of 300 candidates in the end. For the Si 32 system, 200 structures were generated at first and added another 200 structures (second generation) by EA when the search reached 100 structures. Moreover, 200 structures were also generated (third generation) by EA when the search reached 200 structures. We have searched a total of 300 structures in this system as with the other algorithms.
The EA parameters used in BO-EA are listed in Table 2. When adding structures by EA, the parent was selected by the survival of the fittest and tournament selection from all optimized structures. At this time, we did not add random structures as we have already prepared enough random structures at the beginning. Since the number of optimized structures was different, different values for the fittest and tournament size were used in producing the second and third generation.

Si
Needless to say, the most stable structure is the diamond structure for Si. Searching efficiencies of RS, EA, BO, and BO-EA have been investigated in the two systems of Si 16 and Si 32 . The former is a relatively easy system to find the most stable structure, and the latter is slightly difficult. The CSP simulations described above were performed to verify whether the diamond structure was obtained or not. To statistically investigate searching efficiencies, we repeated these simulations 50 times in each algorithm and evaluated the success rates of obtaining the stable structure. Figure 1 shows the success rates as a function of the number of search trials for the Si 16 system. When the number of search trials eventually reached 100 structures, for example, RS was able to find the stable structure in 32 out of 50 simulations, giving a success rate of 64%. The final success rates were 60%, 68%, and   Generations  5  10  5  Population  20  30  20  Crossover  12  20  10  Permutation  0  0  4  Strain  6  8  4  Random  2  2  2  Elite  2  3  2  Fittest  10  20  10  Tournament size  2  3  2 82% for EA, BO, and BO-EA, respectively. We found no clear difference among the success rates of RS, EA, and BO. Random structure generation is sufficiently efficient to find the most stable structure in this easy system. The final success rates of both RS and BO were comparable, however, BO was superior to RS at 20 to 40 search trials. This indicates that BO can often find the stable structure in a small number of trials by machine learning. The highest success rate was observed for BO-EA. We can see that the success rate is increasing, especially after adding the structure by EA (the red vertical dashed lines in Figure 1). To be more precise, we can see a significant improvement in the success rate about 10 trials after adding the structure by EA. When structures were added by EA, the number of unexplored regions increased. It was necessary to search the unexplored regions immediately after the structures were added. As the search progressed, the number of unexplored regions decreased, and stable structures were reached in many cases. BO-EA is the best because it was possible to generate as many as 200 structures by EA, while pure EA only treated 100 structures including some random structures. Figure 2 illustrates the success rate for the Si 32 system, drawn in the same way as in Figure 1. Note that we carried out a 300-structure search since this system has a large number of atoms, making it difficult to find the stable structure. The success rate of RS was lower than others because the system became difficult. Although the success rate of BO was better than that of RS, BO was not as high as EA. This is because the structure generation in pure BO relies on random structures. Random structure generation is no longer efficient in this system, on the other hand, the success rate of EA was higher than RS. Structure generation by EA is efficient in complex systems with a large number of atoms. The algorithm with the highest success rate was again BO-EA. These results show the BO-EA is superior to EA and BO since BO-EA has the advantages of both EA and BO.

Binary systems
Here we discuss the searching efficiency of the intermetallic compound Y 2 Co 17 and the ionic crystal Al 8 O 12 . The stable crystal structure of compounds with a 2:17 ratio of Y to Co is known as the Th 2 Ni 17 structure (space group P6 3 =mmc) [45]. The primitive cell contains 38 atoms. Th 2 Zn 17 structure (space group R � 3m) is also known for Y 2 Co 17 [46], and its primitive cell consists of 19 atoms. The target structure in this work was the Th 2 Zn 17 structure since the number of atoms was limited to 19 atoms in the CSP simulations. The success rates in each algorithm for the Y 2 Co 17 system are shown in Figure 3. Although the number of atoms in this system is not so different from that in   Si 16 , the doubling of the number of atomic species makes the CSP difficult, and the success rate of RS was reduced. The success rate of RS was inferior to other algorithms, and the probability of reaching the stable structure by random structure generation only was low. On the other hand, the success rate of BO was clearly higher than that of RS, which is evidence that machine learning worked well. The success rate of EA was better than RS as well as BO. Structure generation by EA showed a high likelihood of reaching the stable structure. The success rate of BO-EA, which allows structure generation by EA and efficient selection by machine learning, was still the highest.
The stable structure of Al 2 O 3 is known to be the corundum structure, of which primitive cell is composed of four Al and six O atoms. However, the Al 4 O 6 system is quite easy to find the stable structure, and not suitable for investigating the success rates. Therefore we adopted the Al 8 O 12 system. Figure 4 exemplifies one of the stable structures of Al 8 O 12 , drawn by VESTA [47]. With the doubling of the number of atoms, the stable structure can only be achieved in a slightly unnatural and elongated cell, such as a (1 � 1 � 2) supercell. It was quite difficult for RS to reach such a stable structure with slightly unnatural cells, and the success rate of RS was 0%, as shown in Figure 5. Since random structure generation no longer works in this system, the success rate of BO has also dropped to 0%. However, EA was able to reach such a difficult structure although the success rate was low. With the efficient structure generation of EA, the success rate of BO-EA increased after the addition of candidates by EA and finally became the highest. Considering that the success rate is 0% until 50 search trials in BO-EA, the success rate would be much higher if we added the candidates by EA a little earlier. These results demonstrate that EA is further increasing the efficiency of the search with BO.

Ternary system
BO commonly has a problem in high-dimensional search space [48,49]. Selecting good candidates by BO in multi-specie systems becomes more difficult since the dimensions of the structure descriptors increase. For example, since the number of pairs of atomic species in a ternary system is twice as large as in a binary system, the dimension of the descriptors in the F-fingerprint is also twice as large. In this work, the descriptors for BO were 60-dimensional in the binary systems and 120-dimensional in the ternary system. To investigate the applicability of BO selection to ternary systems, the CSP simulations were performed for CuGaS 2 .
The stable structure of CuGaS 2 is a chalcopyrite structure as shown in Figure 6. The primitive cell is composed of Cu 2 Ga 2 S 4 , and if we ignore the atomic species, the atomic arrangement itself is equivalent to the diamond or zinc blende structure. We performed the CSP simulations in the Cu 4 Ga 4 S 8 system (16 atoms), taking into account the comparison with the Si 16 system. Figure 7 shows the success rates of Cu 4 Ga 4 S 8 in each algorithm. In RS, it was relatively easy to obtain a structure with the same  atomic configuration as the zinc blend structure by ignoring the atomic species of the cations (Cu and Ga). However, we could not reproduce the perfect configuration of the chalcopyrite by RS, leading to a success rate of 0%. This is because the number of possible structures explosively increases with three atomic species, and only 100 random structures could not produce good candidates. The success rate of pure BO was also 0% since random structure generation could not reach the chalcopyrite structure. On the other hand, the chalcopyrite structure was generated with a probability of 12% using EA. Furthermore, the success rate of BO-EA was as high as 26%. These results indicate that although the probability of generating the chalcopyrite structure is not high, it can be preferentially selected by BO as long as the initial structure is included among the candidates. The selection by BO efficiently worked even in the ternary system, and BO-EA proved to be the best among these four searching algorithms.

Conclusion
We performed the CSP simulations to evaluate the success rates of RS, EA, BO, and BO-EA for the Si 16 , Si 32 , Y 2 Co 17 , Al 8 O 12 , and Cu 4 Ga 4 S 8 systems using CrySPY. Except for the simple system of Si 16 , the success rates of EA were completely higher than those of RS. Although the success rates of RS and BO in the Al 8 O 12 and Cu 4 Ga 4 S 8 systems were 0%, the success rates of BO were higher than those of RS in the other systems. EA can generate structures more efficiently than RS, and BO can select potential candidates from a large number of structures. The success rates of BO-EA, which has features of both BO and EA, were the highest in all systems. BO was concerned that it would not work well in high-dimensional space. However, the selection by BO efficiently worked even in the ternary system. In the CSP of this study, we were not able to verify the dependence on many parameters such as the number of trials, the number of structures to be added in EA, and dimension of structure descriptors, which will be future work. These results demonstrate that the hybrid algorithm of BO and EA proved to be the most promising algorithm among them.

Acknowledgments
Part of the computation in this work was conducted using the facilities of the Supercomputer Center in the Institute for Solid State Physics, and the Fujitsu PRIMERGY CX400M1/ CX2550M5 (Oakbridge-CX) in the Information Technology Center, The University of Tokyo.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work was supported by Materials research by Information Integration Initiative (MI 2 I) project of the Support Program for Starting Up Innovation Hub from Japan Science and Technology Agency (JST). It was also Figure 6. Conventional cell of chalcopyrite structure, stable structure of CuGas 2 , drawn by VESTA. Blue, light green, and yellow atoms represent Cu, Ga, and S atoms, respectively. The solid lines display the conventional unit cell, including Cu 4 Ga 4 S 8 . The primitive cell is composed of Cu 2 Ga 2 S 4 . If we ignore the atomic species, the atomic arrangement itself is equivalent to the diamond or zinc blende structure.