Applications and training sets of machine learning potentials

ABSTRACT Recently, machine learning potentials (MLPs) have been attracting interest as an alternative to computationally expensive density-functional theory (DFT) calculations. The data-driven approach in MLPs requires carefully curated training datasets, which define the valid domain of the simulations. Therefore, acquiring training datasets that comprehensively span the domain of the desired simulations is important. In this review, we attempt to set guidelines for the systematic construction of training datasets according to target simulations. To this end, we extensively analyze the training sets in previous literature according to four application types: thermal properties, diffusion properties, structure prediction, and chemical reactions. For each application, we summarize characteristic reference structures and discuss specific parameters for DFT calculations as well as MD conditions. We hope this review serves as a comprehensive guide for researchers and practitioners aiming to harness the capabilities of MLPs in material simulations.

IMPACT STATEMENT This review reports on the selection of training sets for machine learning potentials tailored to their specific applications, which is currently not standardized in this rapidly evolving field.


Introduction
Computational materials science is a fast-developing field that aims to predict material properties accurately and to discover novel materials with unique features [1]. Recently, a paradigm shift has been occurring, as traditional methods like ab initio calculations are increasingly being supplemented by the power of data-driven approaches [2]. The most notable example is the machine learning potential (MLP) [3], in which the total energy and atomic forces are predicted by a machine learning model trained on ab initio results for reference structures.
While ab initio methods based on density functional theory (DFT) [4-6] yield robust and reliable results across diverse systems, they encounter a scalability challenge, namely O(N³) where N is the number of atoms in the simulation box. This constraint hinders the exploration of larger time and length scales. In contrast, the computational cost of MLPs scales linearly with N owing to quantum locality [7], which allows for large-scale simulations while maintaining prediction accuracy at the DFT level.
Most MLPs are premised on the assumption that the total energy can be subdivided into atomic energies, which should be invariant under rotations, inversions, and permutations of atoms of the same type [7]. Under this condition, a variety of descriptor- or graph-based MLPs have been continually developed [8-13]. Because of the interpolative nature of machine learning, it is critical to prepare training sets that adequately, but not excessively, cover the configuration space of the target simulation (see Figure 1(a)). The difficulty of collecting training sets varies depending on the target properties and material states, as schematically shown in Figure 1(b). For instance, thermal properties of crystalline phases are related to the local vibrations of atoms around their equilibrium positions. Sampling these vibrations is relatively straightforward, as molecular dynamics (MD) simulations automatically cycle through pertinent configurations, satisfying ergodicity. Sampling the liquid phase is also straightforward because ergodicity is rapidly achieved due to fast diffusion and equilibration. However, more care is necessary when developing MLPs for amorphous materials, surfaces, or interfaces. Since these structures comprise a multitude of atomic configurations, a single type of training set is insufficient. The main challenge then becomes identifying the multiple, potentially unknown, configurations that should be included in the training set. The situation is further complicated for chemical reactions with high activation barriers, which straightforward MD simulations may not sample due to time constraints.
In this review, to assist practitioners of MLPs, we outline the process of curating training sets based on the objectives of the simulations. The application area of MLPs is ever increasing, and we limit the discussion here to the following four subjects, roughly in order of simulation complexity: thermal properties, diffusion properties, structure prediction and phase transitions, and chemical reactions at surfaces and interfaces. Below, we first briefly discuss the types of present MLPs and the basics of constructing training sets in Sections 2 and 3, respectively. Then, in Section 4, we categorize and analyze various training sets to highlight the typical sampling methods. Finally, we summarize and provide a short perspective on this developing field.

Types of machine learning potentials
Every MLP is based on the assumption that the total energy (E_tot^DFT) is the sum of the atomic energies. These atomic energies depend on the local environments and are inferred by an ML model:

E_tot^DFT ≈ Σ_i E_i^ML({r_j − r_i : j ∈ N(i)}),    (1)

where r_i, E_i^ML, and N(i) are the position vector, the atomic energy predicted by the MLP, and the neighborhood of the ith atom, respectively. Given that the choice of the frame of reference and the indexing order of the atoms are arbitrary, the atomic energy must remain invariant under Euclidean transformations (translation, rotation, and inversion) and under permutation of atomic indices. Therefore, ensuring invariance is a crucial consideration in designing MLPs.
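As a concrete illustration, the decomposition and its invariance requirements can be checked numerically with a toy pairwise atomic-energy model; this is a minimal sketch standing in for E_i^ML, not any published MLP:

```python
import numpy as np

def atomic_energy(i, positions, r_cut=3.0):
    """Toy stand-in for E_i^ML: depends only on neighbor distances
    within r_cut, so it is invariant to rotations and translations."""
    e = 0.0
    for j in range(len(positions)):
        if j == i:
            continue
        d = np.linalg.norm(positions[j] - positions[i])
        if d < r_cut:
            e += 0.5 * (d - r_cut) ** 2  # smooth short-range pair term
    return e

def total_energy(positions):
    """Equation (1): the total energy is the sum of atomic energies."""
    return sum(atomic_energy(i, positions) for i in range(len(positions)))

pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.2, 0.0]])
theta = np.deg2rad(30.0)
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])

# The total energy is unchanged by rotation and by permuting atom indices.
assert abs(total_energy(pos) - total_energy(pos @ rot.T)) < 1e-12
assert abs(total_energy(pos) - total_energy(pos[[2, 0, 1]])) < 1e-12
```

Because the model depends only on interatomic distances, invariance holds by construction; real descriptors achieve the same guarantee in more expressive ways.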
There are two types of MLPs: descriptor-based and message-passing (graph) types. A descriptor-based MLP consists of a descriptor and a regression model. Descriptors are designed to represent the invariant local environments of atoms, while regression models are used to fit the potential energy surface (PES). Well-known examples are the neural network potential (NNP) [14,15], Gaussian approximation potential (GAP) [9], moment tensor potential (MTP) [10], and spectral neighbor analysis potential (SNAP) [16]. There are two popular descriptor-based NNPs: the Behler-Parrinello-type neural network (BPNN) [17] and the deep potential in DeepMD [15]. BPNN uses atom-centered symmetry functions that encode invariant geometries such as distances and angles to represent the local environments of atoms. SIMPLE-NN [18] is one of the packages supporting BPNN. The descriptor of the DeepMD potential is computed from the local coordinate frame. GAP is a kernel-based model that uses the smooth overlap of atomic positions as a descriptor and employs Gaussian process regression to interpolate atomic energies. MTP and SNAP are both linear regression models, but they differ in their choice of descriptors. MTP employs moment tensors composed of radial polynomials and angular parts built from Kronecker products of edge vectors. The moment tensors undergo special tensor contractions to obtain many-body polynomial descriptors (or basis functions) that satisfy invariance. In contrast, SNAP utilizes bispectrum coefficients of the neighbor density as descriptors.
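For instance, a minimal version of one such atom-centered symmetry function, the radial G² of the BPNN family, can be sketched as follows (the parameter values are illustrative, not tuned for any system):

```python
import numpy as np

def cutoff(r, r_c):
    """Cosine cutoff function that smoothly decays to zero at r_c."""
    return 0.5 * (np.cos(np.pi * r / r_c) + 1.0) if r < r_c else 0.0

def g2(i, positions, eta=0.5, r_s=0.0, r_c=6.0):
    """Radial symmetry function G2 of atom i: a sum of Gaussians over
    neighbor distances, invariant to rotation and atom permutation."""
    g = 0.0
    for j in range(len(positions)):
        if j == i:
            continue
        r = np.linalg.norm(positions[j] - positions[i])
        g += np.exp(-eta * (r - r_s) ** 2) * cutoff(r, r_c)
    return g

pos = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 2.0, 0.0]])
theta = np.deg2rad(45.0)
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
# G2 is built from distances only, so it is unchanged by rotation.
assert abs(g2(0, pos) - g2(0, pos @ rot.T)) < 1e-12
```

In practice, a set of such functions with different eta and r_s values, plus angular counterparts, forms the input vector fed to the atomic neural network.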
The second type is message-passing MLPs, which utilize graph-based inputs where atoms are represented as nodes and bonds as edges. Each node feature is iteratively updated using information from its connected atoms via the application of message-passing layers. Through the training process, the node feature is refined to capture the corresponding atomic environment. The resulting node feature is subsequently processed through a neural network to derive the atomic energy.
Message-passing MLPs can be divided into two categories based on the mathematical properties of the node features. The first category is invariant-type message-passing MLPs, which utilize scalar features that remain invariant under Euclidean transformations. Examples of this category include SchNet [11] and DimeNet [12], which use invariant geometrical information, such as distances and angles between atoms, to construct messages.
The second type is equivariant message-passing MLPs. In these models, when the input graph undergoes a Euclidean transformation, the node features also transform in accordance with the corresponding symmetry operation. Equivariant models can generate a broader functional space than invariant models because invariance can be considered a special case of equivariance. For example, NequIP [13] employs the tensor product between spherical harmonics and decomposition using Clebsch-Gordan coefficients to obtain features that are equivariant under rotation and inversion. During this process, information is exchanged between equivariant and invariant features. After several message-passing layers, the invariant feature is selected to calculate the atomic energy.
Node features used in message-passing MLPs contain both atomic and geometric information. These features are then processed through a neural network to derive the final energy. This process bears some similarity to the use of descriptors in NNPs. However, there are significant differences between the two. For example, descriptors in BPNN are computed statically with fixed parameters, relying on geometrical features within a predetermined cutoff radius. On the other hand, message-passing MLPs dynamically extract essential geometrical features during the training process. Furthermore, they can access features beyond the initial cutoff through the message-passing mechanism.

Fundamentals of training set construction
For cost-effectiveness, the construction of training sets leverages the locality assumed in the atomic energy (see Equation (1)). For instance, when preparing training sets to simulate a large nanoparticle as depicted in Figure 2, providing DFT data for structures that resemble the local environments is sufficient [19-21]. More specifically, the interior of the nanoparticle can be sampled using the bulk structure, while the face region can be adequately represented by a slab model. For the corner and edge regions, a cluster can be created by slicing the nanoparticle. By decomposing nanoparticle structures in this way, it becomes possible to represent nanoparticles of any size using three smaller parts that can be computed within DFT.
In preparing the training set, one should be wary of the possibility of 'ad hoc' energy mapping, in which the total energy evaluated by the MLP appears to be correct due to error cancellation, even though the atomic energies are incorrect [7]. This ad hoc energy mapping may lead to failures during MD simulations.
To prevent this, it is advisable to include high-symmetry structures, such as crystalline phases of single elements, although they may not directly appear in MD. In multi-component systems, it is also useful to provide training sets with diverse stoichiometries to obtain reasonable energy offsets among different chemical species.
It is also important to consider the density distribution within the training sets. Suppose that an MLP is tasked with learning the PES around a defect in a bulk material. The defect and nearby atoms are underrepresented in the training sets, while bulk atoms distant from the defect are overrepresented. This imbalance can inadvertently bias MLPs to optimize towards specific configurations (such as bulk and equilibrium states), compromising the accuracy for less represented configurations (like defect or off-equilibrium states). To mitigate this issue, additional weights can be assigned during the training process to those atoms or structures that have low density in the training sets. In Ref. [22], this strategy effectively reduced the errors in defect formation and barrier energies from 9.5% to 2.6%.
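The reweighting idea can be sketched as a per-structure weighted mean-squared-error loss; the weight values below are purely illustrative and do not reproduce the scheme of Ref. [22]:

```python
import numpy as np

def weighted_mse(pred, ref, weights):
    """Weighted MSE loss: rare (e.g. defect) structures receive larger
    weights so the fit is not dominated by abundant bulk configurations."""
    w = np.asarray(weights, dtype=float)
    err = (np.asarray(pred, float) - np.asarray(ref, float)) ** 2
    return float(np.sum(w * err) / np.sum(w))

# Illustrative energies (eV/atom); the last structure is a defect
# configuration that the model currently describes poorly.
pred = [0.10, 0.11, 0.50]
ref  = [0.10, 0.10, 0.30]
bulk_weights   = [1.0, 1.0, 1.0]   # uniform weighting
defect_weights = [1.0, 1.0, 5.0]   # upweight the underrepresented defect

# Upweighting raises the loss contribution of the defect, pushing the
# optimizer to reduce exactly that error.
assert weighted_mse(pred, ref, bulk_weights) < weighted_mse(pred, ref, defect_weights)
```

The same construction applies per atom rather than per structure when the imbalance is within a single cell, as in the defect example above.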
Once target structures are selected, DFT packages such as VASP [23] or Quantum Espresso [24] are used to perform static calculations, relaxations, and ab initio molecular dynamics (AIMD). In particular, AIMD is a versatile tool for sampling a wide region of configurational space, which will be discussed in detail in Section 4. Although AIMD is a favored approach, pure AIMD may encounter difficulties in achieving ergodicity in the presence of rare events. This limitation can be mitigated by alternative approaches such as Monte Carlo (MC)-based methods [21,25-27] and biased MD [28-32], which ignore or surmount high energy barriers. One notable example of a biased MD method is metadynamics in the input feature space (G-metaD) [28]. Unlike other biased MD methods, G-metaD does not require a priori knowledge of the events of interest to build collective variables. It was demonstrated in Ref. [28] for H adsorption on the Pt(111) surface that the MLP trained by G-metaD showed errors of 0.02 eV across all sites, whereas the MLP trained solely on AIMD data exhibited energy errors of 0.1 eV at high-energy sites.
The computational parameters for DFT calculations may vary according to the target properties of interest. For instance, the exchange-correlation functional, the inclusion of van der Waals interactions, the cutoff energy for the plane-wave basis, and the k-point density for the Brillouin-zone integration should be carefully selected to ensure sufficient accuracy in the DFT data. It is also important to provide accurate forces and stress components to reduce training errors by keeping the data accuracy consistent with that of the total energy. Since derivative properties such as forces and stresses usually require tighter DFT parameters, it is recommended to recalculate the DFT data with tight parameters if AIMD is carried out with loose settings.
After an MLP is trained on the initial training set, iterative learning or active learning can be used to augment the training sets. Iterative learning is a feedback loop in which the DFT and MLP energies (or forces) of structures simulated by the MLP are compared, and the configurations with high errors are added to the training sets. In this case, static DFT calculations are performed for individual simulated structures to decide whether the error is above a set threshold. Instead of a simple iterative approach, self-learning hybrid Monte Carlo (SLHMC) can serve as an efficient on-the-fly sampling method in which the Metropolis algorithm is based on the error of the MLP energy [33]. Since DFT calculations are performed at set intervals, SLHMC can accelerate the sampling of rare events compared to pure AIMD.
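The iterative-learning feedback loop can be sketched as follows; `dft_energy` and `ToyMLP` are deliberately trivial stand-ins for a real DFT code and potential, chosen only to make the loop runnable:

```python
def dft_energy(structure):
    """Hypothetical stand-in for a static DFT calculation."""
    return sum(structure)

class ToyMLP:
    """Toy stand-in for an MLP: accurate only for structures it has seen."""
    def __init__(self):
        self.train_set = []

    def fit(self, structures):
        self.train_set.extend(structures)

    def energy(self, structure):
        if structure in self.train_set:
            return dft_energy(structure)
        return dft_energy(structure) + 1.0  # large error when extrapolating

def iterative_learning(mlp, trajectory, threshold=0.1, max_iters=5):
    """Feedback loop: compare MLP and DFT energies along an MLP-MD
    trajectory and augment the training set with high-error structures."""
    for _ in range(max_iters):
        bad = [s for s in trajectory
               if abs(mlp.energy(s) - dft_energy(s)) > threshold]
        if not bad:
            break  # all simulated structures are described within threshold
        mlp.fit(bad)
    return mlp

mlp = ToyMLP()
mlp.fit([(0.0, 1.0)])                        # initial training set
traj = [(0.0, 1.0), (0.5, 1.5), (1.0, 2.0)]  # structures from MLP-MD
mlp = iterative_learning(mlp, traj)
assert all(abs(mlp.energy(s) - dft_energy(s)) <= 0.1 for s in traj)
```

In a real workflow the `fit` step retrains the MLP on the augmented dataset and the trajectory is regenerated with the updated potential before the next comparison.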
On the other hand, active learning can utilize estimated uncertainty as a measure of the reliability of model inference. Uncertainty in active learning can be categorized into two main types. The first is epistemic uncertainty, which arises from the lack of information in the training data for a given test example. The second is aleatoric uncertainty, which stems from the inherent noise present in the data points. Typically, the MLP training set is labeled using consistent parameters for DFT calculations, resulting in a low noise level for the labels. Consequently, the primary source of uncertainty in the MLP model is epistemic uncertainty. This implies that configurations with high uncertainty are likely to be absent from the current training data, and addressing this issue involves augmenting the training set with such configurations. The difference between iterative learning and active learning is that the former involves extensive DFT calculations throughout the trajectories, whereas the latter relies on uncertainty quantification without DFT.
There are two methods to quantify uncertainty in active learning. The first approach involves directly comparing configurations during simulation with those in the training set. In the case of NNPs, one option is to calculate the Euclidean distance between the descriptor vector of a test example and the vectors of the training data points [34]. A smaller distance suggests that a similar configuration is present in the training set. Another option is to use a Gaussian mixture model to fit the distribution of latent vectors from the training set in message-passing MLPs. By examining the likelihood of the latent vector of a test configuration within the distribution of the training set, one can quantify epistemic uncertainty [35]. In the case of the MTP, an extrapolation grade γ is introduced using the basis vector of the current structure and a matrix composed of basis functions of the current training set [36]. Using D-optimality, γ is then computed to decide whether the current image increases the determinant of the matrix (or whether the image lies in the extrapolative region).
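The descriptor-distance criterion can be sketched in a few lines; the two-dimensional descriptor vectors below are fabricated for illustration:

```python
import numpy as np

def min_descriptor_distance(test_vec, train_vecs):
    """Epistemic-uncertainty proxy: Euclidean distance from a test
    descriptor vector to its nearest neighbor in the training set.
    A small distance means a similar environment has been trained on."""
    diffs = np.asarray(train_vecs, float) - np.asarray(test_vec, float)
    return float(np.linalg.norm(diffs, axis=1).min())

train = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1]]  # descriptors in training set
seen = [1.0, 0.05]    # close to the training distribution: low uncertainty
unseen = [5.0, 5.0]   # far outside it: high uncertainty
assert min_descriptor_distance(seen, train) < min_descriptor_distance(unseen, train)
```

Configurations whose minimum distance exceeds a chosen threshold would then be flagged for DFT labeling and added to the training set.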
The second approach involves using MLP models to quantify uncertainty. The Bayesian approach is commonly regarded as the most rigorous and principled framework for this purpose. Gaussian process regression-based MLPs, such as GAP, are inherently capable of this approach, which is also adopted in VASP [37] and the FLARE package [38]. In Ref. [38], on a 5-element high-entropy alloy system, active learning reduced the root mean squared error (RMSE) of forces from 0.5 eV/Å to 0.45 eV/Å compared to random sampling. In the case of NNPs, Bayesian inference can be achieved by employing a Bayesian neural network as the regression model instead of a traditional neural network. However, Bayesian neural networks often pose computational challenges due to their high cost. One practical approximation of Bayesian inference is Monte Carlo dropout, which can be employed for uncertainty estimation in NNPs [39,40]. More simply, one can employ an ensemble of trained models for uncertainty quantification and utilize the variance of the inferred values from each model as a measure of uncertainty. Each member of the ensemble can be constructed by using different initializations of learnable parameters, employing different model structures such as varying the number of layers and nodes, or using distinct random seeds when partitioning the dataset into training and validation sets [41-44]. Both energies and forces can be used to measure the variance of the inferred values across ensemble members. When using atomic forces as an uncertainty measure, one must be careful, as the uncertainty may be inherently low due to symmetry. For example, even if the training sets consist solely of face-centered cubic (FCC) structures, the force uncertainty for body-centered cubic (BCC) structures would be low, despite these structures not being included in the training. This can be mitigated by examining atomic energies. For instance, in Ref. [41], a replica ensemble was suggested that examines standard deviations in the atomic energies as an uncertainty measure. The replica ensemble can identify atomic environments that fall outside of the training set.
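The ensemble idea can be sketched with toy one-dimensional "models" that agree on the training region and diverge when extrapolating; the functional forms are fabricated solely to illustrate the spread-as-uncertainty principle:

```python
import numpy as np

def ensemble_predict(models, x):
    """Return the ensemble-mean prediction and its standard deviation;
    the latter serves as the uncertainty estimate."""
    preds = np.array([m(x) for m in models])
    return preds.mean(), preds.std()

# Toy ensemble: three models fitted to the same data would agree near
# the training region (x ~ 1) but diverge far from it.
models = [lambda x, a=a: a * x ** 2 + (1.0 - a) * x for a in (0.9, 1.0, 1.1)]

_, std_in = ensemble_predict(models, 1.0)   # interpolation: small spread
_, std_out = ensemble_predict(models, 5.0)  # extrapolation: large spread
assert std_in < std_out
```

In an MLP context, `models` would be independently trained networks, and the per-atom energy spread (rather than the total energy) gives the environment-resolved uncertainty used by the replica ensemble.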

Thermal properties
Thermal properties, in particular thermal conductivities, are crucial for various applications such as photovoltaics [45], thermoelectric devices [46], and thermal barrier coatings [47]. As devices continue to decrease in size, effective thermal management becomes increasingly important due to issues like Joule self-heating. As such, it is useful to theoretically predict the lattice thermal conductivity of materials and the thermal conductance at interfaces.
There are three approaches to computing thermal conductivity: solving the phonon Boltzmann transport equation [48], employing equilibrium MD simulations based on the Green-Kubo theory [49], or utilizing non-equilibrium MD (NEMD) simulations [50]. Each of these methods requires a large number of computations on a supercell to achieve a converged thermal conductivity. While DFT can provide accurate results, its computational cost is prohibitively high except for high-symmetry systems. As discussed in the introduction, MLPs can serve as an efficient surrogate model owing to their linear scaling. It is crucial for MLPs to accurately represent the PES up to high-order derivatives. Since direct sampling of the high-order derivatives is computationally expensive, a sampling strategy is required that effectively recovers the derivatives while maintaining a manageable computational cost. Simultaneously, it is essential to train MLPs to achieve sufficiently low errors on the training dataset.
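The Green-Kubo route can be sketched as the time integral of the heat-flux autocorrelation function (HCACF); the prefactor and units below are schematic, and the synthetic flux merely stands in for MD output:

```python
import numpy as np

K_B = 8.617333e-5  # Boltzmann constant in eV/K

def green_kubo_kappa(flux, dt, volume, temperature):
    """Schematic Green-Kubo estimate from one Cartesian component of the
    heat flux: kappa ~ V / (k_B T^2) * integral of the HCACF over time."""
    n = len(flux)
    acf = np.array([np.mean(flux[: n - lag] * flux[lag:])
                    for lag in range(n // 2)])
    integral = dt * (acf.sum() - 0.5 * (acf[0] + acf[-1]))  # trapezoid rule
    return volume / (K_B * temperature ** 2) * integral, acf

# Synthetic flux with a finite correlation time (placeholder for MD data).
t = np.arange(2000) * 0.001
flux = np.exp(-t / 0.05)
kappa, acf = green_kubo_kappa(flux, dt=0.001, volume=1.0, temperature=300.0)
assert kappa > 0.0 and acf[0] > acf[-1]
```

In production calculations all three Cartesian flux components are averaged, and the integration is truncated once the HCACF has decayed to noise to avoid accumulating statistical error.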
Addressing these challenges, MLPs have been applied to computing thermal properties in various material classes: semiconductors for electronic devices [51-60] and thermoelectrics [61-66], metals (ignoring electron-phonon coupling) [67], two-dimensional materials [68-72], ion conductors [73], and liquids [74,75]. In addition, the conductivity of crystalline Si including vacancies was also computed [52]. The range of conductivity spans four orders of magnitude in these studies (10⁻¹-10³ W/m·K), encompassing materials with low conductivity suitable for thermoelectrics as well as materials with exceptionally high conductivity. We note that the contribution of convection requires special attention [66,73-75], which will be discussed later in this subsection. Figure 3 provides an example of how to calculate the lattice thermal conductivity of graphene/borophene interfaces and hierarchically bridge the results to model polycrystalline structures at the continuum level [72]; the first step involves performing AIMD to generate the reference data. Beyond crystalline structures and their derivatives, efforts have also been made to calculate the thermal conductivity of superionic [66,73] and liquid materials [74,75] using MLPs. In non-solid systems, the diffusion behavior of atoms becomes relevant in heat transfer processes. While the Green-Kubo relation can be utilized to determine thermal conductivity, it may introduce errors due to the inherent arbitrariness of partitioning the total energy into atomic energies. In other words, the spurious contribution of atomic energies should be corrected. One potential solution involves subtracting the partial enthalpy of each chemical species from the energy flux [75]. Other methods formulate the invariance condition of heat flux from the arbitrariness of atomic energy, thereby avoiding additional computations for partial enthalpies. For further details on these methods, which go beyond the present scope on the construction of training datasets, we refer readers to Refs. [66,74].

Diffusion properties
Diffusion is a fundamental process that significantly impacts the performance of various energy storage and conversion systems, including batteries and fuel cells [76]. In these devices, the rate of diffusion is often the limiting factor, such that diffusivity plays a crucial role in enhancing the power and efficiency of the energy systems. However, obtaining atomistic details of the diffusion process through experimental measurements is challenging, making atomic simulations a favored tool for prediction and analysis.
The diffusion process in diverse materials has been investigated directly using MD. The diffusivity is influenced by various factors such as phase, ion disorder, and temperature. Therefore, comprehensive statistics spanning length, time, and multiple ensembles are necessary for accurately evaluating diffusivity. This poses a significant computational challenge for ab initio methods, making MLPs a useful tool. In generating the dataset for MLPs, it should be ensured that diffusion events are well sampled, since they occur stochastically.
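In practice, diffusivity is commonly extracted from long MLP-MD trajectories via the Einstein relation; a minimal sketch, using a synthetic random walk in place of a real (unwrapped) trajectory:

```python
import numpy as np

def diffusivity_from_msd(positions, dt):
    """Einstein relation in 3D: D = lim_{t->inf} MSD(t) / (6 t),
    estimated here from the slope of the mean-squared displacement.
    `positions` is an unwrapped trajectory, shape (frames, atoms, 3)."""
    disp = positions - positions[0]
    msd = (disp ** 2).sum(axis=2).mean(axis=1)  # average over atoms
    t = dt * np.arange(len(msd))
    slope = np.polyfit(t[1:], msd[1:], 1)[0]    # skip the t = 0 point
    return slope / 6.0

# Synthetic diffusive trajectory: per-step variance 0.01 per Cartesian
# component gives an expected D of 0.005 (arbitrary units).
rng = np.random.default_rng(0)
steps = rng.normal(scale=0.1, size=(2000, 100, 3))
traj = np.cumsum(steps, axis=0)
D = diffusivity_from_msd(traj, dt=1.0)
assert 0.003 < D < 0.007
```

Real analyses additionally average over multiple time origins and discard the short-time ballistic regime, both of which tighten the statistics that the surrounding text emphasizes.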
As MD is an indispensable tool for studying diffusive processes, it is routinely utilized to generate training datasets for diffusion simulations. For instance, the process of MLP development for LaH2.5O0.25 is demonstrated in Figure 4 [85]. Initially, AIMD was used to provide data based on structures consistent with experimental observations, which were subsequently utilized for MLP training. Given that hydrogen diffusion is influenced by the distribution of vacancies and oxygen, it was necessary to train and validate the MLP across diverse O²⁻ configurations. If the O²⁻ dataset was identified as inadequate, the training set was further augmented.
To expedite the diffusion process, AIMD simulations were typically carried out at high temperatures ranging from 600 to 1200 K, usually higher than the target temperature. However, if the diffusion pathways substantially differ between the target and sampling temperatures, the MLP simulation may lead to incorrect diffusivities. Therefore, it is necessary to validate the diffusion simulation at the target temperatures by comparing with DFT energies at sampled instances.
The training set consists of a single NPT run or multiple NVT runs with various strains applied to the simulation cells (up to ±0.05). This can accommodate density variations, especially when the systems of interest involve changes in ion content and ion disordering [78,79,83-87]. The MD simulations typically last from 3 to 15 ps, with a sampling interval that ranges from 4 to 100 fs. It is common practice to use supercells with dimensions of at least 10 Å, containing a minimum of 64 atoms.
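The strained-cell sampling mentioned above can be sketched as follows, using a hypothetical 10 Å cubic cell; a real workflow would write each strained cell out as an input for a separate NVT AIMD run:

```python
import numpy as np

def strained_cells(cell, max_strain=0.05, n_points=5):
    """Isotropically strained copies of a simulation cell, spanning
    strains from -max_strain to +max_strain to sample density changes."""
    cell = np.asarray(cell, dtype=float)
    return [(1.0 + s) * cell
            for s in np.linspace(-max_strain, max_strain, n_points)]

# Hypothetical 10 Angstrom cubic cell; lattice vectors are the rows.
cells = strained_cells(np.diag([10.0, 10.0, 10.0]))
assert len(cells) == 5
```

Anisotropic or shear strains can be added analogously by multiplying the cell matrix with a general deformation matrix instead of a scalar.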
MD continues to be the primary sampling method for studying diffusion in liquids. Although the phase space of a liquid may seem vast due to constant diffusive motion, a few AIMD simulations conducted at temperatures above the melting point are usually sufficient. This is attributed to the fast ionic motions in the liquid phase, which allow equilibration and ergodicity to be reached quickly. The AIMD simulations typically contained 70-130 atoms with time spans of 20-30 ps. Temperatures higher than the target value were also included to expedite diffusion processes (see Section 4.2.1). This also improves MD stability: in large-scale, long-term MD simulations based on MLPs, high-energy structures can appear that were not sampled during small-scale, short-term AIMD due to the Boltzmann factor, and the MD can fail dramatically. This can be partly addressed by including AIMD at temperatures significantly higher than the target value. If the target density can vary with temperature, multiple NVT MD runs with strained cells or multiple NPT MD runs at the target pressure are conducted.

Nuclear quantum effects
Nuclear quantum effects (NQEs), such as zero-point energy and tunneling, can significantly impact the structure and dynamics of systems containing light elements, notably hydrogen. Water and aqueous systems with high concentrations of hydrogen are prime examples in which NQEs influence material properties significantly [101,102]. Path integral molecular dynamics (PIMD) is a favored method to account for NQEs. PIMD utilizes replicas of the classical nuclear system, and each replica necessitates the calculation of energies and forces. In Ref. [102], it is stated that PIMD requires at least 32 replicas, which would pose a significant computational burden for DFT. As such, PIMD with MLPs has proven effective in studying hydrogen diffusion in solids [29,86,87], water [97,103], solutions [98], and interfaces [99]. In these studies, the introduction of PIMD led to an accurate description of the quantum nature of hydrogen, explaining the unique temperature dependence of diffusivity.

Structure determination
The determination of material structures is a first step towards understanding their properties. However, experimental procedures to obtain structural information are usually time-consuming and resource-intensive. To mitigate these challenges, DFT-based methods have been developed to identify structures under given conditions. Structure prediction consists of two elements: navigating the configurational space and evaluating the energies. While several algorithms have been proposed for exploring the configurations [104-106], the computational cost associated with DFT remains a major obstacle, because the prediction requires searching through an enormous, high-dimensional configuration space. It is clear that MLPs can accelerate structure determination as a surrogate model of DFT. The challenge in training MLPs for this purpose is to collect training data efficiently when structural information is limited or even unavailable. In this section, we examine four application types related to structure determination and provide a brief overview of the training sets.

Crystal structure prediction
There have been studies applying MLPs to crystal structure prediction (CSP) of elemental materials [107-109], high-pressure phases [110-112], and multi-component materials at ambient conditions [113,115]. These studies attempted to identify the ground-state or equilibrium structure under given thermodynamic conditions by integrating MLPs with CSP algorithms such as genetic algorithms and random structure search. This approach could accelerate the structure search by several orders of magnitude compared to DFT-based methods.
The unique challenge in developing MLPs for CSP, as stated above, is to collect datasets for unknown structures. Some studies have adopted active learning schemes, where all crystal structures with high uncertainty are added to the training dataset during CSP [107-109]. However, this can be inefficient because low-energy structures are more important as training data than high-energy ones. To address this issue, the present authors developed a method to build training data for CSP using disordered phases sampled during melt-quench processes [113]. It was shown that the MLP trained with melt-quench trajectories can accurately rank the energies of diverse crystal structures of a given composition. Exploiting the efficiency of MLPs, we further developed an in-house code for CSP, called SPINNER, that is based on a genetic algorithm [114].
As depicted in Figure 5, SPINNER uses two types of datasets: melt-quench trajectories and theoretical metastable crystals. The melt-quench protocol is used to create the initial dataset. Trained on this set, the MLP is used to conduct CSP, and the crystals with low energies are iteratively added to the training dataset. Following this protocol, SPINNER achieved a success ratio of ~80% in a blind test with 60 ternary compounds. In addition, stable ternary metal oxides have been identified among unexplored anion combinations [115].

Structure identification at material boundaries
Besides crystalline phases, several studies have examined the surface structures of IrO2 [116] and metal alloys [117-119], grain boundaries [120,121], and interfaces [19,122,123] using MLPs. At these material boundaries, structural properties such as reconstruction and segregation can significantly affect the stability and activity of materials during operation. Theoretical studies can reveal the structural changes in these regions in detail, which are difficult to resolve experimentally.
In constructing training sets, it is standard practice to first identify and sample basic components such as the bulk and the (bulk-terminated) surface. Initially, bulk structures were sampled through the introduction of random perturbations to atomic coordinates or by performing brief NVT MD simulations, optionally with lattice deformations. Alternatively, NPT MD simulations were also performed. Furthermore, representative orientations or configurations of material boundaries, such as the (111) surface or symmetric tilt grain boundaries, were added to the training set. For example, in Ref. [123], it was necessary to identify the configuration of the interface to compute the interfacial energy between low-index surfaces of solid electrolytes and alkali metal electrodes. To create training sets for interfaces, the configuration that minimizes the mismatch between the two crystals (consisting of 400-500 atoms) was sought using the surfaces of interest, and the two bulk-terminated surfaces formed the interface. Subsequently, the strain caused by the mismatch was applied to the alkali metal surface, which is justified by the fact that the bulk moduli of alkali metals are typically lower than those of the binary compounds (solid electrolytes). From this interface model, the PES was sampled.
Interestingly, MD trajectories with temperatures exceeding the melting point are frequently employed in these types of studies [116–123]. This implies that high-temperature MD simulations can encompass diverse configurations of material boundaries, which is reminiscent of the SPINNER approach in the previous subsection. For instance, in Ref. [118], the energy of planar defect structures is accurately predicted, even when these structures are not explicitly included in the training set.

Phase transition and phase diagram
Predicting phase transitions and phase diagrams is crucial in the design and processing of materials [124]. MLPs have been applied to the phase transitions of oxides [58,125] and the phase diagrams of metal alloys [126,127], Ga-As [128], and MgO-CaO [129]. These studies have demonstrated that MLPs can accurately represent the PES of different phases, estimating relative free energies at the DFT level. For instance, Ref. [128] involved metallic liquid phases and semiconducting solid phases of GaAs. In Ref. [129], it was found that the eutectic phase diagram of MgO-CaO aligns well with the corresponding experimental results.
The phase transformation between two solid phases can be directly sampled using AIMD. For instance, ZrO2 [58] and HfO2 [125] undergo phase transitions at high temperatures. These transitions can be captured by MD simulations conducted at temperatures above the phase transition point. The NPT ensemble was employed because the lattice parameters change discontinuously in first-order transitions. In addition to the AIMD capturing the transition, the local PES of the two solid structures was sampled at various temperatures. This was required because both harmonicity and anharmonicity should be well sampled to estimate the subtle differences in free energies.
Constructing a full phase diagram requires a more extensive training set to determine the stability of both liquid and solid phases across varying stoichiometries and temperatures. Although the sampling space in this case is larger than when focusing solely on solid-phase transitions, the methodology is similar, as demonstrated in Refs. [126,129]. Initially, host crystalline structures with varying solute concentrations are constructed to serve as the initial configurations. These configurations are then subjected to heating, in a step-wise or continuous manner, for durations between 1 and 4 ps. The temperature range spans from room temperature to well above the experimental melting points (up to thousands of kelvins higher). This allows for a comprehensive exploration of the local vibrational modes in both crystalline and liquid structures [129]. The NPT ensemble is favored to accommodate thermal expansion.
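The step-wise heating protocol can be expressed as a simple schedule of (temperature, duration) pairs. In the sketch below, the endpoint temperatures, number of steps, and 2 ps per step are illustrative values chosen within the ranges quoted above:

```python
def stepwise_heating(t_start, t_end, n_steps, ps_per_step):
    """Return (temperature_K, duration_ps) pairs for a step-wise heating
    protocol from t_start to t_end, with equally spaced temperature steps."""
    dt = (t_end - t_start) / (n_steps - 1)
    return [(t_start + i * dt, ps_per_step) for i in range(n_steps)]

# heat from room temperature to well above an assumed melting point
schedule = stepwise_heating(300.0, 3500.0, n_steps=5, ps_per_step=2.0)
print(schedule[0], schedule[-1])  # (300.0, 2.0) ... (3500.0, 2.0)
```

Each window of such a schedule would correspond to one NPT MD segment whose snapshots are labeled with DFT; a continuous ramp is the limiting case of many short steps.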

Structure evolution
Microstructures can evolve in response to varying external conditions, such as shear deformations in boron carbides [130,131], nanoindentation of semiconductors [132], and self-healing of Li metal [133]. The studies in Refs. [130,131] examined the brittle failure of the superhard material boron carbide and elucidated the role of Al doping in activating mobile dislocations to prevent brittle failure. The MLP simulation in Ref. [132] employed a fictitious indenter to push the surface, emulating the indentation process to compute nanohardness, a property that is challenging to measure experimentally. Reference [133] explored the Li growth mechanism by directly depositing Li on surfaces, demonstrating self-healing behaviors at the surface and in the bulk.
In collecting the training datasets for these applications, reference structures were carefully curated with domain knowledge of the structural evolution. Notably, high-temperature MD simulations were conducted in most cases.

Chemical reactions at surfaces and interfaces
Investigating chemical reactions at surfaces and interfaces has been a central topic in materials science, but these processes have proven challenging to elucidate through experimental observations. As such, computational studies using DFT have been favored for understanding these processes. However, the high computational cost of DFT hinders realistic modeling of the reactions, which can be mitigated by introducing MLPs.
Describing chemical reactions poses the most difficult challenge in developing MLPs: building a training dataset for unknown reaction pathways (see Figure 1). This requires exploring a huge configuration space, which is often difficult to sample within AIMD. Furthermore, the bonding nature may change or involve charge transfer during the reactions. Below, we categorize three types of applications related to reactions at surfaces and interfaces and discuss the training sets in each case.

Surface and interface modification
The MLP simulations in this section examined the oxidation of Cu [134,135] and Ni silicidation [41], aiming to model the fabrication of nano-metal catalysts or the semiconductor processing of metal lines. The composition and structure of the surfaces or interfaces changed after the reactions, forming a new phase.
In Refs. [41,134], it was possible to generate training sets from AIMD that covered almost the entire domain of the simulation, because the activation barriers were sufficiently low. This allowed all the relevant chemical reactions to be sampled during short-time AIMD. The basic building blocks, including bulk crystals, disordered phases, surface structures, and interfaces, were also generated. In Ref. [41], for example, sets of bulk and surface structures, including potential intermetallic compounds, were included for the simulation of Ni silicidation (see Figure 6). Additionally, the interface structure was directly simulated to sample the chemical reactions in silicidation (see Figure 6(e)). The typical MD conditions are similar to those described in Sections 4.1 and 4.2. It is important to note that the interface reaction can be sampled by AIMD at an elevated temperature (higher than the target temperature) to expedite the reaction, which also makes MLPs more stable at the target temperature.

Gas-surface interaction
Several studies have been conducted on gas-surface reactions [136–139] and h-BN growth on Pt [140]. These studies elucidated how gas molecules react with the surface during dissociation and recombination. For instance, Figure 7 depicts the transition states and recombination path of CO2 on a Pt(111) surface, comparing the accuracy of the MLP to DFT [138].
Since the relevant chemical reactions may not be sampled in AIMD as directly as in the previous examples, iterative/active learning schemes have been widely adopted, both to extend the time span of MD for sampling these reactions and to eliminate the need for human intervention in deciding whether the current configuration is important. While iterative learning involves extensive DFT calculations carried out throughout the MLP trajectories to label the errors [136,140], active learning utilizes one of the methods explained in Section 3 to minimize the DFT computation as much as possible [137–139]. Typically, preliminary MLPs performed simulations under the target conditions, and the labeled configurations were added to the training sets. Additionally, temperatures can be raised to accelerate chemical reactions during data augmentation. In Ref. [139], the mean adsorption energy error was reduced from 82 meV after initial sampling to 12 meV following the final active learning cycle.
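The selection step in such active learning schemes can be illustrated with a minimal committee-disagreement filter. The functions below are schematic stubs rather than a real MLP library: configurations whose ensemble predictions disagree beyond a threshold are the ones that would be sent to DFT for labeling.

```python
from statistics import pstdev

def committee_energies(config):
    """Stand-in for the predictions of an ensemble of MLPs; a real
    committee disagrees most on configurations far from the training set.
    Here, disagreement is artificially proportional to the coordinates."""
    base = sum(config)
    return [base, base + 0.01 * base, base - 0.01 * base]

def active_learning(trajectory, threshold=0.05):
    """Flag configurations whose committee disagreement (population
    standard deviation) exceeds the threshold; only these go to DFT."""
    to_label = []
    for config in trajectory:
        if pstdev(committee_energies(config)) > threshold:
            to_label.append(config)
    return to_label

traj = [(0.1, 0.2), (5.0, 5.0), (0.0, 0.1)]
selected = active_learning(traj)
print(selected)  # only the high-disagreement configuration survives
```

In a production workflow, the trajectory would come from a preliminary MLP run under the target conditions, and the loop would be repeated until no high-uncertainty configurations remain.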

Chemical reactions with water
Many studies have applied MLPs to investigate reactions or interactions involving water, including reactions at the interface of water and solids [30,141–146] and dissolution processes in water [31,32,147,148]. MLP simulations at the water-solid interface investigated proton transfer and solvation structures, focusing on hydrogen diffusion and hydration free energy. For dissolution processes in water, the simulations revealed the mechanisms of dissolution and hydrolysis, highlighting the critical roles of molecular solvation and proton transfer.
The strategy for building training sets starts with decomposing the target structures into basic components such as molecules, bulk liquid water, and surfaces. In addition, small interface models between the solid and liquid water or vapor were also included. Furthermore, as the solute molecule interacts with water, ionic species can be generated from the initial molecules. Solvated molecules and ions were sampled using AIMD. Based on these elementary structures, the PES was further sampled with iterative/active learning.
In addition to active learning, other sampling techniques such as umbrella sampling [149] and metadynamics [150] have also been utilized, which can overcome the limitations of pure MD-based sampling. To expedite the sampling of rare events in chemical reactions, fictitious potentials can be applied to facilitate the rapid crossing of energy barriers, thereby enabling the sampling of diverse configurations relevant to the reactions [29–32]. In these studies, building an efficient bias for sampling may require prior knowledge of the events of interest.
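A minimal sketch of a harmonic bias of the kind used in umbrella sampling is shown below; the collective variable, the force constant, and the units are all illustrative:

```python
def harmonic_bias(cv, cv0, k=10.0):
    """Umbrella-sampling-style harmonic restraint on a collective
    variable cv, centered at cv0 with force constant k (arbitrary units)."""
    return 0.5 * k * (cv - cv0) ** 2

def biased_energy(energy, cv, cv0, k=10.0):
    """Total energy with the bias added; MD run under this potential
    stays near cv0, sampling configurations along the reaction coordinate."""
    return energy + harmonic_bias(cv, cv0, k)

# a window centered at cv0 = 1.5 penalizes excursions away from it
print(biased_energy(-3.0, cv=1.5, cv0=1.5))  # -3.0 (no penalty at center)
print(biased_energy(-3.0, cv=2.0, cv0=1.5))  # -1.75
```

Running a series of such windows at different cv0 values drives the system across the barrier; the biased configurations can then be labeled with DFT and added to the training set.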

Concluding remarks
In this review article, we outlined sampling strategies with regard to the target simulation. We examined four classes of MLP applications, arranged in order of increasing complexity. Regardless of the application type, AIMD simulations were always useful for structure sampling, particularly at high temperatures, which activate the relevant processes within the short simulation time. We also note that disordered structures such as liquid and amorphous phases appear to contain local information relevant to stable and meta-stable structures of crystals, surfaces, and interfaces. For complex, multi-component systems, practitioners first identify baseline structures such as crystals and clean surfaces and then augment the training set with small-scale AIMD or iterative/active learning.
There are still remaining challenges in obtaining training sets for complicated reactions in multi-component systems. Future works should address this at a reasonable cost of DFT calculations. Lastly, we remark that sampling methods for MLPs are still an evolving field, and new techniques are constantly being developed. We expect that in the near future, automatic generation of training sets according to application type will become feasible.
Seungwu Han earned his PhD in Physics from Seoul National University, where he currently serves as a professor in the Department of Materials Science and Engineering. His specialization lies in the application of computational materials science to solid-state physics, utilizing density functional theory and machine learning potentials. He has developed original packages, including AMP2, an automated program for DFT calculations; SIMPLE-NN, for generating machine learning potentials; and SPINNER, for predicting crystal structures. Additionally, he established SNUMAT, a database devoted to providing accurate material bandgaps, taking into account stable magnetic orderings when necessary.

Figure 1 .
Figure 1. (a) Schematic representation of the relationship between the training set and target simulation in configuration space. (b) The difficulties of constructing training sets according to the target phase or simulation type.

Figure 2 .
Figure 2. Illustration detailing the various atomic structures within a nanoparticle.
generate training sets for MLPs. Once MLPs were trained and validated, large-scale NEMD was conducted to calculate the thermal conductance at interfaces. This conductance then served as a parameter in continuum-level simulations, which could model the dependence of the effective conductivity on the domain size in graphene/borophene heterostructures. As in the aforementioned example, MD is most commonly used for generating the training dataset, as MLPs can learn high-order derivatives from multiple configurations derived from MD. The typical MD conditions for solid-state materials are as follows: The starting structure consists of a supercell containing 60-128 atoms. NVT ensembles are employed with the lattice parameters optimized by DFT. If the thermal conductivity at different volumes or the coefficient of thermal expansion is required, NVT ensembles with strained cells are also included or NPT ensembles are used. The temperature range spans from 50 to 1000 K, with 3-5 temperatures selected at intervals of at least 100 K. Low-temperature MD samples harmonic vibrations, while high-temperature MD can sample anharmonicity. The simulation temperatures should be below the melting point (or the phase transition temperature) because data from different phases are rather redundant. The simulation time ranges from 1 to 20 ps, with sampling intervals of 4-100 fs. To save computational costs in AIMD, a two-step approach can be employed. First, trajectories are acquired using low-accuracy DFT calculations (e.g. the default cutoff energy for the plane-wave basis, Γ-point sampling, and loosened self-consistency criteria) or classical potentials. Then, accurate energy and force information is obtained at sampled snapshots with high-accuracy DFT parameters.
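The sampling conditions above translate into a simple snapshot budget per temperature. The sketch below computes this budget for illustrative values chosen within the quoted ranges (four temperatures, 10 ps runs, 50 fs sampling interval):

```python
def sampling_plan(temps, sim_time_ps, interval_fs):
    """Number of snapshots collected per temperature for an AIMD run of
    sim_time_ps picoseconds sampled every interval_fs femtoseconds."""
    per_temp = int(sim_time_ps * 1000 // interval_fs)  # 1 ps = 1000 fs
    return {t: per_temp for t in temps}

# e.g. four temperatures between 50 and 1000 K, 10 ps runs, 50 fs interval
plan = sampling_plan([100, 400, 700, 1000], sim_time_ps=10, interval_fs=50)
print(plan[100], sum(plan.values()))  # 200 snapshots each, 800 in total
```

Such an estimate is useful when budgeting the DFT cost of the second, high-accuracy labeling step of the two-step approach.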

Figure 3 .
Figure 3. Main steps in the ab initio multiscale modeling framework for simulating the lattice thermal conductivity of graphene/borophene heterostructures (used with permission of the Royal Society of Chemistry, from Ref. [72], copyright 2023; permission conveyed through Copyright Clearance Center, Inc.).

Figure 5 .
Figure 5. Schematic representation of the CSP workflow in SPINNER. Top left: initial training of an MLP is carried out on disordered structures, sampled from melt-quench-annealing trajectories of a given composition with AIMD. Top right: the initial MLP undergoes iterative refinement using low-energy theoretical structures acquired during 400 generations of structure search. Bottom left: the structure search is performed by the refined MLP for up to 5000 generations, with the quality of the MLP assessed every 1000 generations. Bottom right: the energies of the final candidates are evaluated using DFT calculations (adapted from Ref. [114] under CC BY 4.0).

Figure 6 .
Figure 6. Collection of the foundational constituent dataset for the target reaction. (a) The target simulation of the interface reaction, (b) bulk crystals, (c) bulk liquid and amorphous structures, (d) surfaces, and (e) interfaces (adapted with permission from Ref. [41]. Copyright 2023 American Chemical Society).

Figure 7 .
Figure 7. Comparison of energies between MLP and DFT along a chemical reaction pathway of CO2 generation (reprinted with permission from Ref. [138]. Copyright 2023 American Chemical Society).