Configuration sampling in multi-component multi-sublattice systems enabled by ab Initio Configuration Sampling Toolkit (abICS)

ABSTRACT Simulation of the intermediate levels of disorder found in multi-component multi-sublattice systems in various functional materials is a challenging issue, even for state-of-the-art methodologies based on first-principles calculation. Here, we introduce our open-source package ab Initio Configuration Sampling Toolkit (abICS), which combines high-throughput first-principles calculations, machine learning, and parallel extended ensemble sampling in an active learning setting to enable such simulations. The theoretical background is reviewed in some detail followed by brief notes on usage of the software. In addition, our recent applications of abICS to multi-component ionic systems and their interfaces for energy applications are reviewed as a demonstration of the power of this approach. GRAPHICAL ABSTRACT IMPACT STATEMENT We review our machine-learning accelerated configurational sampling approach along with original software that enables treatment of order/disorder in many-component multi-sublattice material systems beyond the reach of conventional methods.


Introduction
Inorganic crystalline materials for applications including energy conversion/storage, catalysis, nanoelectronics, and structural materials always contain some amount of impurities, lattice defects, and configurational disorder that depend on the processing conditions.Very few 'pure' materials are used in actual applications; rather, introduction of impurities and/or disorder is employed almost ubiquitously for improving the properties and functionalities of various materials.For example, insulating materials are turned into ion conductors for solid state batteries and fuel cells through impurity doping and defect engineering [1][2][3][4], and high-entropy (i.e., disordered) alloys are under intense research for magnet and structural applications [5].Control of atomistic configurations is of interest also for phonon engineering [6].
When the material can be considered a random alloy, the coherent potential approximation (CPA) [7][8][9] combined with the Korringa-Kohn-Rostoker method [10,11] is a very fast and powerful electronic structure method for properties prediction from first principles.Another approach that can be used in standard density functional theory codes is the supercell approach based on special quasi-random structures [12,13].
These methods enable single-shot calculations of a completely random alloy.
On the other hand, complex ionic solids often show strong defect/impurity association due to Coulomb interactions between ions with varying charges.This means that there is some amount of 'order within disorder', and the assumption of a completely random alloy breaks down even at sintering temperatures nearing 2000 K [14].Moreover, it has been pointed out recently that significant levels of short-range order exists even in high-entropy alloys with a profound effect on the thermodynamics and the resulting magnetic and mechanical properties [15].In these cases, thermodynamically relevant lattice configurations need to be elucidated before properties prediction.An exhaustive calculation over all possible configurations is impossible except when impurities and defects are very dilute, and in such cases, various importance sampling methods are employed.First-principles molecular dynamics can be used for sampling structures of quickly-relaxing (i.e., liquid) systems, but it is nearly useless for most solid-state systems with high energy barriers between different configurations.On the other hand, Monte Carlo (MC) algorithms are not bound by the physical relaxation times because they can employ unphysical trial steps such as the swapping of atom positions.Moreover, in systems that can be mapped onto a lattice, it is often sufficient to sample configurations that are relaxed starting from an ideal lattice with varying occupations of each site by different atomic species or vacancies.Thus, the natural way to go would be to combine MC simulations with first-principles relaxation and energy calculations.However, because of the high computational cost of first-principles calculations, such reports are still very few [14,[16][17][18][19]; instead, the standard way is to fit a cluster expansion Hamiltonian to first-principles results on a limited number of configurations and then use that effective model for MC sampling [20][21][22][23][24][25][26][27].
Although cluster expansion has been very successful, its application to manycomponent multi-sublattice ionic systems is often impractical due to combinatorial explosion in the necessary number of terms (i.e., clusters) in the expansion (see Ref. [28] for a recent reformulation and review of the cluster expansion method with an emphasis on application to multicomponent ionic materials).To deal with this, we proposed to use more flexible non-linear models, i.e., machine-learning potentials that have seen explosive development in recent years for enabling molecular dynamics simulations with near-first principles accuracy at a small fraction of the cost.The nonlinearity of the model should, in principle, enable higher accuracy with less complex descriptors than linear models such as cluster expansion.Moreover, machine learning potentials usually map continuous coordinates to continuous energies, but we showed that it can map coordinates on a lattice before relaxation to the total energy after relaxation [29], which is the only information needed for thermodynamic sampling on a lattice.This affords a speedup by a factor of 10 to a few hundred compared to performing relaxations on each configuration encountered during MC sampling.
Based on the above ideas, we have been developing an open source framework abICS (ab Initio Configuration Sampling), which facilitates the combination of highthroughput first-principles calculation, training of the machine learning configuration energy model, parallel extended-ensemble MC sampling using the trained model, and iterative improvement of the model through active learning cycles.In this Review, we report the basic algorithms and features implemented in abICS, some details of the installation and usage of the code, and a few example applications.Finally, we will conclude with some comments on future development of abICS.

Active learning workflow using abICS
The workflow of the active learning method implemented in abICS is shown in Fig. 1.A typical procedure is outlined as follows, which involves Python programs abics mlref, abics train, and abics sampling which are part of the abICS package.

Generation of initial training dataset of random configurations (abics mlref)
abics mlref is used to generate random configurations on an ideal lattice and prepare input files for first-principles relaxation and energy calculations using VASP [30], Quantum ESPRESSO [31], or OpenMX [32].The user is tasked with setting up how to run the first-principles calculations: how many CPU cores (or even GPUs) to be used for each configuration, how many configurations are calculated in parallel, adapting job submission scripts for their supercomputer center to enable high-throughput calculations, etc.After finishing the calculations, abics mlref is rerun to convert the results into a common format independent of the first-principles calculation code.

Neural network Training (abics train)
After conversion of the first-principles calculation results using abics mlref, abics train is used to train the configuration energy model on those results.Currently, abics train interfaces with aenet [33,34] for training a neural network model.

Monte Carlo sampling (abics sampling)
Next, abics sampling is used to perform replica exchange Monte Carlo (RXMC) sampling [35] or population annealing Monte Carlo (PAMC) sampling [36] using the trained neural network configuration energy model (the algorithms are explained in Sec.3.2).abics sampling can perform canonical sampling, which maintains a fixed number of atoms, as well as grand canonical sampling, which allows for changes in the composition.The latter is still in the testing phase.At each MC sampling step, abics sampling invokes a model calculator (only aenet is supported as of present) to evaluate the energy of sampled configurations.

Validation and iterative refinement (abics mlref)
After MC sampling, abics mlref is used to take a subset of configuration samples from the MC results and produce corresponding input files for first-principles calculations.The user runs these calculations, then abics mlref is run again to convert the results to a common format to be read by abics train.At the same time, a plot that compares the configuration energy model to the first-principles results are produced, which the user can use for validation of the model accuracy.If the accuracy is insufficient, the data is added to the training data set, and the procedure is repeated from the training step (Sec.2.2).This iterative active learning process refines the neural network model, improving its accuracy with the number of iterations.

Production Monte Carlo run (abics sampling)
Once a neural network configuration energy model with sufficient accuracy is obtained, it is recommended to perform a production MC run using an expanded supercell and/or a larger number of sampling steps.abics sampling can be used for this task.Generally, we have found that the number of MC steps needed for obtaining converged statistics (e.g., short-range or intermediate-range order parameters, similarity measures, ion concentration profiles, etc.) is much larger than that necessary for generating sufficient training data during the active learning iterations.

Algorithms
In this section, we describe the main algorithms implemented in abICS.

On-lattice configuration energy model
As mentioned above, abICS relies, at present, on aenet code [33,34] to train the energy model.aenet implements the high-dimensional neural network potential (NNP) originally proposed by Behler and Parinello [37][38][39], which assumes that the total energy can be expressed as a sum of atomic energies that are determined by the environment around each atom and uses neural networks to fit these atomic energies.The same network is used for the same atomic species, so the total energy is expressed as where ⃗ σ Rc i represents the environment around atom i within a finite cutoff distance R c , f is a fingerprint function that maps the atomic environment to a multidimensional descriptor vector, t i is the species type of atom i, and NN ti denotes the neural network for species t i that maps the environment descriptor vector to an atomic energy.As the fingerprint function, aenet, and thus abICS, support symmetry functions [37][38][39] and Chebyshev descriptors [34].In the abICS scheme, the same model is used to map a configuration on an ideal lattice to the energy after relaxation, so that the total energy after relaxation may be expressed as where E rel (⃗ σ) represents the energy after structural relaxation from a starting structure ⃗ σ, f [⃗ σ Rc i ] represents atomic environment descriptors for the starting structure, and NN rel ti represents atomic energy contributions to the relaxed total energy.The training and use of the model is restricted to ⃗ σ ∈ {⃗ σ lattice }, where {⃗ σ lattice } represents the set of ideal lattice structures with configurational disorder.
An important point to note is the option, or sometimes the necessity, to ignore one or more species in the on-lattice model.In the case of a two-component alloy with fixed composition, it is possible to train the on-lattice model to predict the energy based on the occupation of only one of the components.It is also advisable to ignore a component when that species has the same occupation of sites in all considered configurations; a typical example is when considering disorder only in the cation sublattice in multi-component oxides.Ignoring a component is often also necessary because machine learning usually starts by normalizing the variances of descriptors.This means that if all atoms of a certain species has the same local environment, one can end up with NaN (Not a Number) errors due to zero variance.

Monte Carlo method
One of the main features of abICS is to generate atomic configurations obeying the canonical distribution p(⃗ σ; β) = exp(−βE(⃗ σ))/Z(β) at a given temperature T , where ⃗ σ denotes a configuration, E(⃗ σ) is the energy of the configuration, β = 1/T is the inverse temperature, and The Markov chain Monte Carlo (MCMC) method generates a series of configurations { ⃗ σ t } efficiently.(t denotes an index of MC steps.)In the MCMC method, a trial configuration ⃗ σ ′ is proposed with a transition probability T (⃗ σ → ⃗ σ ′ ).The trial configuration is chosen by slightly modifying the current one ⃗ σ = ⃗ σ t , for example, by exchanging the positions of two atoms.Then, the candidate configuration is accepted with the acceptance probability p ), the distribution of the generated configurations {⃗ σ t } converges to the desired distribution, p(⃗ σ) = W (⃗ σ)/Z after a sufficiently large number of configurations are generated. 1The expectation value of an observable A at the inverse temperature β can be obtained simply by taking the average over the generated configurations; where N MC is the number of the whole MC steps and N th is the number of the burn-in steps.
To calculate the proper acceptance probability, the Metropolis-Hasting method is usually used.In the symmetric case, . This means that the MCMC method can escape from the valley (local minima) with an energy depth ∆E ∼ T .In other words, the MCMC method tends to get stuck in local minima, particularly at low temperatures.
In the past few decades, several advanced algorithms have been developed to overcome this problem, and the extended ensemble algorithm is a general one among them.This algorithm extends the configuration space and/or modifies the distribution, for example, by changing the temperature as a configuration parameter.At present, abICS implements two types of extended ensemble algorithms: The replica exchange Monte Carlo (RXMC) method [35,40] and the population annealing Monte Carlo (PAMC) method [36,41].Both algorithms suit massively parallel computers, although it has been suggested that the PAMC method is more suitable than RXMC method for massively parallel implementations [42].We briefly introduce them in the following.

Replica exchange Monte Carlo method
In the RXMC method [35,40], N MCMC random walkers (replicas) at different inverse temperatures β 1 , β 2 , . . ., β N run in parallel.At preset intervals, the temperatures are exchanged according to the following procedure; a replica i with the energy E i and the inverse temperature β i and another replica j with E j and β j swap the temperatures with the probability p = min 1, e −βiEj e −βjEi e −βiEi e −βjEj = min 1, e (βi−βj)(Ei−Ej) .
Focusing on one replica, the temperature of the replica can increase and decrease through the simulation, and hence the replica that is trapped in a local minimum has a chance to escape when the temperature becomes increased.Since the exchange probability is derived by the Metropolis-Hastings algorithm, after a sufficiently large number of MC steps, the exchange operation does not distort the distribution, and therefore the generated configurations with temperature β i obey the canonical distribution p(⃗ σ; β i ).

Population annealing Monte Carlo Method
The population annealing Monte Carlo (PAMC) method [36] extends the simulated annealing (SA) method.In SA, the MCMC method starts at a high temperature and the temperature is gradually lowered.This simple procedure is widely used to search for the configuration with the lowest energy, that is, the ground state.The temperature-decreasing process, however, leads to a distortion in the distribution of random walkers, deviating from equilibrium.This is why the SA method is not suitable for generating configurations at each temperature; for that, a burn-in process would be required for every temperature increment.
To address this issue, the annealed importance sampling (AIS) method [41] is proposed.The AIS prepares N independent replicas and runs the SA method on these replicas with β = 0 in parallel.The AIS method introduces additional weights w i called Neal-Jarzynskii (NJ) weights to each replica i in order to compensate for the distortion of the distribution.The initial value of w i is 1, and on decreasing the temperature from β to β ′ > β, w i is updated as By using the NJ weights, the expectation value of an observable A under the inverse temperature β is the weighted average over the replicas; Since replicas with lower energies tend to have larger NJ weights, the variance of the NJ weights increases as the simulation progresses particularly when the energy landscape has many local minima.It reduces the number of effective samples and results in poor computational accuracy.The PAMC method mitigates this problem by introducing resampling based on the NJ weights of the replicas.After each resampling, all NJ weights are reset to unity.This resampling step helps stuck replicas escape from local minima and reduces the variance of the weights.

Free energy
The free energy F , which is connected with the partition function as Z = e −βF , is a crucial quantity to investigate the stability of the model because nature favors the model with lower free energy.While the partition function itself is difficult to calculate by the MC method, the ratio of the partition functions between two temperatures can easily be calculated as The partition function in the high-temperature limit (β = 0) is just the volume of the configuration space Ω, and hence the partition function at other temperatures can be calculated as Z(β) = ΩZ(β)/Z(0).Unfortunately, the accuracy of the Eq.( 8) becomes worse as the difference between temperatures becomes larger.This is because the energy range contributing to the partition function depends on temperature; generally, this is a small region around the expectation value of energy ⟨E⟩ β .To overcome this problem, Eq. ( 8) is evaluated iteratively as where The RXMC method and the PAMC method generate the configurations for several temperature points, and thus this algorithm can be easily applied in these methods.

Representation of the structure
The basic unit cell that defines the ideal lattice with configurational disorder can be separated into a "base structure" whose atom occupations are held fixed during MC sampling and one or more "defect sublattices" on which MC sampling is performed (Fig. 2(a)).Internally, a defect sublattice is represented by an array of site coordinates along with a list of "atom group" types that reside on each site.In abICS, an atom group can be a single atom, a group of atoms (e.g., a molecule residing on a lattice site) or a vacancy.Multiple sublattices with different sets of atomic species can be defined.For example, it is natural and often more efficient to define cation and anion sublattices for ionic solids.It is also possible to define a single sublattice with anion and cation mixing, but in many systems, anions residing on cation sites and vice versa are unstable and contribute very little to the thermodynamics.The internal representation of sublattices described above enables easy implementation of configuration sampling on a lattice.For random configuration generation in the initial step of active learning, the atom group list of each defect sublattice is randomly shuffled and converted to input for first-principles calculation code.In Monte Carlo sampling, trial atom swaps or orientation changes of atom groups (Fig. 2(b)) are performed by swapping elements of the atom group list and converting to input for the configuration energy calculator.

Structure mapping
The success of a lattice model depends on a good mapping between the structures before and after relaxation.However, in ionic systems especially with vacancies, atoms sometimes relax out of their initial sites into adjacent sites, breaking the one-to-one mapping mentioned above.That is, the relaxed structure sometimes does not correspond to the internal representation that was used to generate the initial structure for relaxation.In this case, it is more appropriate to remap the relaxed structure to the lattice [28].To achieve this, abICS uses a simple distance-based mapping for each sublattice.The remapped structures on the rigid lattice and the corresponding relaxed energies are then used for training the configuration energy model.

Constraints for avoiding inaccessible configurations
Relaxation out of initial sites mentioned above implies that there are lattice representations that are inaccessible after relaxation [28].These inaccessible configurations may still be sampled by Monte Carlo algorithms, which can be an issue because no training data can be accumulated for such configurations.If the energy model happens to predict a low enough energy for such configurations, they will be sampled and the  resulting thermodynamics will be distorted to varying degrees.On the other hand, if it is possible to predict inaccessible configurations from, e.g., chemical intuition, one can impose constraints on the structures that are accepted by Monte Carlo sampling and reject the inaccessible configurations.abICS provides a functionality to impose such constraints.

Installation
abICS is implemented in Python3 (version 3.7 or later), and relies on the following external Python libraries: NumPy [43], SciPy [44], toml [45], mpi4py [46], pymatgen (version 2019.12.3 or later) [47], and qe-tools [48].For installing mpi4py, which handles parallel calculations, the user needs to have a working message-passing interface (MPI) library installed in the system.Since abICS has been registered in Python Package Index (PyPI) repository, users can install abICS easily by the following command:

$ python3 -m pip install abics
If the installation should be done locally, e.g. when the user has no permission to install files outside of their home directory, --user option may be added to the command.
The source code of abICS is available from GitHub repository.Users can download the zipped source archive from the release site, or clone the repository by $ git clone https://github.com/issp-center-dev/abICS.gitAfter unpacking the source archive and moving into the top of the source tree, users can install abICS by typing $ python3 -m pip install .
The installation directory may be explicitly specified by --prefix=DIRECTORY option in which DIRECTORY is the path to the directory where abICS will be installed.
In addition, aenet has to be installed for neural network training and evaluation.We provide interfaces to aenet via either a file I/O-based method (write out aenet input, run aenet as a subprocess, and parse the output file) or through a LAMMPS [49] Python interface which doesn't rely on file I/O.The latter is optional but is generally faster in the MC sampling stage managed by abics sampling.To enable the Python interface, the user must install aenet-LAMMPS [50] as well as the LAMMPS Python module.It is also necessary to have a working installation of Quantum Espresso, VASP, and/or OpenMX for calculating the reference training data.

Input file format
The overall active learning procedure is controlled by an input file in TOML format, an easy-to-read configuration format used in many software projects.The parameters for the first principles calculations (e.g., k-point sampling, plane wave energy cutoff, relaxation algorithm, etc.) and the neural network training/evaluation (neural network structure and various hyperparameters) are controlled by separate files following the formats of the specified calculation code.We will refer to the files as "reference input files".Also, any program that calculates physical properties such as the energy from a configuration is referred to as a "solver"; this includes first-principles calculation codes and neural network evaluation codes.
The input parameter file of abICS consists of five blocks, called sections hereafter, that classify the parameters in categories with labels enclosed by brackets.In the following, we briefly describe the specifications of each section.Further details can be found in the online manual [51].

[sampling] section
The sampling section specifies the parameters for Monte Carlo sampling: for example, when using the RXMC method, the number of replicas, the temperature range, the number of Monte Carlo steps, the replica exchange trial frequency, etc. are given.It contains the [sampling.solver]subsection that specifies the type of energy calculator (referred to as "solver" hereafter) used in the sampling.When performing active learning, aenet is the only choice, but abICS also partially supports direct combination of first-principles solvers (Quantum Espresso, VASP, and OpenMX) with MC sampling.In addition, the path to the solver and the directory containing reference input files are specified.An example of [sampling] section for RXMC sampling with 8 replicas is shown as follows (please see the online manual for PAMC parameters): [sampling.solver]type = "aenet" path= "predict.x"base_input_dir = "./baseinput"perturb = 0.0 run_scheme = "subprocess" ignore_species = ["O"] In this example, type is set to aenet in order to use the neural network model.path specifies the path to predict.x of aenet library that estimates the value of energy corresponding to the structure.The directory that contains the reference input file is given by base input dir.perturb is a parameter for randomly shifting atomic positions prior to structural optimization.This is useful for avoiding high-symmetry saddle points when relaxing the system but should be set to zero as in this example when using the on-lattice model.run scheme specifies the way to invoke the solver.In this example, abICS starts predict.xusing subprocesses.ignore species specifies the atomic species to ignore in on-lattice models as discussed in Sec.3.1.

[mlref] section
The mlref section specifies options applied when the atomic configurations are retrieved from the results of Monte Carlo calculations.These parameters are used, for example, to evaluate the accuracy of the neural network models, or to extend the training data.An example of [mlref] is shown as follows: It contains a subsection named [mlref.solver]to specify the parameters for the solver used to generate training data.The format is the same as that of [sampling.solver].An example is shown below: [mlref.solver] type = "qe" base_input_dir = "./baseinput_ref"perturb = 0.05 ignore_species = [] In this example, type is set to qe in order to use Quantum ESPRESSO.path specifies the path to pw.x, the self-consistent field (SCF) energy solver of Quantum ESPRESSO.The directory that contains the input parameter files specific to each solver is given by ./baseinputref. perturb is set to 0.05 to apply shifting atomic positions prior to structural optimization in units of angstrom.ignore species is set to an empty list to take all species into account during the first-principles calculations.

[observer] section
The [observer] section contains the settings for the physical observables to be measured.It contains subsections referring to the specific types of observables.Examples of [observer] subsections are given as follows: [observer] [observer.similarity]reference_structure = "./MgAl2O4.vasp"ignored_species = ["O"] [[observer.solver]]name = "obs1" type = "aenet" path= "predict.x"base_input_dir = "./baseinput_obs"perturb = 0.0 run_scheme = "subprocess" ignore_species = ["O"] [[observer.solver]]name = "coordnum" type = "user" function = "mymodule.coordnum" In the above example, two types of observables are shown: [observer.similarity]and [observer.solver].The former specifies options for evaluating the similarity of atomic configuration, defined by the ratio of atoms of each element to be found at the same place as those of the reference state.The latter specifies the physical quantities to be evaluated by using the solver.In this example, the first [observer.solver]means that an observable, "obs1" of the configuration is calculated by using the aenet model.A practical example of using this functionality is when a separate on-lattice model is trained to predict physical properties such as the lattice volume [52].The second [observer.solver]means that another observable, "coordnum", is calculated by using a Python function mymodule.coordnumdefined in the following file named mymodule.py2, from pymatgen.coreimport Structure def coordnum(st: Structure) -> float: # write code for calculating, e.g., # coordination number from pymatgen structure # and substitute result into n ... return n

[config] section
[config] section specifies the base structure and one or more defect sublattices explained in Sec.3.3.The contents of this section are briefly described as follows: • [config] specifies the atomic position.The simulation cell (in units of angstrom) is given by unitcell and supercell.Constraints on the structure to avoid inaccessible configurations (Sec.3.5) can be imposed by the optional constraint parameter that specifies a user-defined function to determine whether a given structure is allowed or not.
• [[config.basestructure]] specifies the atoms that are left fixed during the Monte Carlo calculations.type specifies the atom species, and coords specifies the fractional coordinates of the atoms.coords is a list of three-component arrays.It can be given by a string that represent a numeric matrix with three columns separated by white spaces and as many rows as the number of atoms.• [[config.defectstructure]] specifies the position of the atoms that move during the Monte Carlo calculation.• [[config.defectstructure.groups]]specifies the name and number of atoms or atom groups to be moved in the MC calculation.
An example of [config] section is as follows: In this example, abICS will sample the possible configurations of 16 Al and 8 Mg atoms on the lattice specified by [[config.defectstructure]].The constraint on the configurations is imposed by the function constraint func defined in the file constraint module.pywhich takes a pymatgen Structure object and returns True if the structure is acceptable and false if it should be rejected.A trivial example of the constraint function reads as follows: from pymatgen.coreimport Structure def constraint_func(st: Structure) -> bool: return True This constraint functionality was used to constrain the position of interstitial protons to be next to occupied oxygen sites in Ref. [52].
Since writing all the coordinates by hand is quite tedious, we also provide a preprocessing tool st2abics to convert structure formats supported by pymatgen into a template TOML file with the config section filled in.

Output files
The calculation process of abICS consists of alternating executions of training data generation steps, training steps, and Monte Carlo sampling.The progress of steps is recorded in ALloop.progress.The output files generated during these steps are summarized in this section.
In the training data generation steps, abics mlref first populates the ALn directory with subdirectories containing input files and initial structures for first-principles solvers, where n stands for the active learning iteration count staring from zero.abics mlref also generates a list of these directories in rundirs.txt,and the user is tasked with running first-principles solvers in those directories.After the solvers are run, the results are postprocessed by abics mlref for the next training step.Also, the correspondence between the previous on-lattice model prediction and the firstprinciples solver results are written in files named energy corr.datfor validation of the on-lattice energy model.
In the training steps using abics train, the input and output files are stored in a directory train0.The generated on-lattice model is placed in the input directory specified in [sampling.solver].
In the Monte Carlo sampling steps using abics sampling, the input and output files are stored in a directory MCn where n corresponds to the index of the preceding AL step.When the calculation of MCn step finishes, the results are outputted into files of several types.They are summarized as follows.For each MC replica, the files listed below are generated in the individual replica directory: • structure.NNN.vasp:The atomic coordinates for each step are stored in the POSCAR file format of VASP.Here, NNN corresponds to the MC sampling step.• minE.vasp:The structure having the lowest energy among the samples in this replica is stored in POSCAR format.• obs.dat:The temperature and the total energy for each step is shown in units of eV.• obs save.npy:The energy in units of eV at each step is written in the Numpy array format.• kT hist.npy:The temperature in units of eV at each step is written in the Numpy array format.
• Trank hist.npy:The history of the temperature index along the steps is written in the Numpy array format (only for RXMC).• logweight hist.npy:The logarithm of the Neal-Jarzynski weight at each step is written in the Numpy array format (only for PAMC).• acceptance ratio.dat:The acceptance ratio of the Monte Carlo sampling at each temperature is summarized in the text format.
The measured physical observables are written to the files with their labels as filenames.They contain the values of temperature, the average and the statistical error.In logZ.datfile, the logarithms of the partition function log Z i /Z 0 with i the of temperature are written.

Applications
Here, we review some recent applications of abICS to demonstrate the power of the approach.  .The degree of inversion calculated using abICS on a 192-cation cell compared to the cluster expansion literature [53,54] and direct RXMC sampling in combination with first-principles calculation in a 48-cation cell [17].
Monte Carlo sampling of the lattice configuration problem is naturally suited for predicting order-disorder transitions.Thus, as a first demonstration of the abICS active-learning approach, we calculated the degree of cation disorder in A  [29].In an ordered spinel, the divalent A cation is tetrahedrally coordinated by oxygen, while the trivalent B cation is octahedrally coordinated.With increasing temperature, there can be some degree of inversion (DOI) between the two sites, which is defined as the ratio of tetrahedral sites occupied by B cations.There are also some A/B combinations that result in almost complete inversion, and MgGa 2 O 4 is one of them.Fig. 3 shows the calculated DOI compared with some cluster expansion literature [53][54][55] and direct RXMC sampling on first-principles energies [17].The results are in excellent agreement except for some quantitative difference in the DOI of MgGa 2 O 4 , which we speculate is due to the small unit cell used for training the cluster expansion model.The improvement in the accuracy of the neural network configuration energy model with the number of active learning cycles is shown in Fig. 4. The prediction errors decrease to less than 1 meV/cation using less than 1000 DFT samples of the 192-cation cell, which can be calculated within a few hours on a modern supercomputer system.We refer the reader to Ref. [29] for further details including analysis of the influence of lattice relaxation, training set cell size, and a detailed comparison with random training set generation.

Hydration of Sc-doped BaZrO 3
The spinel DOI problem was chosen as our first demonstration because it was beneficial to have a benchmark result using a conventional method, i.e., cluster expansion.More recently, we have been applying abICS to much more difficult systems that would be virtually impossible to treat using cluster expansion.One such system is acceptor-doped BaZrO 3 , which is known as a promising proton conductor for fuel cell applications [3,4].
Doping of a trivalent cation such as Y and Sc on the tetravalent Zr site induces the formation of oxide ion vacancies to satisfy charge neutrality.These oxygen vacancies act as active centers for the hydration reaction under humid atmosphere, which results in the filling of oxygen vacancies and introduction of mobile protons into the system.It is well known that there is a strong sample dependence of the amount of water uptake in this class of materials, and researchers have been trying to understand this behavior in the hopes of optimizing the proton carrier concentration.
A hypothesis is that there are active and inactive oxygen vacancies depending on the vacancy environment, and the number of each environment would depend on the dopant type and processing conditions; when considering the nearest neighbor site, there are three types of oxygen vacancies in the Sc-doped case: Sc-V O -Sc, Sc-V O -Zr, and Zr-V O -Zr.From the experimental side, probing of the changes in the local environment upon hydration requires careful in situ measurements, and a clear picture is yet to be attained.From the theoretical side, it may seem straightforward to perform Monte Carlo calculations to simulate which vacancy types are filled as hydration proceeds, but this is very challenging because of the complexity of the material; Sc and Zr reside on the Zr site, O or V O reside on the O site, and protons or proton vacancies can be introduced in interstitial positions, making this a multilattice six-component configuration problem that is virtually impossible to handle using conventional methods.In Ref. [52], we used abICS to tackle this issue.Fig. 5 shows the extremely high prediction accuracy of our on-lattice neural network model for 22 at% Sc-doped BaZrO 3 with varying levels of hydration.The model was used to compute the hydration behavior as a function of hydration levels and temperature (Fig. 6), and careful comparison with in situ X-ray absorption and thermogravimetry measurements led to the following clarification: Water intake with decreasing temperature under constant vapor pressure starts with the filling of Sc-V O -Zr sites; once these sites are almost completely filled, Sc-V O -Sc sites start to be occupied.Zr-V O -Zr, which was predicted to be the most active from hydration enthalpy calculations, were shown to exist in negligible amounts under realistic equilibrium conditions and contributes very little to the hydration behavior.

Interfacial space charge at solid-state electrochemical interfaces
Another topic that we have been applying abICS to is the issue of the interfacial space charge at solid-solid interfaces.The concept of space charge is well known in semiconductor physics where a layer of electrons or holes is formed to align the Fermi level across interfaces.The concept can be extended to include ionic defects, which are often the dominant charge carriers when considering interfaces in, e.g., solid oxide fuel cells or solid state Li-ion batteries.Recent years have seen increasing interest in understanding and experimentally probing such effects in energy conversion devices [56][57][58][59], but success has been limited to very few state-of-the-art electron microscopy techniques [60].From the theory side, phenomenological modeling based on the Poisson-Boltzmann model has provided some insight [61][62][63][64].However, explicit atomistic modeling has been limited to examination of position-dependent defect formation energies in the dilute limit [65,66], and calculations at realistic carrier concentrations are lacking.This is because efficient thermodynamic sampling requires a lightweight effective model, and conventional models such as cluster expansion are very difficult to converge at surfaces and interfaces [67].Our preliminary calculations show that the on-lattice neural network model combined with high-throughput supercomputing may provide a solution.Fig. 7 shows the accuracy of the on-lattice model and calculated ion concentration profiles for a Pt (111)/Y 2 O 3 -doped ZrO 2 (yttria stabilized zirconia, or YSZ) slab model.YSZ is known as a well-studied oxide ion conductor for solid oxide fuel cell applications.The parent lattice is a cubic fluorite ZrO 2 with 68 Zr sites and 126 O sites, and configurations of 12 Y ions on Zr sites and 6 O vacancies residing on the O sites were sampled.The Pt slab contains 64 atoms.Although the attained accuracy of the on-lattice neural network model is not as good as the bulk systems introduced above, the root mean square error is < 2.5 meV/atom if we count all atoms (Pt, Zr, O, Y) and < 3.2 meV/atom if we only count the oxide atoms (Zr, O, Y), which is considered small enough to obtain sufficiently accurate thermodynamics.The calculated concentration profile at 1800 K, which corresponds to the sintering temperature, indicates segregation of Y dopants and O vacancies at the interface.The next challenge would be to consider the variation of O vacancy concentrations under varying oxygen partial pressures, which provide the chemical driving force for fuel cell operation.The free energy and grand canonical calculation methods implemented recently in abICS will enable such calculations.

Summary and Outlook
In this paper, we introduced the open source Python framework abICS for firstprinciples based statistical thermodynamics of lattice systems.A robust on-lattice machine learning energy model is trained against first-principles data through active learning iterations combined with parallel statistical sampling methods (RXMC or PAMC).This enables calculations on many-component many-sublattice systems and their interfaces, which are beyond the reach of conventional methods.We stress that abICS is still under heavy development.As already mentioned, grand canonical sampling is currently under testing.We are also considering the integration of various machine learning models which have been under heavy development in the past few years.Suggestions for new functionalities are welcome, and the modular object-oriented code structure should enable other developers to implement new functionalities and algorithms.With these developments, we are certain that abICS will contribute greatly to expanding the scope of first-principles materials modeling for understanding of the thermodynamics of complex materials.

Figure 1 .
Figure 1.A schematic of the framework implemented in abICS.

Figure 2 .
Figure 2. (a) Schematic figure of a unit cell comprised of a base structure and one defect sublattice, where three atom group types (a green atom, a vacancy, and an orange molecule) can occupy sites on the defect sublattice.(b) A schematic of Monte Carlo sampling on a 2 × 2 supercell of (a).The yellow shaded circles indicate updated sites.
An example of [sampling.solver]section is shown as follows:

Figure 3
Figure 3.The degree of inversion calculated using abICS on a 192-cation cell compared to the cluster expansion literature[53,54] and direct RXMC sampling in combination with first-principles calculation in a 48-cation cell[17].

Figure 5 .
Figure5.The NN energy model prediction vs. DFT energies in 22.2 at% Sc-doped BaZrO3 with varying levels of hydration.Reproduced from Ref.[52] with permission from the American Chemical Society.

Figure 6 .
Figure 6.Concentration of oxygen vacancies filled upon hydration at (a) 33% and (c) 66% in 22.2 at% Scdoped BaZrO3, and the occupancy change of each oxygen site at (b) 33% and (c) 66% hydration denoted by the size of the yellow spheres.Only the Zr (green)/Sc (purple) sublattice is shown in (b) and (d).Reproduced from Ref.[52] with permission from the American Chemical Society.

Figure 7 .
Figure 7. (a) Improvement of the on-lattice neural network model with active learning iterations and (b) Y and O vacancy concentration profiles in the calculated Pt/YSZ slab model.