Experimental design for the highly accurate prediction of material properties using descriptors obtained by measurement

ABSTRACT In materials science, both controllable and uncontrollable descriptors can be used to characterize materials. Examples of controllable descriptors include the composition of elements and fabrication processes; in contrast, uncontrollable descriptors are generated from experimental data characterizing particular samples, such as raw spectral data or specific gravity. In this study, we consider an experimental design to obtain a highly accurate prediction model in which the uncontrollable descriptors of materials are the features and the material properties are the labels. In general, as uncontrollable descriptors are more closely related to material properties, predictions based on them will be more accurate. The goal of the experimental design in the present study is not the improvement of the material properties as such but the prediction of these properties. To realize this design, we select appropriate controllable descriptors for the synthesis of the candidate material such that the prediction accuracy improves when the corresponding uncontrollable descriptors and material properties are added to the training data. We propose two experimental design methods, one based on Bayesian optimization and the other on uncertainty sampling. Using a polymer database in which controllable descriptors, uncontrollable descriptors, and mechanical properties are recorded, we confirm that our methods can select an appropriate candidate material to train a highly accurate prediction model in which the material properties are predicted by uncontrollable descriptors. Our proposed methods can be applied to materials development in cases where uncontrollable descriptors are more easily obtained by experiments than the target material properties; they will also be useful for extracting the relationship between the structure and properties of a material.


Introduction
Various successful applications of machine learning have been reported in materials science [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18]. In particular, prediction using machine learning regression models is powerful for materials science because the properties of materials that have not yet been synthesized can be predicted without actual synthesis and measurement. To successfully conduct research using machine learning predictions, it is important to construct a highly accurate regression model from a training dataset. In general, the accuracy of a regression model depends strongly on the amount of training data. However, in materials science, obtaining data requires a significant amount of time and incurs high costs. Thus, the challenge is to create accurate prediction models with as little training data as possible.
A reliable strategy for obtaining an accurate prediction model is feature selection, which searches for important features that increase prediction performance [19][20][21][22]. In this study, however, we focus on another strategy: active learning, which selects new data points to be added to the training dataset to improve the prediction accuracy. Here, each new data point is selected from a set of candidate materials that is prepared in advance and whose members have not yet been synthesized. Uncertainty sampling is a useful form of active learning: the data point with the highest uncertainty is selected to improve the prediction accuracy [23][24][25][26][27]. For example, the uncertainty can be taken as the standard deviation of the predicted value, as evaluated by a method such as Gaussian process regression.
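As an illustration of the conventional uncertainty sampling described above, the following minimal sketch selects the candidate with the largest predictive standard deviation of a Gaussian process. It uses scikit-learn; the kernel choice, the function name `select_most_uncertain`, and the synthetic one-dimensional data are assumptions made only for this example and are not taken from the study.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


def select_most_uncertain(X_train, y_train, X_candidates, random_state=0):
    """Return the index of the candidate with the largest GP predictive std."""
    # Kernel is an assumption for this sketch (RBF plus a small noise term).
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                  random_state=random_state)
    gp.fit(X_train, y_train)
    # return_std=True gives the predictive standard deviation per candidate.
    _, std = gp.predict(X_candidates, return_std=True)
    return int(np.argmax(std)), std


# Synthetic toy data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 1.0, size=(8, 1))
y_train = np.sin(6.0 * X_train[:, 0])
# One candidate inside the training range, one far outside it.
X_candidates = np.array([[0.5], [5.0]])

best_index, std = select_most_uncertain(X_train, y_train, X_candidates)
print("selected candidate:", best_index, "predictive std:", std)
```

Note that this selection only requires the feature values of the candidates, which is why it presupposes features that are known before synthesis.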
However, uncertainty sampling is not always possible for problems in materials science. A characteristic of materials science is that two types of material descriptors can be used as features when training a regression model (see Figure 1). The first type is controllable descriptors, which can be chosen in advance without synthesizing the material, such as the composition of elements and fabrication processes. The second type is uncontrollable descriptors, which are generated from experimental data, such as raw spectral data and the specific gravity of samples, and are thus unknown without synthesis. If the features used for the regression model are limited to controllable descriptors, a new point that improves the prediction accuracy can be selected by conventional uncertainty sampling. This is because, when the features are controllable descriptors, their values are known in advance, and thus the uncertainty (as evaluated by a specific machine-learning prediction) is well defined even if the material has not actually been synthesized. On the other hand, when uncontrollable descriptors are used as features, the situation changes completely: as the candidate dataset cannot be prepared for uncontrollable descriptors without synthesis and measurements, selection based on uncontrollable descriptors by conventional uncertainty sampling is impossible.
This study seeks to obtain a highly accurate prediction model of material properties from uncontrollable descriptors with as few experiments as possible. In general, the dimensions of uncontrollable descriptors generated by experimental data can be increased by performing various measurements, and uncontrollable descriptors are more closely related to material properties in many cases. Therefore, using uncontrollable descriptors should improve prediction accuracy. In addition, if uncontrollable descriptors are obtained more easily than the target material properties, difficult-to-measure experimental data can be predicted from easy-to-measure ones. This model can be used in various ways (e.g. materials optimization and repurposing of materials), as will be discussed in Sec. 5. Furthermore, using the prediction model, the relationship between structure expressed by uncontrollable descriptors and properties of a material can be understood.
Figure 1. Examples of controllable and uncontrollable descriptors, and material properties. In this study, we develop an experimental design for obtaining an accurate prediction model where material properties are predicted by uncontrollable descriptors; i.e. we design a method to suggest controllable descriptors for synthesis and measurement to improve the prediction accuracy.
The methodology of the experimental design proposed in this study is as follows. First, we prepare a candidate dataset that is constructed using only controllable descriptors. When a controllable descriptor is selected, the uncontrollable descriptors and target material properties are obtained by synthesizing the material and conducting measurements. Then, we train a regression model to predict the target property from the uncontrollable descriptors without using the controllable ones. Next, we select appropriate controllable descriptors from the candidate dataset such that the accuracy of the prediction model using the uncontrollable descriptors will be high. Unfortunately, the conventional uncertainty sampling method cannot be used directly in this step and thus development of a new method is necessary. We propose two methods for using uncontrollable descriptors in active learning, one based on Bayesian optimization and one based on uncertainty sampling. To demonstrate the performance of these methods, a polypropylene dataset is used in this study.

Problem establishment
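The controllable and uncontrollable descriptors are represented by $x_i \in \mathbb{R}^{d}$ and $x'_i \in \mathbb{R}^{d'}$, respectively; $y_i \in \mathbb{R}$ is the target material property. First, we prepare a dataset constructed using only the controllable descriptors of the $N$ candidate materials; this is expressed as $D_c = \{x_i\}_{i=1,\dots,N}$. The controllable descriptors of the $M$ materials that have already been synthesized and measured are denoted as $D_{c,M}$. To improve the prediction performance, we select the next appropriate candidate material characterized by the controllable descriptors from the remaining dataset $D_c \setminus D_{c,M}$, where the number of data points is $N - M$. The dataset for the controllable descriptors and target properties, when the number of data points is $M$, is denoted as $D_{t,M} = \{x_k, y_k\}_{k=1,\dots,M}$; the corresponding dataset for the uncontrollable descriptors and target properties is denoted as $D'_{t,M} = \{x'_k, y_k\}_{k=1,\dots,M}$.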

Bayesian optimization-based method
In this section, we introduce a method based on Bayesian optimization. The following steps are taken to select the $(M+1)$-th data point using this scheme. First, the training dataset, $D'_{t,M}$, is divided randomly into a set of $M - L$ data and a set of $L$ data, which are denoted as [...] $D_{t,M}$ would improve this approximation. However, the Gaussian process regression has $L$ training data, so decreasing $L$ will worsen its prediction performance. Hence, $L$ must be tuned to balance these two effects. The pseudo-code of this Bayesian optimization-based experimental-design method (BOED) is shown in Algorithm 1, and the schematic of BOED is shown in Figure 2, where the $M$ data points are the known data.

Algorithm 1 Bayesian optimization-based experimental design (BOED) method
Prepare candidate dataset: [...]
Evaluate the prediction accuracy when the number of training data is $M + 1$
End for
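The text of this section only states that the $M$ known data are split at random into $M - L$ and $L$ points and that a Gaussian process is trained on $L$ points. The sketch below is therefore one possible reading of a single BOED-style selection step, not the authors' algorithm. The following are all assumptions introduced for illustration: the "gain" target (reduction in mean squared error on the known data when one held-out point is added to the $M - L$ training points), the upper-confidence-bound acquisition with parameter `kappa`, the elastic net hyperparameters, the function names, and the synthetic data.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.linear_model import ElasticNet


def fit_error(X_fit, y_fit, X_eval, y_eval):
    """MSE on (X_eval, y_eval) of an elastic net trained on (X_fit, y_fit)."""
    model = ElasticNet(alpha=0.01, max_iter=10000).fit(X_fit, y_fit)
    return float(np.mean((model.predict(X_eval) - y_eval) ** 2))


def boed_step(X_ctrl, X_unc_known, y_known, known_idx, L, rng, kappa=1.0):
    """Suggest the next candidate index (assumed reading of one BOED step)."""
    M = len(known_idx)
    perm = rng.permutation(M)
    probe, base = perm[:L], perm[L:]  # L held-out points, M - L base points
    base_err = fit_error(X_unc_known[base], y_known[base],
                         X_unc_known, y_known)
    # Assumed target: error reduction when one held-out point is added.
    gains = []
    for j in probe:
        aug = np.append(base, j)
        err = fit_error(X_unc_known[aug], y_known[aug], X_unc_known, y_known)
        gains.append(base_err - err)
    # GP maps controllable descriptors of the L points to their gains.
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                  random_state=0)
    gp.fit(X_ctrl[np.asarray(known_idx)[probe]], np.asarray(gains))
    known_set = set(known_idx)
    remaining = [i for i in range(len(X_ctrl)) if i not in known_set]
    mean, std = gp.predict(X_ctrl[remaining], return_std=True)
    # Assumed acquisition: upper confidence bound on the predicted gain.
    return remaining[int(np.argmax(mean + kappa * std))]


# Synthetic stand-in for a database (hypothetical values only).
rng = np.random.default_rng(0)
N = 30
X_ctrl = rng.uniform(size=(N, 3))
X_unc = X_ctrl @ rng.normal(size=(3, 5)) + 0.05 * rng.normal(size=(N, 5))
y = X_unc @ rng.normal(size=5) + 0.05 * rng.normal(size=N)

known = [int(i) for i in rng.choice(N, size=10, replace=False)]
next_idx = boed_step(X_ctrl, X_unc[known], y[known], known,
                     L=len(known) // 3, rng=rng)
print("next candidate to synthesize:", next_idx)
```

In this reading, a smaller `L` leaves more points for the base regression model but fewer points for the Gaussian process, which mirrors the trade-off described in the text.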

Uncertainty sampling-based method
In the alternative method based on uncertainty sampling, the $(M+1)$-th data point is selected using the following steps. First, the Gaussian process regression is trained using dataset $D_{t,M}$, which contains $M$ training data. Second, the single data point with the largest deviation evaluated by the trained Gaussian process is selected from $D_c \setminus D_{c,M}$; this step is inspired by conventional uncertainty sampling. Third, by experimentally obtaining the uncontrollable descriptors and material properties for the selected data, the number of known data becomes $M + 1$. Thus, the datasets are updated to $D_{c,M+1}$, $D_{t,M+1}$, and $D'_{t,M+1}$. We train a regression model using the training dataset $D'_{t,M+1}$, where the uncontrollable descriptors are features and the material properties are labels. Then, we evaluate the prediction accuracy of this regression model.
This method selects the $(M+1)$-th data point by choosing the data point that has the highest uncertainty under the trained Gaussian process regression, in which the material properties are predicted by controllable descriptors. The original goal, however, is to select the most uncertain data for a regression model in which the material properties are predicted by uncontrollable descriptors. If there is a correlation between the controllable and uncontrollable descriptors, this algorithm is expected to be successful. The pseudo-code of this uncertainty sampling-based experimental design method (USED) is shown in Algorithm 2.

Algorithm 2 Uncertainty sampling-based experimental design (USED) method
[...]
Evaluate the prediction accuracy when the number of training data is $M + 1$
End for
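The three steps of the uncertainty sampling-based method can be sketched as follows. This is a minimal illustration and not the authors' implementation: the Gaussian process kernel, the elastic net hyperparameters, the function name `used_select`, and the synthetic arrays standing in for the database are assumptions. The "experiment" is emulated by looking up precomputed synthetic values.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.linear_model import ElasticNet


def used_select(X_ctrl, y_known, known):
    """Pick the unknown candidate with the largest GP predictive std.

    The GP predicts the property from controllable descriptors only.
    """
    known_set = set(known)
    remaining = [i for i in range(len(X_ctrl)) if i not in known_set]
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                  random_state=0)
    gp.fit(X_ctrl[known], y_known)
    _, std = gp.predict(X_ctrl[remaining], return_std=True)
    return remaining[int(np.argmax(std))]


# Synthetic stand-in for a database (hypothetical values only).
rng = np.random.default_rng(0)
N = 30
X_ctrl = rng.uniform(size=(N, 3))                      # controllable
X_unc = X_ctrl @ rng.normal(size=(3, 5)) + 0.05 * rng.normal(size=(N, 5))
y = X_unc @ rng.normal(size=5) + 0.05 * rng.normal(size=N)

known = [int(i) for i in rng.choice(N, size=10, replace=False)]
for step in range(5):
    # Steps 1-2: GP on controllable descriptors, select the most uncertain.
    new = used_select(X_ctrl, y[known], known)
    # Step 3: "measure" the selected material (here: a table lookup).
    known.append(new)
    # Train the actual model: uncontrollable descriptors -> property.
    model = ElasticNet(alpha=0.01, max_iter=10000).fit(X_unc[known], y[known])
    mse_all = float(np.mean((model.predict(X_unc) - y) ** 2))
    print(f"M = {len(known)}, MSE on all data = {mse_all:.4f}")
```

The selection uses only `X_ctrl`, so it never needs the uncontrollable descriptors of unsynthesized candidates; those enter only after the lookup that stands in for synthesis and measurement.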

Target dataset
To demonstrate the efficiency of the methods proposed in Section 2, a polypropylene dataset with 75 data points is used. The target material properties are the mechanical properties obtained from the Charpy impact test and the tensile modulus. The three controllable descriptors for the polymers are the molecular weight, mmmm in tacticity, and the injection-cooling temperature. The uncontrollable descriptors are prepared from differential scanning calorimetry (DSC), wide-angle X-ray diffraction (WAXD), pulse nuclear magnetic resonance (pulse-NMR), the skin layer observed by polarizing optical microscopy, and specific gravity. These measured data include information about the higher-order structure of the polymers.
In this study, we verify the efficiency for the case where the mechanical properties can be predicted from uncontrollable descriptors. This is because, if this prediction model is poor, appropriate selection is impossible. Thus, some uncontrollable descriptors are selected in advance. Specifically, the Pearson correlation coefficient between each descriptor derived from the experimental data and the mechanical property is evaluated, and the top 10 descriptors are extracted for each mechanical property. Two investigations are executed, using the 1st to 5th and the 6th to 10th uncontrollable descriptors in descending order of correlation, respectively. In Section 4, the results of the experimental design for the case of three-dimensional controllable descriptors and five-dimensional uncontrollable descriptors are shown.
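The descriptor pre-selection described above can be sketched with NumPy as follows. The text does not state whether descriptors are ranked by the signed or the absolute Pearson coefficient; ranking by absolute value is an assumption here, as are the function name `top_correlated` and the synthetic data.

```python
import numpy as np


def top_correlated(X_unc, y, n_top=10):
    """Rank descriptor columns by |Pearson r| with y; return indices and r."""
    Xc = X_unc - X_unc.mean(axis=0)
    yc = y - y.mean()
    # Pearson r per column: covariance over product of standard deviations.
    r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt(
        (Xc ** 2).sum(axis=0) * (yc ** 2).sum()
    )
    order = np.argsort(-np.abs(r))[:n_top]  # assumption: rank by |r|
    return order, r


# Synthetic example: 75 samples, 20 hypothetical descriptors,
# with column 4 constructed to be strongly related to the property.
rng = np.random.default_rng(0)
X_unc = rng.normal(size=(75, 20))
y = 3.0 * X_unc[:, 4] + 0.1 * rng.normal(size=75)

order, r = top_correlated(X_unc, y, n_top=10)
first_set, second_set = order[:5], order[5:10]  # 1st-5th and 6th-10th
print("1st to 5th:", first_set, "6th to 10th:", second_set)
```

The two index sets then define the two five-dimensional feature spaces used in the two investigations.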

Comparison of our algorithms and random sampling
We execute the experimental designs described above using the proposed methods and the polymer database. In our implementation, we use the Python library scikit-learn [28]. Elastic net regression is used when the mechanical property is predicted by uncontrollable descriptors. In the Supplementary Information, the results of the Gaussian process regression and partial least-squares (PLS) regression are shown. First, we address the performance of regression models when all 75 data points are used as the training data. Figure 3 shows scatter plots of the real and predicted mechanical properties of the test data depending on the features used, i.e. the controllable, 1st to 5th uncontrollable, and 6th to 10th uncontrollable descriptors. The test data is prepared using leave-one-out cross validation, and the prediction results are plotted. The mean squared error (MSE) evaluated for these test data, which corresponds to the leave-one-out error (LOOE), is indicated in Figure 3. In addition, the coefficient of determination, $R^2$, for the whole dataset is evaluated and is also shown in Figure 3. The prediction accuracy for both metrics is better with the 1st to 5th uncontrollable descriptors than with the 6th to 10th descriptors. For the Charpy impact test, the controllable descriptors are better suited as features than uncontrollable descriptors; on the other hand, the uncontrollable descriptors exhibit a higher prediction accuracy for the tensile modulus. These prediction accuracies are the upper bounds of the prediction performance in our experimental design.
Figure 4 shows the prediction accuracies, which are defined by the LOOE calculated using the MSE and $R^2$ for predictions of the whole dataset (i.e. all 75 data points), depending on the iteration step, $m$, of the experimental design when the training data is $D'_{t,M}$. Here, the number of initial data points is 10, and these initial data are randomly selected from $D_c$. Then, in each iteration step $m$, the number of training data points is $M = m + 10$. In addition, 100 runs of the experimental design with different initial choices are performed and the prediction accuracies are averaged. The results using BOED and USED are compared with those obtained when the next candidate material is randomly selected from the remaining candidates. For BOED, we consider the case where the hyperparameter $L$ is set to $M/3$ rounded to an integer, so that the value of $L$ increases with the number of steps.
For the Charpy impact test, our proposed methods obtain highly accurate prediction models with a smaller amount of training data than random sampling, for both accuracy metrics. This conclusion is the same regardless of which descriptor set (i.e. 1st to 5th or 6th to 10th) is used. On the other hand, for the tensile modulus, our proposed methods give better results than random sampling when using the 1st to 5th uncontrollable descriptors; however, when using the 6th to 10th descriptors, our methods do not improve the accuracy. Because the tensile modulus is difficult to predict using the 6th to 10th uncontrollable descriptors (see Figure 3 (b)), appropriate selection does not work well in this case. Thus, our proposed methods are more useful when an accurate prediction is possible. Furthermore, USED is better than BOED for predicting the tensile modulus. Together with the results for the Charpy impact test, this indicates that USED is generally effective for the prepared polymer database. Possibly there is some correlation between the controllable and uncontrollable descriptors in the database, allowing USED to work well. From these results, we conclude that both experimental design methods are effective in creating a better prediction model with a small amount of training data, provided that an accurate model can be achieved.
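The leave-one-out evaluation used for the all-data baseline can be sketched with scikit-learn as follows. The elastic net hyperparameters (`alpha`, `l1_ratio`), the function name `loo_scores`, and the synthetic data are assumptions; the study does not report these settings in the text above. Computing $R^2$ on the pooled leave-one-out predictions is likewise an assumption of this sketch.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict


def loo_scores(X, y, alpha=0.01, l1_ratio=0.5):
    """Return (LOOE as MSE, R^2) from pooled leave-one-out predictions."""
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000)
    # Each sample is predicted by a model trained on all other samples.
    y_pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
    return mean_squared_error(y, y_pred), r2_score(y, y_pred)


# Synthetic stand-in: 75 samples with five hypothetical descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(75, 5))
w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w + 0.05 * rng.normal(size=75)

looe, r2 = loo_scores(X, y)
print(f"LOOE (MSE) = {looe:.4f}, R^2 = {r2:.4f}")
```

Applying the same function to the controllable descriptors and to each set of uncontrollable descriptors gives directly comparable scores per feature set.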

Hyperparameter dependency of the BOED
In this section, we address the dependence of the prediction accuracies on the hyperparameter $L$ in the experimental design. As explained in Section 2.2, finding the optimal value of $L$ involves a trade-off between the approximation quality (i.e. the distance of $D'_{t,M-L}$ from $D'_{t,M}$) and the number of training data in the Gaussian process. In Figure 5, we show the prediction accuracies depending on the iteration step when $L$ is set to the integers corresponding to $M/2$, $M/3$, $M/4$, and $M/5$. In almost all cases, a smaller value of $L$ gives a better accuracy. For $L = M/2$, which gives the largest number of training data in the Gaussian process, the worst accuracy is obtained. Thus, in the trade-off, the improvement of the approximation is more important than the increase in the amount of training data. The number of initial data is set at 10 and thus $L$ cannot be reduced further. If the amount of data is increased, it is expected that better results will be obtained by controlling $L$.

Discussion and summary
We proposed methods for conducting experimental designs to improve the prediction accuracy of a regression model in which the properties of a material are predicted by descriptors obtained by measurement. In other words, this study considered how to treat uncontrollable descriptors, which are generated from experimental data such as raw spectra and specific gravity and are thus unknown without synthesis, as features for prediction models in active learning. The methodology of the study is as follows. We focus on the situation in which, once the controllable descriptors (such as composition and processes) are determined, the corresponding uncontrollable descriptors and target material properties are obtained by synthesis and measurements. Then, the material to be synthesized, expressed by its controllable descriptors, is selected so as to improve the accuracy of the regression model that predicts the material property from uncontrollable descriptors.
Two methods (based on Bayesian optimization and uncertainty sampling, respectively) were proposed, and their efficiencies were addressed. We used the polypropylene database, with the Charpy impact test and tensile modulus as target properties. Our proposed experimental designs could appropriately select the next candidate material to train a highly accurate prediction model in which the material properties are predicted by the uncontrollable descriptors. With a small number of training data, better predictions were realized than by random sampling. We found that the proposed methods work better when regression models with better prediction performance can be achieved. In the main text, the results of the elastic net regression were presented; in Figures S1-S6 in the Supplementary Information, the results of the Gaussian process and PLS regression are shown. We confirm that our experimental design also works well when these regression methods are used. Note that although we used only uncontrollable descriptors as features for the prediction model in this study, our method can also be applied to create a better prediction model using both controllable and uncontrollable descriptors.
If the uncontrollable descriptors are obtained more easily than the target material properties, the prediction of difficult-to-measure experimental data from easy-to-measure ones can be achieved by our experimental design, accelerating materials design and reducing the cost of materials development. For example, materials optimization can be conducted by using our prediction model in the determination of target material properties. Note that such materials optimization can also be performed using only controllable descriptors. However, there are many more uncontrollable than controllable descriptors, and uncontrollable descriptors are more closely related to material properties in many cases. Therefore, using uncontrollable descriptors allows better prediction of material properties and more effective optimizations. Also, if an accurate prediction model can be obtained, materials can be repurposed, and databases including uncontrollable descriptors (such as existing XRD databases) can be screened using the prediction model to discover materials having desired target properties.
From another perspective, using the prediction model and collecting data by our experimental design can lead to deeper understanding of materials. This is true even in the case that both uncontrollable descriptors and material properties are difficult to measure. For example, the origin of materials' properties can be clarified by extracting relations between them and the materials' structure expressed by uncontrollable descriptors.
In addition, because our method can reduce the number of experiments required for a good prediction model, it is well-suited for use with laboratory automation technologies [29][30][31], which have recently attracted much attention. Furthermore, even when target material properties themselves cannot be measured robotically, materials optimization can be conducted automatically by replacing the determination of the target material property with the prediction obtained by our experimental design.
This study is preliminary in character: it proposes the method and demonstrates its usefulness using a specific polymer database. Future aims include executing a real experimental design without the use of a database and generating a dataset experimentally with the help of our proposed methods. To consider a case in which a highly accurate prediction model could be obtained, we selected and used uncontrollable descriptors that were known in advance to have strong correlations with the target property. However, when designing actual experiments, useful uncontrollable descriptors might not be known ahead of time, and thus further improvement of the method is necessary for practical use.
Materials are characterized by both controllable and uncontrollable descriptors, and the need to predict properties with high accuracy from the uncontrollable ones arises frequently throughout materials science. We believe that our proposed method, even if it is not yet optimal, points to one way in which this problem can be solved, and we expect our study to stimulate the development of methods better suited to it.