True Gaussian mixture regression and genetic algorithm-based optimization with constraints for direct inverse analysis

ABSTRACT In molecular, material, and process design, Gaussian mixture regression (GMR) can directly predict values of explanatory variables x such as experimental and process conditions from target values of objective variables y such as properties and activities of materials. In GMR, a prediction result of a Gaussian mixture distributions is provided as a probability density distribution. The weighted average of the means of Gaussians has been used as the representative value of Gaussian mixture. In this study, true GMR has been proposed, where the predicted value of the GMR model is calculated by solving a continuous optimization problem to obtain the maximum value of the probability density function (PDF). As the continuous optimization problem cannot be solved when variables contain discrete values, a genetic algorithm is used to search for a solution that satisfies the constraints, while maximizing the value of the PDF. Numerical simulations using nonlinear functions between y and x, and the case studies using real material datasets confirmed that the proposed methods produced solutions with drastically higher values of the PDF than the traditional methods, especially when x is constrained in the inverse analysis. The Python codes of the proposed methods are freely available. Direct inverse analysis with GMR using the proposed methods is expected to improve the efficiency of molecular, material, and process design. Graphical abstract


Introduction
The designs of molecules, materials, and processes using machine learning have become common in recent years. Desirable molecules, materials, and processes are designed based on model y = f(x), which is constructed between explanatory variables x and objective variables y using molecular, material, and process data, respectively. Prediction of y values from x values is known as forward analysis of the model, and conversely, prediction of x values from y values is known as inverse analysis of the model. Molecular, material, and process design correspond to inverse analysis.
In general inverse analysis, a large number of samples of x are generated, while y values are predicted by inputting them into the model, and further, an appropriate sample is selected based on the predicted y values. More precisely, it is pseudo-inverse analysis, in which forward analysis is conducted multiple times. Bayesian optimization [1,2] in adaptive design of experiments is also a pseudo-inverse analysis, except that predicted y values are changed to values of an acquisition function. Optimization algorithms, such as a genetic algorithm (GA) [3] can be used to efficiently search for x values that result in a desired prediction of a y value and a maximum value of acquisition function.
However, using generative models constructed using Gaussian mixture regression (GMR) [4] and generative topographic mapping regression (GTMR) [5] allow us to directly predict x values from y values, which would be a direct inverse analysis, and not a pseudo-inverse analysis. In models constructed using GMR and GTMR, all relationships between x and y, i.e. joint probability distribution, are represented as a mixture of Gaussian distributions. Based on the joint probability distribution, the posterior probability density distribution of x can be calculated, which is also represented as a mixture of Gaussians, for given y values using multiplication theorem on probability and Bayes' theorem. The direct inverse analysis using GMR exhibited more efficiency in adaptive design of experiments, than Bayesian optimization [4]. GMR and GTMR are equipped to handle multiple y variables [6], and GTMR can visualize a dataset simultaneously, along with forward and inverse analyses of the model.
In GMR and GTMR, x and y are handled equally as variables, and thus, prediction of x values from y values, and prediction of y values from x values are performed in the same manner. Yamada et al. investigated signal deconvolution methods using a shorttime Fourier transform and a non-negative tensor/ matrix factorization as well as methods for predicting NMR signals and physical properties using GTMR [7]. Wang et al. predicted continuous tool wear process using GMR [8]. Yao and Ge used GMR for oxygen content prediction of a nonlinear multimode primary reformer [9]. Ye et al. proposed a closed-loop framework of robot learning through demonstration based on the bagging method of GMR to obtain a robust learner of robot learning, with a high-precision reproduction [10]. Zheng et al. used GMR for modeling and optimizing the cement calcination process for reducing NOx emissions [11]. Kaneko proposed extended GMR [12], and latent variables-based GMR [13] to improve the predictive ability of forward and inverse analyses of GMR models.
While predicting y values from x values and x values from y values, they are both represented with mixture of Gaussians as joint probability distributions of y and x, respectively. To provide predicted 'values' for x and y, we are essentially required to calculate x and y values that have the highest probability density of the Gaussian mixture, respectively. However, in actuality, we simply use the weighted average of the mean of each Gaussian using weights as the predicted values of x and y. These values do not necessarily indicate the highest value of the probability density function (PDF) or the maximum value of the conditional probability density function. Therefore, in this study, true GMR is proposed to search for the predicted values of x and y, such that the value of the PDF is maximized. The PDF obtained using the Gaussian mixture is used as a nonlinear function to solve the continuous nonlinear optimization problem. This enables the values to maximize the true probability density of the GMR model while predicting y values from x values in forward analysis as well as the x values with the highest probability of achieving target y values, to be predicted in inverse analysis.
In inverse analysis, there can exist constraints on x. For example, for mole fractions, the sum of all the mole fractions must equal to one. Additionally, upper and lower limits exist for temperature and pressure, depending upon durability and safety of an equipment. Based on experimental conditions, discrete variables can be included in x, for example, presence or absence of an additive. True GMR can be applied by solving the continuous nonlinear optimization problem described before, when the constraints are upper and lower limits, and can be represented using linear and nonlinear equations; however, especially when discrete variables exist in x, the continuous nonlinear optimization problem cannot be solved. In this study, a search for x values with high values of PDF through utilization of a metaheuristics algorithm, especially GA, has been proposed. In GA, by preparing chromosomes to satisfy constraints and calculating the fitness of PDF, x values can be searched with a high value of PDF, thus satisfying the constraints on x. The proposed method is known as GA-based optimization with constraints using GMR (GAOC-GMR).
In this study, the proposed methods have been validated using complex nonlinear functions that facilitate calculation of y from x. Following the construction of the GMR model, the effectiveness of the proposed methods was confirmed by comparing values of the PDF for x values obtained using the traditional methods, with those for x values obtained using the proposed methods.

Method
Following the explanation of GMR and the traditional methods of calculating values predicted using the GMR model, the proposed methods of true GMR and GAOC-GMR have been described.

Gaussian mixture regression
GMR is a regression analysis approach based on the use of Gaussian mixture model (GMM) [4,[12][13][14]. It constructs a model that expresses a dataset by superimposing multiple Gaussian distributions.
For a sample x ϵ R m (where m is the number of variables in x), the probability distribution function p (x) of the GMM is given as follows: where, n is the number of Gaussians, and μ k and Σ k are the mean vector and variance-covariance matrix of the kth Gaussian, respectively. π k is the weight of the kth Gaussian, provided as follows: The expectation-maximization algorithm is used to maximize log-likelihood, and μ k , Σ k , and π k can be determined. The log-likelihood function is given as follows: where, N is the number of samples.
In GMR, the joint probability distribution of x and y is determined by calculating the GMM using a dataset that combines sample y ϵ R p of y (p is the number of y variables) and sample x of x. Based on the multiplication theorem of probability and Bayes' theorem, the y values can be predicted by calculating the posterior probability distribution p(y|x) when x is given; conversely, the x values can be predicted by calculating the posterior probability distribution p(x|y) when y is given. In other words, direct inverse analysis of the model is possible.
By explicitly separating x and y in Equation (1) of the GMM, the joint probability distribution of x and y is given as follows: where, μ x,i and μ y,i are the mean vectors, and Σ xx,i and Σ yy,i are the variance-covariance matrices of x and y for the ith Gaussian, respectively, while Σ xy,i and Σ yx,i are the covariance matrices of x and y for the ith normal distribution, respectively. Estimating x from y corresponds to determining the posterior probability distribution p(x|y) of x for a given y, and p(x|y) can be transformed as follows using the multiplication theorem of probability and Bayes' theorem: where, p(x|y, μ y,i , Σ yy,i ) is the multivariate Gaussian distribution of the estimated x value for the ith Gaussian and w y,i is the weight of this multivariate Gaussian distribution. For p(x|y, μ y,i , Σ yy,i ), the mean vector m i (y) and the variance-covariance matrix S i (y) are given as follows: In the above explanation, swapping x and y allows y to be estimated from x. Because GMR does not distinguish between x-variables and y-variables, GMR can handle both multiple x-variables and multiple y-variables for both regression and inverse analyses. In this study, the GMM calculation applied the Gaussian Mixture [15] from the scikit-learn library; the GMR calculation used DCEKit [16]. The hyperparameters in GMR are the number of Gaussian, and the type of variance-covariance matrix. In addition, there exist two traditional methods of estimating y from x (or x from y): 1) weighted mean: calculate the weighted mean using w y,i in Equations (6), and 2) max mean: use the mean of the Gaussian that maximizes w y,i in Equation (6).

True GMR
The prediction result of GMR is given as a PDF, which is a mixture of Gaussians described in Equations (5)- (8). To obtain predicted 'values' from the PDF, weighted mean and max mean for mixture of Gaussians given in section 2.1 were used traditionally. However, both, weighted mean and max mean do not consider the variance-covariance matrix of Equation (8). Furthermore, weighted mean and max mean do not provide x and y values at which the PDF in Equation (5) attains its maximum value.
In the proposed true GMR, the PDF in Equation (5) is handled as a continuous nonlinear function, and x and y values that provide the maximum value of the PDF are solved as a continuous nonlinear optimization problem. Figure 1 shows the basic concept of true GMR. The red, blue, and green lines denote probability density of three Gaussians whose weights are 0.20, 0.41, and 0.39, respectively, while the black line denotes probability density of a Gaussians mixture, which implies the true probability density of GMR given as Equation (5). Although, weighted mean and max mean do not follow the probability density, true GMR indicates the highest value of the probability density. Figure 1 confirms the performance of true GMR.
To solve a continuous nonlinear optimization problem, scipy.optimize.minimize [17] was used in Python. In this study, sequential least squares programming [18], an iterative solution method for nonlinear optimization, has been used as a continuous optimization method. Sequential least squares programming solves a sequence of optimization subproblems, each of which optimizes a quadratic model of the objective subject to a linearization of the constraints. Since the weighted mean and max mean exist as the initial values, the problem is solved using each, and the solution with the largest probability density between the two results is adopted.
The proposed method can be used to find a solution if x variables have continuous values and the constraints are upper and lower bounds, and can be represented by linear and nonlinear equations, if constraints on x exist. However, true GMR cannot be applied if variables include a discrete variable. For example, true GMR cannot be used in the presence or absence of an additive (0 or 1) in polymer polymerization, and to change the temperature by only 1 K or 5 K due to equipment limitations.

Genetic algorithm-based optimization with constraints using GMR (GAOC-GMR)
It is proposed to acquire x and y values that provide the maximum value of the PDF in Equation (5) using GA, which is a metaheuristic inspired by the natural selection process. GA prepares several chromosomes representing candidates for x and y values as genes, and searches for solutions by selecting chromosomes with high fitness values preferentially, and repeating operations, such as crossover and mutation. In this study, the fitness in GA is defined as the logarithm of Equation (5). Various constraints on x in inverse analysis of GMR model can be considered by devising chromosomes in GA. The following constraints on x can be considered:  (2) are dummy variables, such as presence or absence of an additive, washing, and pretreatment of raw materials. All categorical variables can be represented using dummy variables. Variables with values other than 1 can be converted to 0 or 1 by scaling. In (3), when considering mole fractions of raw materials while mixing several raw materials, it is necessary to total them up to 1. Some sets of variables can have different total values. As given in (4), the total value of several variables are required to be in a certain range. When a categorical variable with multiple categories is transformed into dummy variables, only one of the dummy variables is required to be 1 and the others must be 0, as in (5). When m 6 raw materials are selected for use, only m 6 variables would have nonzero values, while the others would be 0, as in (6). In (7), while preparing raw materials, it is necessary to round off the values, as the raw materials can only be adjusted to the m 7 th decimal place, and temperature and pressure can only be changed in certain units depending upon equipment. Therefore, the problem is categorized as discrete optimization, or combinatorial optimization.
In this study, the above constraints on x and y are considered while generating chromosomes in GA. Figure 2 shows the flow of setting up the chromosome in GAOC-GMR. The initial chromosomes are generated with a uniform random number between the lower and upper limits for each variable. In the initial chromosomes and the chromosomes generated in the GA operation, the values are first converted to 0 or 1 by rounding off in constraint 2). Next, in constraint 5), among the values of the chromosomes corresponding to each set of the assigned variables, up to m 5 variables in descending order, of the values are converted to n 5 , while the rest of the variables in the set are converted to 0. Further, in constraint 6), among the values of the chromosomes corresponding to each set of the assigned variables, up to m 6 variables in descending order of the values remain the same, while the rest of the variables in the set are converted to 0. Next, in constraint 3), the values of the chromosomes corresponding to each set of the assigned variables are totaled, and divided by the sum, and further multiplied by m 3 . Finally, in constraint 7), the values are rounded off to the nearest digit. The fitness value of each chromosome that does not satisfy the constraints 1) and 4) is set to a very low value such as −10 10 , while the fitness values are calculated for the other chromosomes. Next, x values that provide the maximum value of the PDF in Equation (5) are determined in GA. This method can be used for discrete optimization, or combinatorial optimization. In this study, GAOC-GMR was run 30 times to avoid settling on a local optimum solution, and the x with the highest PDF value was used.

Results and discussion
To test the performance of the proposed true GMR and GAOC-GMR, forward and inverse analyses were conducted assuming complex nonlinear functions between x and y to be a system, where y can be obtained as a result of experiments based on x. The nonlinear functions are based on the Rastrigin function [20] and Griewank function [21]. These are represented as follows: y ¼ 1 4000 where, d is the number of variables in x, and x i is the ith x.
In this study, the following six case studies were conducted:  The number of Gaussians is 20, and the type of variance-covariance matrix is 'full' in scikit-learn's sklearn.mixture.GaussianMixture [15] in GMR, and further, 1000 samples generated by random numbers were used to construct the GMR model for each case study.
There was no significant difference between the estimation results of weighted mean, max mean, and true GMR in forward analysis of the GMR models. In the supplementary material, estimation results using weighted mean, max mean, and true GMR in case studies 1, 2, and 3 are shown in Table S1, and the plots of actual y values and estimated y values in case studies 1, 2, and 3 are shown in Figures S1, S2, and S3, respectively. It was confirmed that true GMR can estimate y values with the same level of accuracy as the traditional methods, in forward analysis.
Then, inverse analysis of the GMR model was conducted to compare the performance of the traditional methods with that of the proposed methods. In case studies 1 and 4, the target y values were set to 200, 300, 400, 500, 600, and 700, where 200 and 700 indicated the extrapolation domain of y. In case studies 2 and 5, the target y values were set to 100, 300, 500, 700, 900, 1100, 1300, and 1500, where 100 and 1500 indicated the extrapolation domain of y. In case studies 3 and 6, the target values of (y1, y2) were set to (100, 100), (200, 300), (300, 700), (400, 500), (500, 800), (600, 1200), and (700, 1400), where (100, 100) and (700, 1400) indicated the extrapolation region of y. Tables 1-3 show logarithm of probability density for each target y value in case studies 1, 2, and 3, respectively. True GMR obtained x values with a probability density higher than that of max mean. Furthermore, for certain target y values, true GMR obtained x values with a probability density higher than that of weighted mean. However, for the other multiple target y values, the probability density of the proposed method was the same as that of weighted mean and max mean. This can be attributed to the fact that when the weights are concentrated in a Gaussian distribution, the probability   density of max mean is the global maximum, which in itself is not an issue. However, continuous nonlinear optimization was more complicated in the case studies due to the large number of variables, such as 25 and 30 variables in true GMR. GAOC-GMR improved the values of the probability density of weighted mean, max mean, and true GMR for certain target y values. In true GMR, reasonable x values of weighted mean and those of max mean are used as initial values, and thus, x values with a high probability density are obtained stably. However, the global optimal solution is not guaranteed. Even when true GMR attains a local optimal solution, GAOC-GMR might obtain x values with a higher probability density. However, as GAOC-GMR is a heuristic method, the probability density may be lower than that of true GMR, in which case it is better to use the result of true GMR. It was confirmed that both true GMR and GAOC-GMR can be used to perform inverse analysis of the GMR model by predicting x values that result in a stably high probability density value.
The results of inverse analysis of the GMR models for case studies 4, 5, and 6, where x is constrained, are shown in Tables 4-6, respectively. In the cases of weighted mean and max mean, following the prediction of x values, they were set to be the closest to the constraints on x. From Tables 4-6, weighted mean and max mean resulted in much lower values of probability density due to the constraints on x. However, using GAOC-GMR, the values of probability density were improved even when the constraints on x existed. Furthermore, even when the y values were in the extrapolation region, x values with high probability density values were obtained, and the inverse analysis was stable. It was confirmed that the proposed method could be successfully used to perform inverse analysis of the GMR models.
Further, to test the performance of the proposed method, inverse analysis was conducted for the real material datasets. The datasets of thermoelectric conversion (ZT) materials [22] and superconducting (Tc) materials [23,24] were used. In Tc, the number of samples is 21263, x is the fraction of each metal element, and y is the critical temperature. In ZT, the number of samples is 1394, x is the fraction of each metal element, and y is the efficiency of thermoelectric conversion [25], which is calculated as: where, κ, S, and σ are the thermal conductivity, the Seebeck coefficient, and the electrical conductivity, respectively, while T is the average temperature of the material proportional to conversion efficiency. Fivefold cross-validation was performed to determine the hyperparameters for GMR. The number of Gaussian was 28 and the type of variance-covariance matrix was 'spherical' in scikit-learn's sklearn.mixture.GaussianMixture [15] in ZT, and the number of Gaussian was 29 and the type of variance-covariance matrix was 'spherical' in Tc. The purpose of this study is to determine a large probability density in the probability distribution that is obtained through the prediction results of forward and inverse analyses in GMR, and not to improve the predictive ability of GMR. The predictive ability of GMR in ZT and Tc has already been discussed in the literature [6,12,13]. Figure 3 shows histograms of ZT and Tc. The target values for ZT and Tc were set to 1.5 and 150, respectively, and inverse analysis of the GMR model was conducted.  The sum of the fractions of metal elements was constrained to be one, and the fraction of each metal element was constrained to have values rounded off to the third decimal place. To set constraints on x, the number of metals should be set first. In this study, the number of metal element types used was changed to 2, 3, 4, 5, 6, 7, 8, 9, and 10, and the fraction of the metals were designed with GAOC-GMR. Tables 7 and 8 show logarithm of probability density for the target y value in ZT and Tc, respectively. GAOC-GMR successfully improved the values of probability density for all numbers of metal elements in both, ZT and Tc, even when the y values were in the extrapolation region. It was confirmed that GAOC-GMR could be used to perform inverse analysis of GMR model by predicting x values that result in a stably high probability density value.

Conclusions
In this study, we developed the methods for accurate calculation of predicted 'values' for y and x in forward and inverse analyses of GMR models, respectively, while the prediction results were given by Gaussian mixtures. Although predicted values were conventionally approximated by weighted mean and max mean, we proposed true GMR, a method in which the point with the highest value of the PDF is solved using continuous nonlinear optimization and used as the prediction result. Furthermore, when discrete variables exist, GAOC-GMR can be used to search for the point that maximizes the PDF by appropriately combining chromosomes with other constraints, and using the value of the PDF as the fitness in GA. True GMR can be executed only when constraints on variables are upper and lower bounds and represented as linear and nonlinear equations; however, GAOC-GMR can be executed with any constraints, including discrete variables. The effectiveness of the proposed methods was verified using complex nonlinear functions between x and y, where x was assumed to be the experimental conditions and y was assumed to be the experimental results, and real material datasets. It was confirmed that true GMR could provide solutions with higher values of the probability density, than traditional methods. Furthermore, GAOC-GMR possesses the ability to search for much higher values of the probability density even when various constraints are present, and target y values are in the extrapolation region. In true GMR, reasonable x values of weighted mean and those of max mean are used as initial values, and thus, x values with a high probability density are obtained stably. However, a global optimal solution is not guaranteed. Even when true GMR falls into a local optimal solution, GAOC-GMR would obtain x values with the higher probability density. However, since GAOC-GMR is a heuristic method, the probability density can be lower than that of true GMR, in which case it is better to use the result of true GMR. Therefore, both, true GMR and GAOC-GMR can be used to perform inverse analysis of the GMR model by predicting x values that result in a stably high probability density value.
Direct inverse analysis with GMR using the proposed methods is expected to improve the efficiency of molecular, material, and process design. The Python codes for true GMR and GAOC-GMR are available at https://github.com/hkaneko1985/dcekit.

Disclosure statement
No potential conflict of interest was reported by the author(s).