Modeling of carbon dioxide solubility in ionic liquids based on group method of data handling

Due to industrial development, the volume of emitted carbon dioxide (CO2) is rapidly increasing. Several techniques have been used to eliminate CO2 from output gas mixtures, one of which is CO2 capture by ionic liquids (ILs). Computational models for estimating CO2 solubility in ILs are therefore of utmost importance. In this research, a white-box model in the form of a mathematical correlation is developed with the group method of data handling (GMDH), using the largest data bank in the literature. The study investigates GMDH as a computational approach for predicting CO2 solubility in different ionic liquids at temperatures below and above 324 K. In this regard, 4726 data points covering the solubility of CO2 in 60 ILs were used for model development. Moreover, seven different ionic liquids were selected to perform the external test. To evaluate the validity and efficiency of the suggested model, regression analysis was carried out on the actual and estimated target values. A proper fit between the experimental and predicted data was obtained, as shown by various figures and statistical parameters. It is also worth noting that negative values predicted by the proposed models are set to zero. The results of the established correlations were also compared with other models available in the ionic liquid literature. The final models obtained by the GMDH approach are two simple mathematical correlations, one for each temperature range, with temperature (T), pressure (P), critical temperature (Tc), critical pressure (Pc), and acentric factor (ω) as input parameters; they do not suffer from the black-box property of other neural network algorithms. The model suggested in this work is a promising and efficient predictor of CO2 solubility in ILs that can be used in different industries.


CONTACT Amir Mosavi amir.mosavi@mailbox.tu-dresden.de; Abdolhossein Hemmati-Sarapardeh hemmati@uk.ac.ir; Shahab S. Band shamshirbandshahaboddin@duytan.edu.vn, shamshirbands@yuntech.edu.tw

Introduction
In recent years, human activities have led to an unprecedented and still increasing use of fossil fuels with a notable carbon content. These fuels are regarded as the main source of greenhouse gas emissions. The well-known expression 'greenhouse gas' refers to particular atmospheric gases that are active in the thermal infrared range between 5.6 µm and 1 cm (Ai et al., 2005; Baghban et al., 2016; Deolalkar et al., 2015; Saeidi et al., 2014; Tuckett, 2016). Several gases present in the atmosphere can be considered members of the greenhouse gas family (Sahoo & Ray, 2006). Carbon dioxide, as one of them, can cause destructive changes in climate and must be removed from the gas mixtures released by greenhouse gas sources (Baghban et al., 2015). Because of global warming and the environmental problems caused by CO2 emission, new strategies and cost-effective technologies for removing acid gases from their sources will be essential in the future (Adger & Brown, 1994; Nash & Lumetta, 2011). Up to now, several methods and technologies have been introduced to remove carbon dioxide from the flue gas mixtures produced by fossil fuels. The removal chain consists of (1) selective capture of carbon dioxide from the gas mixture, (2) conversion of the pure carbon dioxide into the supercritical state by compression, and (3) transfer and injection of the captured and compressed CO2 into a permanent underground or submarine storage reservoir (Soltanian et al., 2019).
Carbon dioxide capture technologies, each with its own advantages and disadvantages, are divided into three categories: oxyfuel, pre-combustion, and post-combustion. In the oxyfuel method, pure oxygen is used for combustion instead of air. In the pre-combustion technique, CO2 is separated from the gas components before the combustion process. In contrast, in the post-combustion method CO2 is removed after combustion. Post-combustion capture, the most widely used carbon dioxide capture technology in industry, is itself classified into physical and chemical absorption, cryogenic (temperature reduction) separation, adsorption, and membrane techniques (Baghban et al., 2015; Liu et al., 2012; Mathieu, 2010). Among all of these technologies, absorption of carbon dioxide with various aqueous amine solutions has been broadly used over the past years (Baghban et al., 2015; Bougie & Iliuta, 2010; Mangalapally et al., 2012). The most commonly used amines in the CO2 capture process, each with its benefits and drawbacks, are monoethanolamine (MEA), N-methyl pyrrolidone (NMP), methyldiethanolamine (MDEA), diethanolamine (DEA), and piperazine (PZ) (Kuenemann & Fourches, 2017; Singh, 2011). Aqueous amine solutions have long been used to remove carbon dioxide from synthesis gas, natural gas, and refinery gas streams.
Besides their advantages, such volatile organic solvents have shown several detrimental effects, such as the generation of corrosive byproducts when degraded, equipment failure due to water accumulation during the desorption process, and high evaporation losses (Baghban et al., 2017; Jou et al., 1982; Kennard & Meisen, 1984; Rho et al., 1997; Speyer et al., 2010). The significance of the greenhouse gas problem calls for new, eco-friendly, and efficient technologies for removing carbon dioxide from flue gases. Moreover, it is vital to search for novel solvents that are appropriate for acid gas removal applications. Ionic liquids (ILs) have shown unique and desirable properties as solvents for the CO2 capture process. Unlike amine solutions, these solvents do not have damaging effects on the environment, and they benefit both industry and scientific research. ILs are composed of cations and anions, analogous to the ions that form an NaCl solution. In contrast with organic solvents, ILs offer low vapor pressure, the ability to dissolve diverse organic, inorganic, and organometallic compounds, high thermal conductivity and stability, strong polarity, and immiscibility with organic solvents. These properties have led to ILs being introduced as solvents for the production of compounds with efficient thermal conductivity, metal ion removal, acid gas capture, etc. Consequently, ILs can act as acceptable alternatives to other CO2 removal solvents. Numerous studies on ILs and their capability for carbon dioxide removal have been reported in the literature (Afzal et al., 2013; Ai et al., 2005; Baghban et al., 2017; Brennecke & Maginn, 2001; Costantini et al., 2005; Dash et al., 2011; S. Jang et al., 2010; Jou et al., 1982; Kennard & Meisen, 1984; Park et al., 2002; Rho et al., 1997; Sakhaeinia et al., 2010; Sánchez et al., 2007; Seddon, 1995; Shafiei et al., 2014; Shiflett et al., 2008; Shiflett & Yokozeki, 2006; Shin & Lee, 2008; Shokouhi et al., 2010; Song et al., 2010; Speyer et al., 2010; Tagiuri et al., 2014; Tatar et al., 2016; S. Zhang et al., 2005; Zhao et al., 2005).
Besides the complexities and difficulties of laboratory work, other drawbacks such as cost, time limitations, and the risks posed by experimental conditions must be considered. Efficient and precise models of phase behavior therefore make it easier to evaluate process conditions and related problems. Today, intelligent approaches such as artificial neural networks (ANNs) and fuzzy logic, in their various forms, are recognized as appropriate modeling tools and good predictors for phase behavior assessment (Eslamimanesh et al., 2011; Gaines, 1976). They support accurate statistical validation of experimental results without requiring prior assumptions about the input and output data. Indeed, such methods make it possible to build reliable models for nonlinear databases (Baghban et al., 2019; Faizollahzadeh Ardabili et al., 2018; Ghalandari et al., 2019). The artificial neural network, a branch of artificial intelligence, has been used extensively to relate the inputs and outputs of different data sources and to obtain acceptable models in diverse fields. ANNs have proved useful for modeling the solubility of different gases in ionic and non-ionic solutions (Baghban et al., 2019; Faizollahzadeh Ardabili et al., 2018; Ghalandari et al., 2019; Golzar et al., 2016; Mohanraj et al., 2015; Qiu-Hao & Yun-Long, 2006; Tatar et al., 2016; J. Zhang et al., 2016).
The present research pursues the same goal by modeling CO2 absorption in ionic liquids with a particular ANN method. In the literature, several ANN-type methods have been utilized for modeling carbon dioxide absorption in ILs, including the Adaptive Neuro-Fuzzy Inference System (ANFIS), Radial Basis Function Artificial Neural Network (RBF-ANN), Multi-Layer Perceptron Artificial Neural Network (MLP-ANN), Least Square Support Vector Machine (LSSVM), and Committee Machine Intelligent System (CMIS) (Ahmadi, 2012; Ahmadi, Ebadi, et al., 2013; Ahmadi & Shadizadeh, 2012; Baghban et al., 2015; Baghban et al., 2017; Barati-Harooni et al., 2017; Broomhead & Lowe, 1988; Eslamimanesh et al., 2011; Gaines, 1976; Heidari et al., 2016; Huang & Zhang, 1994; J.-S. Jang & Sun, 1995; Nilsson & Machines, 1965; Santos et al., 2013; Singh, 2011; Tatar et al., 2016). In these techniques, the tuning parameters are optimized with algorithms such as Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Imperialist Competitive Algorithm (ICA), coupled simulated annealing (CSA), etc. (Baghban et al., 2015; Golzar et al., 2016; Hemmati-Sarapardeh et al., 2019; Lashkarbolooki et al., 2013; Qiu-Hao & Yun-Long, 2006; Shafiei et al., 2014; Vapnik & Vapnik, 1998; Zadeh, 1965). In 2017, Baghban et al. employed an extensive dataset to model CO2 absorption in ILs, applying the RBF-ANN, MLP-ANN, LSSVM, and ANFIS techniques (Baghban et al., 2017). These methods are also known as artificial intelligence black-box models. A drawback of such opaque approaches is that the internal structure of the system is indeterminate, or need not be considered for particular purposes. Thus, in practice it is not possible to derive a logical and simple correlation between the input and output data (Vapnik & Vapnik, 1998).
In the present study, the group method of data handling (GMDH) was used to determine CO2 solubility in ionic liquids. This method is well suited to nonlinear and complicated problems and does not suffer from the black-box limitation of ANNs. In other words, GMDH, as a white-box method, makes it possible to inspect the inner components or logic of the system, resulting in a simple and comprehensible mathematical correlation between the inputs and the output (Sahoo & Ray, 2006; Vapnik & Vapnik, 1998; Zendehboudi et al., 2013). The dataset used here is similar to that employed in Baghban et al. (2017).

Data set preparation
Using an extensive and valid data bank is a vital stage in developing a proper and precise mathematical model. In this study, a wide CO2 solubility database (5368 data points) gathered from different references in the literature was employed. The database contains critical temperature, critical pressure, temperature, pressure, and acentric factor as input parameters, which are used to build two accurate models, one for each temperature range, that predict carbon dioxide solubility in ionic liquids as the target. The data points used here are similar to those used by Baghban et al. in their work (Baghban et al., 2017). A short description of the statistical properties of the data bank is presented in Table 1. Each ionic liquid is distinguished by its thermodynamic properties: critical temperature, critical pressure, and acentric factor. Moreover, to inspect the distribution of the input and output data points, histograms of the data were prepared and are illustrated in Figure 1. These diagrams confirm that the input and output variables are suitably distributed.

Group method of data handling
As mentioned before, different intelligent approaches have been introduced to tackle intricate computational problems. These techniques rely on various algorithms and can simplify complex nonlinear problems, resulting in accurate modeling outputs. The group method of data handling (GMDH), one of these intelligent algorithms, correlates the input variables and the corresponding target through an explicit modeling approach. This algorithm was first introduced by Ivakhnenko in 1972. Shankar identified the self-organizing feature of the GMDH approach in 1979 (Bunge, 1963). Thereafter, other researchers studied the method further and published their results. That research states that the GMDH method is the most efficient one for pattern recognition, function identification, and short- and long-term forecasting of random processes in complicated systems (Nakhaeizadeh, 1992). In other words, other modeling techniques were found to have significant limitations when compared with the GMDH approach. The method has been utilized in different fields of science and engineering such as weather modeling, multi-sensor signal processing, acoustic and ultrasonic emissions, environmental systems, eddy currents, medical diagnostics, marketing, X-ray, etc. (Bunge, 1963; Dargahi-Zarandi et al., 2017; Farlow, 1984; A. Ivakhnenko & Krotov, 1984; A. G. Ivakhnenko et al., 1979; Sawaragi et al., 1979; Vapnik & Vapnik, 1998). It should be noted that applications of GMDH in chemistry and the different branches of chemical engineering are rare, and few studies have been carried out in these fields. Some of these publications are cited here for readers who wish to gain additional information (Atashrouz et al., 2014; Atashrouz et al., 2015; Madala, 2019; Rostami et al., 2019; Varamesh, Hemmati-Sarapardeh, Moraveji, et al., 2017; Zendehboudi et al., 2012).
GMDH, also called a polynomial neural network, has a layered structure in which several independent neurons, or nodes, are joined to the primary structure. Every node in this structure is generated by combining quadratic polynomial equations of the independent neurons in the previous layer. Ivakhnenko presented the most appropriate form of the quadratic polynomial expressions applicable in the GMDH method. In this method, the input variables are related to the model output through the series introduced by Volterra-Kolmogorov-Gabor (Baghban et al., 2017). In the next step, a matrix of output variables is generated based on this series. Afterward, the input variables are combined pairwise using a quadratic polynomial, and the new variables (z_1, z_2, ..., z_n) created in this way replace the neurons of the prior layers. Consequently, the new matrix of variables takes the form v_z = (z_1, z_2, ..., z_n). The coefficients of the quadratic polynomial are determined by the least square method (LSM). The relation used to examine the similarity of the experimental and predicted data points is presented in Equation (3) (Hosseinzadeh et al., 2016; Madala, 2019).
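For orientation, the expressions referred to above are commonly written as follows in the GMDH literature. This is a sketch of the standard textbook forms, not necessarily the exact notation of the original equations: the coefficient symbols a and c, the candidate index j, and the pairing of inputs x_i and x_j are assumptions made here for illustration.

```latex
% Volterra-Kolmogorov-Gabor series relating M inputs x_1..x_M to the output y
y = a_0 + \sum_{i=1}^{M} a_i x_i
        + \sum_{i=1}^{M}\sum_{j=1}^{M} a_{ij}\, x_i x_j
        + \sum_{i=1}^{M}\sum_{j=1}^{M}\sum_{k=1}^{M} a_{ijk}\, x_i x_j x_k + \cdots

% Quadratic polynomial of a single neuron built from two inputs x_i and x_j
z = c_0 + c_1 x_i + c_2 x_j + c_3 x_i x_j + c_4 x_i^2 + c_5 x_j^2

% Least-squares criterion for the j-th candidate neuron over N_t training points
\delta_j^2 = \sum_{i=1}^{N_t} \left( y_i - z_i^{(j)} \right)^2
```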
In this equation, N_t and M signify the number of data points in the training class and the number of independent parameters, respectively. Then, the general matrix form is built from the quadratic polynomial, in which n is the number of input variables, T denotes the matrix transpose, and A represents the vector of quadratic polynomial coefficients. Real data sets include training and testing data. The training data are used to obtain the coefficients of the equation, whereas the testing data help to identify the best combination of two parameters. It must then be checked whether the model output and the experimental data satisfy an error condition: if δ_i² is lower than the prescribed error value, the new independent variable is retained; otherwise, it is eliminated. The error value is calculated and saved in each iteration, and the algorithm stops once the minimum error is reached.
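The procedure described above can be sketched in a few lines of Python. This is a minimal illustration of one conventional GMDH layer on synthetic data, not the authors' implementation: the function names, the `keep` parameter, and the use of test-set mean squared error as the selection criterion are assumptions.

```python
import itertools
import numpy as np

def quad_features(xi, xj):
    """Design matrix of one quadratic neuron: [1, xi, xj, xi*xj, xi^2, xj^2]."""
    return np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])

def fit_gmdh_layer(X_train, y_train, X_test, y_test, keep=3):
    """Fit one quadratic neuron per input pair; keep the `keep` best on test error."""
    candidates = []
    for i, j in itertools.combinations(range(X_train.shape[1]), 2):
        # Coefficients come from least squares on the training data only.
        A_train = quad_features(X_train[:, i], X_train[:, j])
        coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
        # The testing data decide which pair of inputs is retained.
        z_test = quad_features(X_test[:, i], X_test[:, j]) @ coef
        mse = float(np.mean((y_test - z_test) ** 2))
        candidates.append((mse, (i, j), coef))
    candidates.sort(key=lambda c: c[0])
    return candidates[:keep]

# Synthetic demonstration: the target depends only on inputs 0 and 2.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 4))
y = 1.0 + 2.0 * X[:, 0] * X[:, 2] - X[:, 2] ** 2
layer = fit_gmdh_layer(X[:160], y[:160], X[160:], y[160:])
best_mse, best_pair, best_coef = layer[0]
print(best_pair, best_mse)
```

In a full network, the outputs of the retained neurons would become the inputs of the next layer, and layers would be added until the test error stops decreasing.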

Hybrid group method of data handling
In the current study, the hybrid group method of data handling, a modified type of the GMDH method, was applied. This modified type was developed to overcome the restrictions of the ordinary GMDH method. The technique has two distinctive properties: first, more than two input variables can be combined at once, and second, nodes can be connected across different layers. The hybrid form of the GMDH method is expressed by Equation (8) (Hosseinzadeh et al., 2016; A. Shariati & C. J. Peters, 2004; Varamesh, Hemmati-Sarapardeh, Dabir, et al., 2017; Varamesh, Hemmati-Sarapardeh, Moraveji, et al., 2017), in which l denotes the number of layers. This form of GMDH allows more interplay between the correlating parameters and is therefore capable of treating complex systems (Varamesh, Hemmati-Sarapardeh, Dabir, et al., 2017).

Error checking
In order to evaluate the behavior and precision of the model suggested by the GMDH technique, several statistical criteria were used, including R-squared (R²), root mean square error (RMSE), and standard deviation (STD).
In the relations defining these criteria, S denotes solubility, and exp and pre signify experimental and predicted values, respectively. The statistical indices of the proposed model for CO2 solubility determination are reported in Table 4.
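As an illustration of two of these criteria, the sketch below computes R² and RMSE from experimental and predicted solubilities. It assumes the common definitions (R² as one minus the ratio of residual to total sum of squares, RMSE as the root of the mean squared residual); the paper's exact formulas, including its definition of the standard deviation, are not reproduced in this text, so the function names and sample values are purely illustrative.

```python
import numpy as np

def r_squared(s_exp, s_pre):
    """Coefficient of determination, assuming R^2 = 1 - SS_res / SS_tot."""
    s_exp = np.asarray(s_exp, dtype=float)
    s_pre = np.asarray(s_pre, dtype=float)
    ss_res = np.sum((s_exp - s_pre) ** 2)
    ss_tot = np.sum((s_exp - s_exp.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(s_exp, s_pre):
    """Root mean square error between experimental and predicted values."""
    s_exp = np.asarray(s_exp, dtype=float)
    s_pre = np.asarray(s_pre, dtype=float)
    return float(np.sqrt(np.mean((s_exp - s_pre) ** 2)))

# Made-up solubility values, for illustration only.
s_exp = np.array([0.10, 0.20, 0.30, 0.40])
s_pre = np.array([0.12, 0.18, 0.33, 0.37])
print(r_squared(s_exp, s_pre))  # approximately 0.948
print(rmse(s_exp, s_pre))       # approximately 0.0255
```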

Graphical error plots
Graphical diagrams are another suitable way to evaluate the efficiency of the model. Two types of plots are utilized in this study: the cross-plot and the error distribution diagram. In the cross-plot, the calculated data points are plotted against the experimental ones, which helps to assess how well the experimental data are predicted. In the error distribution diagram, the deviation of the error from a baseline, the zero error line, is illustrated. The zero error line serves as a criterion for model accuracy: points far from this line indicate a mismatch between the experimental and predicted values and hence lower model efficiency.

Results and discussion
The present research applies a novel approach for modeling CO2 solubility in ionic liquids based on temperature (T), pressure (P), critical temperature (Tc), critical pressure (Pc), and acentric factor (ω) as independent input parameters. As stated earlier, an extensive data bank of 4726 data points was utilized to construct and develop the model, and the hybrid GMDH algorithm was used for this purpose. The database was separated into two classes: 80% of the data points belong to the training set and the remaining 20% form the testing set. This classification was conducted by computerized random selection. In addition, 642 data points covering 7 different ionic liquids were set aside for the external test. Another point to be taken into account is that negative values predicted by the proposed models are set to zero. Figure 2 shows a schematic illustration of the proposed hybrid GMDH structure used to estimate CO2 solubility in ILs. Here, for temperatures lower than 324 K, the nodal network consists of single input and output layers and 5 middle layers. The ultimate output of this research is the set of correlations developed by the hybrid GMDH neural network for calculating CO2 solubility in ILs.
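The data handling steps described above (random 80/20 split, choice of correlation by temperature, and clipping of negative predictions to zero) can be sketched as follows. The fitted correlations themselves are not reproduced in this text, so `toy_low` and `toy_high` are hypothetical placeholders, and the choice to assign exactly 324 K to the upper-temperature model is an assumption.

```python
import numpy as np

T_SWITCH_K = 324.0  # temperature separating the two correlations (from the text)

def split_80_20(n_points, seed=0):
    """Randomly split point indices into 80% training and 20% testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_points)
    n_train = int(round(0.8 * n_points))
    return idx[:n_train], idx[n_train:]

def predict_solubility(T, P, Tc, Pc, omega, low_T_model, high_T_model):
    """Pick the correlation by temperature and set negative predictions to zero."""
    model = low_T_model if T < T_SWITCH_K else high_T_model
    return max(0.0, model(T, P, Tc, Pc, omega))

# Hypothetical stand-ins for the two fitted correlations (illustration only).
toy_low = lambda T, P, Tc, Pc, omega: -0.05
toy_high = lambda T, P, Tc, Pc, omega: 0.4

train_idx, test_idx = split_80_20(4726)
print(len(train_idx), len(test_idx))
print(predict_solubility(300.0, 1.0, 1000.0, 3.0, 0.3, toy_low, toy_high))
print(predict_solubility(350.0, 1.0, 1000.0, 3.0, 0.3, toy_low, toy_high))
```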
Furthermore, a second correlation was proposed in this research to estimate CO2 solubility at temperatures above 324 K. In both correlations, pressure and critical pressure are in MPa, temperature and critical temperature are in K, CO2 solubility is in mg/l, and the acentric factor is dimensionless. In addition, the middle layers represent the neurons, or virtual input parameters, utilized in the GMDH algorithm. These variables are themselves correlated with each other and/or with the actual parameters of the data bank. To evaluate the accuracy of the suggested models, error analysis was conducted with the related statistical criteria for the entire data bank. The results are presented in Table 2. As illustrated, the values of R² and RMSE for both the testing and training sets are remarkably close to 1 and 0, respectively. Moreover, the similarity between the training and testing results shown in this table indicates that the model is not overfitted. Baghban et al. (2017) developed a high-precision model for predicting CO2 solubility that reported less error than the model presented in this study. However, that model is a black box and its use requires special software, whereas the model presented in this research is a white box whose internal workings can be inspected. Table 3 presents a statistical comparison between the models proposed in this study and other models available in the CO2 solubility literature.
Consequently, owing to the features enumerated above, the hybrid group method of data handling has shown great performance in predicting CO2 solubility in ILs. In Figure 3, the testing and training data sets are drawn in a cross-plot of the model's outputs. According to this figure, a high concentration of both testing and training data points is observed near the unit slope line, meaning that the predicted and experimental data points match well. The efficiency of the model also follows from the values of R², which are near 1 for both the testing and training sets (R² test set = 0.90435, R² training set = 0.90595). Another significant diagram used to evaluate the quality of the proposed model is displayed in Figure 4. As demonstrated, a dense cluster of data points is located around the horizontal zero error line, which indicates the good precision of the developed hybrid GMDH model. Histograms of residuals are statistical tools for assessing model performance and indicate the discrepancy between the actual and predicted values. The distributions of residuals for the testing and training data points are displayed in Figure 5. As shown in this figure, the differences between the actual and estimated data points follow a normal distribution, as indicated by the symmetrical bell shape of the residual histograms. The accuracy of the final model is also supported by Figure 6, which compares the real and predicted target values for the entire data set, including the testing and training subsets. Based on this figure, the output values estimated by the suggested hybrid GMDH closely follow the trend of the actual CO2 solubility data points. Consequently, the claim of precise and efficient performance of the proposed modeling technique is confirmed.
In order to find out the effect of each input variable on the model output (CO2 solubility in ILs), a function known as the relevancy factor is used. This mathematical correlation helps to identify the input parameters with the highest impact on the output (Baghban et al., 2017; Yim & Lim, 2013; Zendehboudi et al., 2012). The relevancy factor varies between -1 and 1: a positive sign indicates a direct relationship between the input and the output, whereas a negative sign indicates an inverse one. A larger absolute value of the relevancy factor means that the output is affected by the corresponding input more than by the other inputs. The relevancy factor is calculated from the following relation:
$$ r(Inp_k, S) = \frac{\sum_{i=1}^{n} (Inp_{k,i} - \overline{Inp_k})(S_i - \bar{S})}{\sqrt{\sum_{i=1}^{n} (Inp_{k,i} - \overline{Inp_k})^2 \, \sum_{i=1}^{n} (S_i - \bar{S})^2}} \quad (12) $$
where Inp_{k,i} and \overline{Inp_k} are the ith value and the average value of the kth input, respectively, S_i is the ith value of predicted solubility, \bar{S} is the mean value of the solubilities, and n is the number of data points. k can be any of the input parameters: temperature, pressure, critical temperature, critical pressure, and acentric factor. The relevancy factor for each input variable (P, T, Pc, Tc, ω) was calculated by means of Equation (12). The obtained results are displayed in Figure 7. According to this figure, among the input parameters, pressure has the highest relevancy factor, followed by critical temperature. As a result, CO2 solubility in ILs depends on pressure more than on the other input variables. In addition, all of the input parameters except critical pressure have a positive relevancy factor, so that increasing any of them changes the output in the same direction. On the contrary, CO2 solubility in ILs shows the opposite behavior with respect to critical pressure; in other words, CO2 solubility decreases as critical pressure increases. The estimated output results signify the high quality of the proposed hybrid GMDH method.
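The relevancy factor described above has the form of a Pearson correlation coefficient between one input and the predicted solubility, and can be computed as in the following sketch. The function name and the sample values are illustrative assumptions, not data from the study.

```python
import numpy as np

def relevancy_factor(inp_k, s):
    """Relevancy factor between one input series and the predicted solubility."""
    inp_k = np.asarray(inp_k, dtype=float)
    s = np.asarray(s, dtype=float)
    d_inp = inp_k - inp_k.mean()   # deviations of the k-th input from its mean
    d_s = s - s.mean()             # deviations of the solubility from its mean
    return float(np.sum(d_inp * d_s) / np.sqrt(np.sum(d_inp**2) * np.sum(d_s**2)))

# Made-up values: solubility rising with pressure, falling with critical pressure.
pressure = [1.0, 2.0, 3.0, 4.0]
critical_pressure = [4.0, 3.0, 2.0, 1.0]
solubility = [0.1, 0.2, 0.3, 0.4]
r_p = relevancy_factor(pressure, solubility)
r_pc = relevancy_factor(critical_pressure, solubility)
print(r_p, r_pc)
```

A value near +1 corresponds to the direct relationship reported for pressure, and a negative value to the inverse relationship reported for critical pressure.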
To determine outliers in the data source, the Williams plot was applied, in which standardized residuals are plotted against hat values. Hat values are the diagonal elements of the hat matrix, which is formulated as follows:
$$ H = X (X^{T} X)^{-1} X^{T} $$
where X is the m × n matrix of inputs, and m and n are the sample size and the number of model variables, respectively. Moreover, the following relation is used to obtain the leverage limit:
$$ H^{*} = \frac{3(n + 1)}{m} $$
The Williams plot for the present study is shown in Figure 9, in which doubtful and reliable data points are identified. In this figure, the data points enclosed between the standardized residual bounds (y = -3 and y = 3) and the leverage limit are considered valid, while those outside this domain are regarded as outliers. Experimental data identified as outliers by the leverage approach are listed in Table 4. As reported in this table, 236 of the 5368 data points in the data bank fall into the outlier category. To investigate the effect of the input variables on the performance of the GMDH model, Figure 10 plots the laboratory data and the predictions of this study against pressure, one of the input parameters, for two ionic liquids ([bmim][Tf2N] and (Broomhead & Lowe)). The laboratory data show that carbon dioxide solubility increases with increasing pressure, and the model presented in this research likewise reports an increasing trend for all data in the figure.
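The leverage approach can be sketched as follows on synthetic data. The hat values are the diagonal of X (XᵀX)⁻¹ Xᵀ; the leverage limit 3(n + 1)/m and the standardization of residuals by their sample standard deviation are commonly used conventions and are assumptions here, as are all function names.

```python
import numpy as np

def hat_values(X):
    """Diagonal of the hat matrix H = X (X^T X)^-1 X^T."""
    X = np.asarray(X, dtype=float)
    xtx_inv = np.linalg.pinv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, xtx_inv, X)

def leverage_limit(m, n):
    """Warning leverage for m samples and n model variables (assumed 3(n+1)/m)."""
    return 3.0 * (n + 1) / m

def flag_outliers(X, residuals, z_limit=3.0):
    """Flag points outside |standardized residual| <= z_limit or above the leverage limit."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    r = np.asarray(residuals, dtype=float)
    std_res = (r - r.mean()) / r.std(ddof=1)
    return (np.abs(std_res) > z_limit) | (hat_values(X) > leverage_limit(m, n))

# Synthetic demonstration with one high-leverage point and one large residual.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 3))
X_demo[0] = [25.0, 25.0, 25.0]      # far from the bulk of the inputs
res_demo = 0.1 * rng.normal(size=50)
res_demo[5] = 5.0                   # poorly predicted point
flags = flag_outliers(X_demo, res_demo)
print(np.flatnonzero(flags))
```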

Conclusions
In the current study, the performance of a particular smart algorithm for estimating CO2 solubility in 67 ionic liquids as a function of temperature (T), pressure (P), critical temperature (Tc), critical pressure (Pc), and acentric factor (ω) was examined. For this purpose, a comprehensive data source (5368 data points), similar to that used by Baghban et al. (2017), was employed in the CO2 solubility modeling process. In the present research, 4726 data points covering 60 ionic liquids were used to develop two models, and 642 data points covering 7 ionic liquids were selected for the external test. The two models were developed for temperatures below and above 324 K. Baghban et al. utilized the LSSVM, ANFIS, MLP-ANN, and RBF-ANN intelligent methods to predict CO2 solubility in different ionic liquids. In comparison with these methods, the hybrid GMDH neural network used in this research showed acceptable accuracy, as demonstrated by the calculated statistical parameters. The R-squared (R²) value obtained for the suggested GMDH model was 0.90435, a little lower than the values of the aforementioned modeling algorithms (R² GMDH = 0.92, R² LSSVM = 0.9942, R² ANFIS = 0.91685, R² MLP-ANN = 0.9715, R² RBF-ANN = 0.9789). Nevertheless, the actual values and the outputs of the proposed model match suitably, which indicates the efficient performance of the GMDH algorithm in predicting CO2 solubility in ILs. In addition, the output of the GMDH algorithm is a simple and explicit mathematical relation. This property is the principal advantage of this technique over the other artificial neural network methods, which operate as black-box systems.
Consequently, the model obtained in the present research can serve as a dependable tool that helps technologies such as CO2 sequestration and gas purification obtain an acceptable prediction of CO2 solubility in various ionic liquids when few experimental data points are available.
For , very few laboratory data are available in the literature. It is suggested to measure the CO2 solubility in these ionic liquids over a wider range of temperature and pressure. In general, the temperature range covered for the ionic liquids in this study is 271.11 to 453.15 K and the pressure range is 0.0089 to 100.12 MPa; future studies are encouraged to measure CO2 solubility in these ionic liquids outside these ranges. Furthermore, the model can be used within the temperature and pressure ranges over which it was developed for the 67 ionic liquids mentioned; outside these ranges it will probably report a high error.
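A simple guard can make the stated applicability range explicit before either correlation is evaluated. This is a minimal sketch using the temperature and pressure limits quoted above; the function and constant names are illustrative assumptions.

```python
# Ranges of the data used to develop the models (from the text above).
T_RANGE_K = (271.11, 453.15)
P_RANGE_MPA = (0.0089, 100.12)

def in_fitted_range(T, P):
    """Return True when (T, P) lies inside the range covered by the data bank."""
    return (T_RANGE_K[0] <= T <= T_RANGE_K[1]
            and P_RANGE_MPA[0] <= P <= P_RANGE_MPA[1])

print(in_fitted_range(300.0, 5.0))
print(in_fitted_range(500.0, 5.0))
```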