Modeling natural gas compressibility factor using a hybrid group method of data handling

The natural gas compressibility factor indicates the compression and expansion characteristics of natural gas under different conditions. In this study, a simple second-order polynomial method based on the group method of data handling (GMDH) is presented to determine this critical parameter for different natural gases at different conditions, using corresponding state principles. The accuracy of the proposed method is evaluated through graphical and statistical analyses. The method shows promising results considering the accurate estimation of natural gas compressibility. The evaluation reports 2.88%of average absolute relative error, a regression coefficient of 0.92, and a rootmeans square error of 0.03. Furthermore, the equations of state (EOSs) and correlations are used for comparative analysis of the performance. The precision of the results demonstrates the model’s superiority over all other correlations and EOSs. The proposed model can be used in simulators to estimate natural gas compressibility accurately with a simple mathematical equation. This model outperforms all previously published correlations and EOSs in terms of accuracy and simplicity. ARTICLE HISTORY Received 14 August 2019 Accepted 9 October 2019


Introduction
The increasing demand for oil and coal as energy and the technological and environmental concerns associated with its production and consumption have drawn attention toward natural gas. The natural gas consumption generates fewer pollutants and greenhouse gases (Kamari, Hemmati-Sarapardeh, Mirabbasi, Nikookar, & Mohammadi, 2013;Shateri, Ghorbani, Hemmati-Sarapardeh, & Mohammadi, 2015). Critical insight into the behavior of natural gas is important to the reservoir and the chemical engineering calculations that deal with gas as one of the main phases. Among the parameters of evaluating the behavior of natural gas, the compressibility is of utmost importance in determining the natural gas phase behavior. Gas compressibility (Z-factor) represents the proportion of volume a given amount of gas to the ideal volume of it at the common conditions and defined pressure and temperature. Gas compressibility makes the difference between ideal gas and real gas. The following relationship is generally used to calculate gas CONTACT Shahaboddin Shamshirband shahaboddin.shamshirband@tdtu.edu.vn compressibility (McCain, 2017): Where V real (V) and V ideal denote the real and ideal gas volumes, respectively. Furthermore, R and z represent the universal gas constant, and the gas compressibility factor, respectively. T, P, and n represent the gas temperature, pressure, and some moles, respectively. At low pressure and temperature conditions, gas molecules have fewer interactions and collisions, and behavior can be considered ideal. However, at a higher pressure or temperature, the collision and interaction of the molecules increase. This phenomena has to be taken into account when making predictions for gas expansion or contraction (Firoozabadi, 1999;Shateri et al., 2015).
There are various techniques to measure the compressibility factor. One main way is by performing compression-expansion experiments. Overall, the experimental measurement of the compressibility factor appeared to be an accurate approach compared to all other approaches. However, they are generally slow, cumbersome, and costly. Also, it is reported not feasible to conduct an experiment for every single condition considering various pressure and temperature at which the compressibility is needed. Using equations of state (EOS) is another approach to determine the compressibility factor. When utilizing EOS, the reservoir characteristics are being employed. Generally, these equations come from the following form when the gas compressibility factor is the target PVT parameter (McCain, 2017): where a, b, and c donate the empirical constants of composition functions for temperature, pressure, and gas. Furthermore, Z denotes the gas compressibility factor. Even though these equations are advantages and their implementation can facilitate the measurement of other gas properties such as enthalpy, entropy, and Gibbs free energy, they are usually implicit higher-order equations that require intense computations. Besides the complex computations, the binary interaction coefficients used in some EOS's need to be measured by conducting experiments that may not be practical. Further, it has been shown that these equations are not suitable for predicting hydrocarbon gas properties (Elsharkawy, 2004). Empirical correlations are another source of determining gas compressibility factor, which is easy and fast to use but is generally associated with erroneous predictions (Kamyab, Sampaio, Qanbari, & Eustes, 2010;Kumar, 2004;Sanjari & Lay, 2012b). A minor estimation error in the compressibility factor of correlations would lead to a false prediction of formation, density and the amount of gas. Therefore, the development of fast, user-friendly, and accurate models to predict the compressibility factor is critical.
Several researchers have attempted to develop methods to estimate the compressibility factor. For instance, Standing and Katz (1942) advanced a graphical technique based on pseudo-reduced. Van Der Waals and Rowlinson (2004) was one of the pioneers of EOS methods by taking into account the intramolecular forces and volume of molecules. Using Van der Waals EOS for determining gas compressibility factor leads to higher accuracy compared to the empirical approach introduced by Katz and Standing. There is a number of researches contributed to the development of reliable EOS's (Soave, 1972).
A general expression for the PVT relationship of fluids has the following form (Shateri et al., 2015): An expression for gas compressibility factor can be written by rewriting the above equation and implementing the equation for gas compressibility factor as follows: Where A, B U, and W are dimensionless parameters that are extracted from the current temperature, composition, and pressure. For the purpose of advancing more efficient, yet faster methods, adequate researchers have developed correlations that can be explicitly used to address the problem. In 1973, Hall and Yarborough (1973) transformed the graphical chart of Katz and Standing into a relatively simple correlation by fitting their correlation to the chart and determining the correlation coefficients. Beggs and Brill (1973) also employed Katz and Standing chart and developed a correlation to estimate the gas compressibility factor. Dranchuk, Purvis, and Robinson (1973) used an EOS developed by Benedict-Webb-Rubin (Simon & Briggs, 1964) and proposed a gas compressibility correlation in 1974. Abu-Kassem joined Dranchuk in 1975 to develop an analytical equation to decrease the gas density which had been utilized to calculate the gas compressibility factor (Dranchuk & Abou-Kassem, 1975). In 1975, Gopal collected multiple correlations for the gas compressibility factor at various conditions (Gopal, 1977a). Kumar (2004) introduced a novel model for gas compressibility factor to be used by Shell. In 2010, Heidaryan et al. used multiple regression analysis (Heidaryan, Moghadasi, & Rahimi, 2010), and in the same year, Azizi employed genetic programming (Azizi, Behbahani, & Isazadeh, 2010). A comprehensive study of the mentioned methods was conducted by Sanjari and Lay (2012a) in which the performance of the methods mentioned above has been investigated.
In this study, a novel supervised approach, namely the group method of data handling (GMDH) was proposed as a robust model to predict the compressibility factor. To do this, a comprehensive data bank of compressibility factor was used considering a diverse range of pressure, temperature, and composition. Several statistical quality measures and graphical methods were utilized to calculate the model performance. Furthermore, for the evaluation of the proposed graphical and the statistical methods, the root mean square error (RMSE), average absolute percent error (AAPRE), regression coefficient (R 2 ), average percentage relative error (APRE), cross-plot, and error distribution curves have been used. Additionally, the performance of previously published well-known correlations and EOSs was investigated and compared to the proposed model.

Data acquisition
The reliability of any intelligent model is dependent on the data bank that has been utilized for the purpose of model training and testing. Here, various range of pressure, temperature, and gas composition conditions were used to ensure the development of a valid model that can determine the gas compressibility factor. Two dimensionless parameters of pseudo-reduced for temperature and pressure are studied to be used in the developed model for predicting gas compressibility. These parameters are calculated from the current pressure and temperature, as well as the pseudo critical pressure and temperature calculated as follows: where T pc and P pc provide the values of pseudo critical temperature and pressure, respectively. Furthermore, P pr and T pr are pseudo-reduced pressure and temperature.
For gas with multiple components, P pc and T pc are calculated from the critical temperature and pressure of the individual components as follows: In these equations, T ci and P ci provide the values for the critical temperature and pressure of component i. Also, y i describe the mole fraction of component i. The data sets in this article gathered from the previously published research and the online repositories, e.g. (Elsharkawy & Foda, 1998;McLeod, 1968;Robinson & Jacoby, 1965;Simon & Briggs, 1964;Whitson & Torp, 1981;Wichert & Aziz, 1972). These data were also used in our previous publications (Kamari et al., 2013;Shateri et al., 2015). Further, Table 1 represents the statistical details of the data bank used. As the table demonstrates, the pressures, temperatures, and compositions comprise a comprehensive range, ensuring that the developed model based on this data set would be a reliable predictor for various types of natural gases at different conditions. Nine hundred seventy-eight data points were used in this study. Eighty percent of the data was used for training and model development and the remained 20% was used to check model performance and validity. The data were divided into training and testing sets randomly.

Evaluating the model performance
There are multiple statistical parameters that are employed to assess the performance of a model. The parameters that were used in this work include APRE, AAPRE, RMSE, and R 2 . A simple presentation of the mentioned parameters is presented here: 1. APRE (E r %): where E i stands for the relative variation of predicted value from an experimental value expressed as Percentage Relative Error: 2. AAPRE: 3. RMSE: 4. R 2 : In these formulas, O is the mean value of experimental data output. Another approach to evaluate the performance of a model and compare it with other models is the usage of graphical error analysis. The graphical approaches used in this study are cross-plot, frequency vs. absolute relative error, error distribution, and trend analysis curves. Cross-plots are utilized to assess the performance of a model in which the estimated data by the model are plotted against the experimental values, and one can observe the accuracy of the model depending on how close the trend is to a unit-slope line that crosses the origin. Further, the cumulative frequency of data points against the absolute relative error is plotted to quantify the number of data points that can be accurately predicted by the model. Besides, the error distribution curve was investigated to identify the error trend of the model when an independent variable is increased. The proximity of data points to the zero-error line tests the precision of that model. Finally, a trend analysis is performed to investigate whether or not the developed model can accurately estimate the trend of gas compressibility factor at different pressures.

Model development
GMDH, every two independent parameters are coupled with a quadratic polynomial expression and form M 2 new variables as follows: And the new matrix can be represented by the new variables as follows: In the next step, the least square method is utilized to reduce the difference between the actual data and the model predictions as presented below: In this equation, the quantity of data points in the training set is shown by Nt. In the next step, the general matrix is written as follows: Writing the general matrix in the above form helps with offering a general formulation to determine the unknown quadratic polynomial coefficients as shown below: In the later stage, the data set is divided into subsets of testing and training, the model coefficients are obtained during the training stage, and the testing set is utilized to determine the best combination of the two independent variables based on the following condition: The combined variables will be stored if this criterion is met, otherwise the algorithm eliminates this combination of two variables and the iteration will continue. More information about this modeling approach can be found in our previous work (Hemmati-Sarapardeh & Mohagheghian, 2017). This method was recently used for modeling gas compressibility factor of Iranian gas reservoirs (Shariaty, Khorsand Movaghar, & Vatandoost, 2019).

Results and discussion
The GMDH has been successfully implemented as an evolutionary intelligent model to predict the natural gas's compressibility. The inputs of the model were gas composition, pressure, and temperature, as shown in Table 1.
In addition to C 1 -C 6 and C 7+ components, H 2 S, CO 2, and N 2 gases were considered as the components. The pressure and temperature values are utilized to estimate the pseudo-reduced pressure and temperature as discussed previously. During the data acquisition stage, a wide range of input parameters was considered as shown in Table 1. The pressure ranges from 154 to 7026 Psia, reservoir temperature ranges from 40 to 300°F, and the compressibility factor values cover a wide range of 0.4-1.24. The distribution of input and output data is illustrated in Figure 1. In addition to the data distribution, normal distribution curves were plotted in this figure. Pseudo reduced pressure and temperature were the chosen parameters for this figure due to their impact on any reservoir fluid properties. As can be seen, all three data sets follow a relatively normal shape, especially the gas compressibility factor data. The mean values for the pseudo-reduced temperature, pseudo-reduced pressure, and gas compressibility factor are 1.5, 3.9 and 0.9, respectively. The bin size in all cases is 40. The model's schematic flowchart is illustrated in Figure 2. In order to assess the accuracy of the developed model, various statistical and graphical methods such as average absolute relative error and error distribution curve were employed as will be discussed in this section. After optimizing the model, the genome and nodal expressions were obtained as follows:  (25) where T pr and P pr represent the pseudo-reduced temperature and pressure, respectively, and N 2 -N 6 represents the virtual variables or nodes of the model. Here, the equations are second-order polynomials that are utilized to predict the gas compressibility factor. Considering the first equation, the gas compressibility factor will be estimated through realization of the pseudo-reduced temperature, and the virtual parameters N 2 and N 4 and. N 4 can be calculated by knowing pseudo-reduced pressure and by calculating the virtual parameter N 6 . N 6 is a simple function of temperature and pseudo-reduced pressure and can be directly calculated. In order to calculate N 2 , the virtual parameters N 3 and N 5 are needed in addition to N 6 . N 3 can be calculated after calculating the virtual parameter N 4 . Finally, N 5 is directly calculated similarly to N 6 by having pseudo-reduced pressure and temperature. The graphical techniques and statistical quality measures were applied to evaluate the accuracy of the developed model. Furthermore, EOSs are used for further evaluation and comparison. Five EOSs are considered, i.e. Van der Waals, Threfall, Adair, and Roth (1873), Lawal-Lake-Silberberg (Lawal, 1999), Peng and Robinson (1976), Soave-Redlich-Kwong (Soave, 1972), and Patel and Teja (1982). The empirical correlations include Dranchuk, Purvis, and Robinson (1974), Dranchuk and Abou-Kassem (1975), Brill and Beggs (1973), Shell Oil Company (Kumar, 2004), Gopal (1977b), Hall and Yarborough (1973), Sanjari and Lay (2012a), Heidaryan et al. (2010), Azizi et al. (2010), and Kamari, Gharagheizi, Mohammadi, and Ramjugernath (2016). The list of the outcomes of a statistical assessment of the GMDH model and the previously published correlations and EOSs are available in Table 2. Evidently, in this table, the proposed model demonstrates the most accurate performance compared to other models when comparing the RMSE, the average absolute percentage relative error, and the regression coefficient. The Hall Yarborough correlation was found to be the next most accurate model followed by the Patel-Teja EOS model based on the statistical information presented in this table.
Noticing the APRE and AAPRE values of the models in this study reveals that, while the APRE value of some of the models is not smaller than those of others, their AAPRE is. The definition of APRE can be used to explain this observation. As a matter of fact, APRE cannot be used by itself to approve or reject a model. For instance, an APRE value close to zero would be obtained if half of the data points are overestimated by a model and the remaining half are underestimated, which would in return present a false assessment of the model performance. However, if some data points are estimated accurately by a model and the remaining few data points are either underestimated or overestimated, a positive or negative APRE value would be obtained, respectively, which again cannot testify the model performance. As an example, the Van der Waals's model has a much smaller APRE value than that of Dranchuk-Purvis-Robinson. However, it is considered to be a less accurate model than the latter one. Another way to illustrate the performance of the models and to compare them more comprehensively is by using graphical analyses. The performance evaluation is done through several graphical analyses considering a number of most popular models from both quantity and quality standpoints. Figure 3 presents the cross-plot of the developed model in which the calculated data is plotted against the measured data for both testing and training sets. The location of the majority of the data points on the y = x line provides the precise predictions of the developed GMDH model for the training and testing phases. The error distribution of the model is shown in Figure 4. As the figure shows, most of the data points are near the zero percent error line. Therefore, as the  gas compressibility factor increases, the GMDH model shows no systematic error trend. Worth mentioning that most of the previously published models suffer from an error trend. The figure indicates that the error of the testing set is smaller than the training set. The distribution of the relative error of predictions is plotted in Figure 5. In addition to the distribution of data points, the normal distribution is plotted indicated by the red line. The figure indicates that a Weibull or lognormal distribution may better describe this distribution and that most of the data points (predictions) have a relative error close to zero demonstrating the accuracy of the model in predicting gas compressibility factors. Figures 6 and 7 depict the statistical information reported in Table 2 for a graphical demonstration of the performance of the models. Further, these figures indicate that models with a smaller RMSE do not necessarily have a smaller average absolute percent relative error. This means that when performing statistical analyses, care should be taken to avoid the misinterpretation of the results by focusing on only one statistical parameter.
As can be seen in Figure 7, Beggs-Brill, Van der Waals, Gopal, Azizi, and Shell Oil Company models have the highest AAPRE values meaning that their predictions are less accurate than other models. It can be observed that the proposed model outperforms the former models by the AAPRE of 2.89%, 2.85%, and 2.88% for the training, testing and entire data sets, respectively. Further insight into performance of the model can be observed in Figure 6 with lower RMSE value compared to former published models.
In order to investigate the accuracy of the GMDH model versus former models' correlations, a cumulative frequency plot of the models are illustrated in Figure 8. This figure helps to achieve a better quantitative evaluation of the developed model when compared to the previously published models. This figure shows that the GMDH predicts nearly 52% of the data with an absolute relative error of 2%. More importantly, the model can predict 75% of the data points with an absolute relative error of only 4%. Thus, the model predict the gas compressibility factor with the smallest absolute relative error for any number of data points included in the predictions. This verifies the consistency of the developed GMDH model in accurately estimating gas compressibility factor within a wide range of reservoir conditions. The figure also shows that Hall-Yarborough correlation is the second model with high performance and accuracy that   can estimate gas compressibility factor with small error values when any number of data points are considered. Another important finding from this figure is the comparison between the Lawal-lake-Silberberg EOS model and the Dranchuk-Purvis-Robinson correlation. While Lawal-lake-Silberberg EOS is more accurate in estimating gas compressibility factor for up to 50% of the data points, Dranchuk-Purvis-Robinson correlation becomes the superior model when more than 50% of the data points are included. This trend changes again when 80% or more of the data points are included.
Finally, the predicted compressibility values by the GMDH model visually represented in Figure 9 against the experimental data at different pseudo-reduced pressure values and constant pseudo-reduced temperature of 0.7 to confirm the capability of the developed model in accurately estimating natural gas compressibility factor at different conditions. The experimental trend in this  figure indicates that by increasing the pressure, gas compressibility factor first decreases and then increases. This trend has been accurately estimated by the developed GMDH model as shown by the dashed line in the same figure.

Conclusions
In this work, GMDH was used to estimate the compressibility factor of natural gas. The GMDH model outperformed other models and provided higher accuracy at the various gas conditions. This was confirmed by measuring the RMSE, average absolute percent relative error, and regression coefficient to be 0.03, 2.88%, and 0.92, respectively. The Hall Yarborough correlation was determined as the second most accurate correlation for estimating the natural gas compressibility factor. Besides, the error distribution curve analysis indicated that the presented model in this study does not have an error trend when predicting very low and very high compressibility factor values. The experimental trend of gas compressibility factor showed that by increasing the pressure, the Z-factor first decreases and then increases. This trend was perfectly shown by the developed GMDH model in this work.
The results of this study show that generally, the correlations in the recent researches are based on limited data sets. Therefore, they are only able to predict the compressibility factor within limited ranges of pressure and temperature conditions. While the proposed GMDH model can accurately predict the natural different gas compressibility factor at low and high temperature and pressure conditions. The GMDH model is a data-driven based model, and its accuracy and reliability strongly depend on the data set used for its development. Using this model for temperature and pressure conditions beyond this study may lead to more error.

Disclosure statement
No potential conflict of interest was reported by the authors.