Toward predictive modelling of construction cost overruns using support vector machine techniques

Abstract The development of cost overrun prediction models using data mining techniques has increased considerably in recent years. Estimating the final cost of construction projects is essential during the contract award stage of the building process, and project variables drawn from archival data are important in developing prediction models. This research examines the effectiveness of support vector machines (SVMs) in predicting construction project cost overruns using data from archival records. The independent variables, namely number of stories, gross floor area, change in scope, contract type, provisional sum, tendering type, and initial contract sum, were extracted from historical records. SVM models using linear, RBF, and polynomial kernel functions were developed and compared. The results revealed that the linear kernel SVM model could produce accurate construction cost predictions, with an R2 of 0.99, RMSE of 0.099, MAE of 0.05, MAPE of 0.278, and MSE of 0.01 on the test data. Taken together, the results indicate that gross floor area, number of stories, tendering method, and change in scope are reliable indicators of cost overruns in the construction sector. The developed SVM model can be applied as a cost-estimating tool to predict potential cost overruns for Ghanaian construction projects.


Introduction
Cost performance is among the objective indicators for evaluating the success of construction projects. Projects in the construction industry are perpetually exposed to cost overrun issues (Asghari et al., 2021; Cheng, 2014; Chipulu et al., 2019; Coffie et al., 2019; Shehu et al., 2014; Sweis, 2013). Prior studies have demonstrated the danger of litigation arising from cost escalations. Cost-benefit is one of the primary concerns and a significant variable affecting whether a project is completed (Chipulu et al., 2019).
Construction cost is the owner's portion of the expenses necessary for the project's effective completion (Subramani et al., 2014). In the literature, "cost overrun" refers to situations where a project's overall cost exceeds its initial cost. According to Love et al. (2015), cost overrun denotes "a budget increase, cost increase, or cost growth." The term also describes the difference between the final cost of the project and the contract award amount (Flyvbjerg, 2014; Kaliba et al., 2009; Sun & Meng, 2009). Cost overruns are distinct from cost growth or escalation primarily in that cost escalations are anticipated, whereas overruns are not (Love et al., 2015). Numerous circumstances can cause cost overruns in construction projects (Heravi & Mohammadian, 2021).
Research into cost overruns in construction projects has a long history. Early studies on this topic, notably the works of Flyvbjerg, suggest two reasons why projects overrun their time and cost: strategic misrepresentation and optimism bias (Flyvbjerg, 2014). However, Love et al. (2011a) argued that there are instances where neither strategic misrepresentation nor optimism bias can be said to occur, yet projects still overrun in cost and time, especially in social infrastructure delivery. To buttress this point, Love et al. (2011b) contended that intermediary occurrences and actions, so-called pathogens, are the foremost cause of overruns. These pathogens are mostly design-related and, more specifically, design errors.
Similarly, numerous factors are thought to affect building costs and cost overruns, and when employed properly, these factors can help create prediction models. The causes of cost overruns have been identified as risk and uncertainty not anticipated at the project's inception (Ökmen & Öztaş, 2010); changes in scope during the project execution phase, or as fresh information becomes available (Love et al., 2005); strategic misrepresentation in planners' behavior and the optimism bias that projects will go as proposed (Flyvbjerg, 2009; Lovallo & Kahneman, 2003); and, finally, the view that practitioners are vulnerable to mischief.
A growing body of research has analyzed the correlation between project factors and projected the extent of overruns (Al-Razgan et al., 2014; Coffie et al., 2019). Studies that modeled the association between variables and cost overruns raise two significant difficulties. First, the findings appear to conflict: for instance, studies on the percentages of overruns in the road sector report contradictory statistics. Second, it remains unclear which machine learning technique provides the most efficient outcome. Assessing the comparative ability of variants within a machine learning technique is therefore necessary for modeling construction projects' cost overruns.
This study therefore compares SVM models using linear, RBF, and polynomial kernel functions for modeling construction project cost overruns based on archival data. The work focuses on prediction accuracy and suitability for interpreting cost trends.
Although the psychological strategists' argument has some merit, some contend that the evidence is unreliable and cannot be scientifically confirmed because no proof of a causal relationship is shown (Love et al., 2011b, 2015). The psychological strategists' perspective of Flyvbjerg et al. (2002) was specifically criticized by Osland and Strand (2010), who concluded that Flyvbjerg et al. employed the logic of doubt in their claim that inaccurate cost forecasting results from optimism bias and strategic misrepresentation. Love et al. (2011b) suggested that cost overruns happen due to a series of pathogenic impacts latent in the project system, countering theories that rely on precisely planned events. However, before such characteristics become apparent, project stakeholders are frequently unaware of the effects that certain decisions, behaviors, and procedures might have on project costs. Studies such as Aibinu and Pasco (2008), Odeck (2004), and Odeyinka et al. (2012) substantially support this viewpoint. In essence, overruns are not a matter of projects not going according to plan (budget) but rather of plans not going according to the project (Ahiaga-Dagbui & Smith, 2014; Love et al., 2011b). In other words, latent circumstances, technological designs, risk profiles, or contingencies do not always reflect what was anticipated at the planning stage. Consequently, Love et al. (2018) supported the idea that construction stakeholders could develop strategies to manage and control the cost of project execution if they could ascertain the likelihood of modifications to the final budget.

Use of data mining techniques
On the assertion that causes of cost overruns must be supported by scientific evidence, efforts have been made to create cost overrun prediction models. Bordat et al. (2004) built a predictive model in the specialized sectors of bridge maintenance, road resurfacing, and construction. They identified the following causal elements: project start time, project location, weather, bid amount, bid difference from the consultant's estimate, successful bid, and competition in winning the contract. Using multiple linear regression, these variables were used to analyze the correlation between factors and overrun. Attalla and Hegazy (2003) created two cost overrun models using an artificial neural network (ANN) and multiple linear regression; factors other than the seven stated above were found in their work to lessen the impact on cost overrun. The performance of the constructed models was close, with the ANN performing only slightly better than the MLR.
Odeck (2004) created an MLR model for cost overrun in the Norwegian road construction industry and found that smaller projects tended to go over budget more than bigger ones. Ahiaga-Dagbui and Smith (2014) created an ANN model in the area of water projects using 1,600 projects; their model's predictions reached roughly 87% of the actual cost. Project scope, purpose, operating region, delivery partner, project duration, expected cost, type of soil, site access, site condition, tendering technique, number of projects, and project requirements are the factors covered in their model. Similar models were developed using variables such as the number of stories, gross floor area, and procurement route (Walker, 1995). According to Love et al. (2005), story heights and gross floor area are important variables when developing cost overrun models.
Data mining methods help identify underlying patterns in massive amounts of data. Regression, association rule discovery, classification, and clustering problems can all benefit from these methods (Cortez et al., 2009; Romero & Ventura, 2007). The capacity of data mining algorithms to identify nonlinear correlations in data is one of their primary strengths. Data mining methods have been successfully applied in a variety of industries, including engineering (Chou et al., 2011), agriculture (Cortez, 2010), medicine (Delen, 2009), finance (Yeh & Lien, 2009), and construction (Bala et al., 2014). To estimate cost overruns in construction, the current study evaluates the predictive ability of support vector machine (SVM) data mining technique variants.
Consequently, SVM models using linear, RBF, and polynomial kernel functions were used in this study to model and forecast construction project cost overruns. Furthermore, the effectiveness of these techniques for predicting the cost overrun of construction projects was measured and compared.

Data collection
Data collected for analysis in this study came from contractors, development units of universities, and building and civil engineering consultants in the private sector. The projects were executed between 2011 and 2016, and costs were adjusted for inflation. The data were homogeneous, covering new building projects for the educational sector. The selected projects were government sector projects from the period when the Ghanaian government decided to expand infrastructure within the education sector. The factors used in the model's creation were determined by extracting information from readily available historical data and consulting with construction experts involved in the delivery of these projects. The number of stories and gross floor area, which indicate the scope of projects as suggested by Love et al. (2005) and Walker (1995), were extracted.

Cost prediction model based on a support vector machine
The Support Vector Machine (SVM) is a machine learning technique for classification and regression based on statistical learning theory. SVM builds on the Vapnik-Chervonenkis (VC) theory and the Structural Risk Minimization (SRM) principle (Vapnik, 1999). By simultaneously decreasing the upper bound of the generalization error and minimizing the fitting error, the formulation adheres to the SRM principle rather than the more conventional Empirical Risk Minimization (ERM). This gives the SVM model excellent generalization capabilities. Input features are mapped non-linearly by the SVM into a high-dimensional feature space using a kernel function, and the performance of the SVM depends on the complexity of the regression function determined by that kernel.
The goal of Support Vector Machine Regression (SVR) is to find a linear regression hyperplane that fits all of the sample points while minimizing their overall deviation from the hyperplane. The fitted function is then applied to forecast output values for unseen data in a test set.
Suppose there is a training sample set $\{(x_i, y_i)\}_{i=1}^{n}$, where each input $x_i$ has a matching output value $y_i$. The regression function $f(x_i)$ can be described as

$$f(x_i) = w^{\top} \varphi(x_i) + b \tag{1}$$

where $w \in \mathbb{R}^{N}$ is the weight vector, $b$ is the bias, and $\varphi(x_i)$ maps the data into a high-dimensional feature space.

Equation (1) is required to fit each sample point to within a precision $\varepsilon$:

$$|y_i - f(x_i)| \le \varepsilon \tag{2}$$

Because a certain fitting error is unavoidable and an ideal hyperplane is difficult to locate exactly, the slack variables $\xi_i, \xi_i^{*}$ and the penalty parameter $C$ are introduced, and the regression fitting problem is transformed into a convex optimization problem:

$$\min_{w,\,b,\,\xi,\,\xi^{*}} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^{*}\right) \quad \text{s.t.} \quad \begin{cases} y_i - w^{\top}\varphi(x_i) - b \le \varepsilon + \xi_i \\ w^{\top}\varphi(x_i) + b - y_i \le \varepsilon + \xi_i^{*} \\ \xi_i,\, \xi_i^{*} \ge 0 \end{cases} \tag{3}$$

For Equation (3), Lagrange multipliers $\alpha_i, \alpha_i^{*}$ are used to generate the dual form of the Lagrange function. At this phase, the main goal of the optimization is first to identify the feature space, then to locate the flattest function in that space that complies with the constraints, and to use that function to resolve the nonlinear problem. When the kernel function is defined as

$$K(x_i, x_j) = \varphi(x_i)^{\top} \varphi(x_j) \tag{4}$$

the regression fitting function becomes

$$f(x) = \sum_{i=1}^{n} \left(\alpha_i - \alpha_i^{*}\right) K(x_i, x) + b \tag{5}$$
There are many choices of kernel function, and SVR relies heavily on the kernel: it converts nonlinear problems into linear or approximately linearly separable ones by mapping the input vectors onto a higher-dimensional feature space. Common kernels include linear, polynomial, radial basis, and sigmoid functions (Chapelle et al., 2002; Saunders et al., 1998). With the appropriate parameters, different kernel functions result in different prediction estimates (Müller et al., 2018). Choosing the kernel function and its parameters is therefore a crucial issue in a support vector machine, and the necessary parameters must be carefully specified.
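For reference, the three kernels compared in this study take the following standard forms, where $\gamma$ is the kernel coefficient, $r$ the constant term, and $d$ the polynomial degree:

$$K_{\text{linear}}(x_i, x_j) = x_i^{\top} x_j, \qquad K_{\text{poly}}(x_i, x_j) = \left(\gamma\, x_i^{\top} x_j + r\right)^{d}, \qquad K_{\text{RBF}}(x_i, x_j) = \exp\!\left(-\gamma \lVert x_i - x_j \rVert^{2}\right)$$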

Cross-validation and grid search
The selection of hyperparameters is one of the key difficulties in machine learning models. For support vector machine models, the hyperparameters to be selected depend on the kernel function. This investigation uses the linear, radial basis, and polynomial kernel functions: the linear kernel has one hyperparameter, the RBF kernel has two, and the polynomial kernel has three. The choice of SVR hyperparameters affects the performance of the model, so the penalty parameter C, gamma, and degree must be chosen properly to compare the prediction accuracy of the different kernel functions. To find the best parameter values for the linear, radial basis, and polynomial kernels, we employed the grid search method of Hsu et al. (2003) with 10-fold cross-validation. The grid search approach iterates through all potential parameter combinations, evaluates the effect of each combination on the model's performance, and returns the best set of hyperparameters. For the linear kernel, the best value of C must be chosen using the training data; for the RBF kernel, grid search selects the best combination of C and gamma; and for the polynomial kernel, the ideal combination of the three parameters was likewise chosen using the training data.
Under the k-fold cross-validation method employed in this work, the training set is divided into k folds of equal size; each fold serves once as the validation set while the remaining k-1 folds act as training examples, so each observation is predicted exactly once. The prediction results are averaged across the folds to evaluate performance. The justification for cross-validation is that it reduces the likelihood of overfitting and improves the regression's capacity for generalization.
Using a grid search means that several hyperparameter combinations are tested, and the combination with the highest average score across all k folds is chosen. After the optimal parameters were found, the model was retrained to create the final regression model.
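A sketch of this tuning procedure, assuming scikit-learn's `GridSearchCV` with 10-fold cross-validation; the candidate grids below are hypothetical, as the exact values searched are not reported in the paper:

```python
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Hypothetical candidate grids; only the kernels and their tunable
# hyperparameters (C, gamma, degree) come from the paper.
param_grids = {
    "linear": {"kernel": ["linear"], "C": [0.01, 0.1, 1, 1.25, 10]},
    "rbf": {"kernel": ["rbf"], "C": [0.01, 0.1, 1, 10],
            "gamma": [0.01, 0.1, 1, 4]},
    "poly": {"kernel": ["poly"], "C": [0.1, 1, 10],
             "gamma": [0.1, 0.2, 0.5], "degree": [1, 2, 3]},
}

best_models = {}
for name, grid in param_grids.items():
    # Exhaustively score every parameter combination with 10-fold CV.
    search = GridSearchCV(SVR(), grid, cv=10,
                          scoring="neg_mean_squared_error")
    search.fit(X_train, y_train)  # standardized training data (see below)
    best_models[name] = search.best_estimator_  # refit on full training set
```

By default, `GridSearchCV` refits the best estimator on the full training set, which matches the retraining step described above.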

Data processing
Data pre-processing, which includes data cleaning and transformation, is useful before applying a machine-learning model (Han et al., 2012). These techniques are applied to the raw data to clean it and make it consistent by converting data values into a specific range (Patel & Mehta, 2011). In machine learning, the process of scaling features into dimensionless units is known as feature standardization or scaling. To achieve this, the means and standard deviations of all features are calculated on the training data, and z-score standardization is applied to transform the features into a space with zero mean and unit variance: the transformed value is produced by subtracting the mean from the original value and dividing by the standard deviation. This puts the feature vector into a format appropriate for machine learning methods. The test data were standardized using the same parameters (mean and standard deviation) estimated from the training data.
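A minimal sketch of this standardization step, assuming scikit-learn's `StandardScaler` (the paper does not name the library used):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Estimate each feature's mean and standard deviation on the training data,
# then apply the z-score transform z = (x - mean) / std to both subsets.
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)  # reuse the training-set parameters
```

Fitting the scaler on the training data only, then reusing it on the test data, avoids leaking test-set statistics into the model.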

Data partitioning
After collecting the data from different regions, we shuffled the data to prevent bias during modeling. The dataset was split into two subsets: training and testing data. The training data were used to train the models, and the test set was used to evaluate their performance. The performance of the SVM with the different kernel functions was compared on both the training and test datasets.
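A minimal sketch of this partitioning step, assuming scikit-learn and a feature matrix `X` with target vector `y` of cost overruns (variable names are hypothetical; the roughly 80/20 proportion follows from the 728/183 split reported in the Discussion):

```python
from sklearn.model_selection import train_test_split

# Shuffle and hold out ~20% of the cases for testing; 728 training and
# 183 test cases are reported later, i.e. roughly an 80/20 split.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42  # seed is an assumption
)
```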

Model evaluation metrics
We compute several performance metrics to evaluate SVR with the different kernel functions for cost prediction. K-fold cross-validation has been widely used to assess the performance of machine learning techniques (Kumar et al., 2015); we conduct a 10-fold CV to evaluate the models. We used CV to train the models on the two datasets, and the metrics were calculated over each dataset. The efficiency of the models was assessed and compared using mean square error (MSE), mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and R2.
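In their standard forms, these metrics are defined as:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|, \qquad R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}$$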
where $y_i$ denotes the actual value of cost overrun, $\hat{y}_i$ denotes the predicted value, $\bar{y}$ is the mean of the $y_i$, and $n$ denotes the total number of observations.
We illustrate the models' performance using scatter plots to visualize forecast confidence and the capacity to extrapolate to unseen data, and we report and compare model performance using the statistical metrics above.
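A sketch of this evaluation and visualization step, assuming scikit-learn metrics and matplotlib (the paper does not name its tooling); `best_models` is the hypothetical dictionary of tuned models from the grid-search sketch above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

for name, model in best_models.items():
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"{name}: MSE={mse:.3f} RMSE={np.sqrt(mse):.3f} "
          f"MAE={mean_absolute_error(y_test, y_pred):.3f} "
          f"MAPE={100 * np.mean(np.abs((y_test - y_pred) / y_test)):.3f} "
          f"R2={r2_score(y_test, y_pred):.3f}")
    # Predicted vs. actual scatter: points near the diagonal indicate a good fit.
    plt.scatter(y_test, y_pred, alpha=0.5, label=name)

# Diagonal reference line for a perfect prediction.
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "k--")
plt.xlabel("Actual cost")
plt.ylabel("Predicted cost")
plt.legend()
plt.show()
```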

Results
Table 1 displays the descriptive statistics. The projects used in the study had an average duration of nine months, with minimum and maximum durations of one and 24 months, respectively. The smallest gross floor area was 12 m² and the largest was 8,902 m², with an average of 1,000.9 m². The average number of stories is 1.5, with minimum and maximum values of 1 and 6. The study's projects had a minimum completion cost of 13,912 cedis ($4,637.33) and a maximum completion cost of 9,978,874 cedis ($3,326,291.33). The average completion cost is 1,069,461 cedis ($356,487.00). The projects had a minimum contract sum of 12,648 cedis ($4,216.00), a maximum contract sum of 9,030,655 cedis ($3,010,218.33), and an average contract sum of 959,919 cedis ($319,973.00).
Pearson correlation coefficients between the variables were computed and are presented in Table 2 and Figure 1. The results suggest positive and statistically significant relationships among the variables; all correlation coefficients were above 0.7.

Parameter tuning
The grid search on the training data led us to conclude that 1.25 is the ideal value of C for the linear kernel. Various combinations of C and gamma were tested for the radial basis function; following the grid search, 0.01 and 4, respectively, were chosen as the best values for C and gamma, and this parameter pair was used to build the model on the training data. The polynomial kernel function has more hyperparameters than the radial basis and linear kernels and hence takes longer to train. The optimal parameters for the polynomial kernel were C = 1, gamma = 0.2, and degree = 1. Each model was then retrained using its optimal parameters, as sketched below.
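As a sketch, the final models with these tuned values could be instantiated as follows, assuming scikit-learn's `SVR` (variable names are hypothetical; the parameter values are those reported above):

```python
from sklearn.svm import SVR

# Final models with the tuned hyperparameters reported above.
svm_linear = SVR(kernel="linear", C=1.25)
svm_rbf = SVR(kernel="rbf", C=0.01, gamma=4)
svm_poly = SVR(kernel="poly", C=1, gamma=0.2, degree=1)

for model in (svm_linear, svm_rbf, svm_poly):
    model.fit(X_train, y_train)  # retrain on the full training set
```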

Results of model
After the 10-fold cross-validation, MAPE, RMSE, MAE, MSE, and R2 were obtained for each of the models (SVM-linear, SVM-RBF, and SVM-polynomial) on the training and testing data. Ten-fold cross-validation was used to compare the effectiveness of the support vector regression models in estimating construction cost. All assessment metrics revealed that SVM-linear and SVM-polynomial had the best prediction capability, with the highest R2 and lower values for the remaining metrics (MAE, MSE, RMSE, MAPE) compared to SVM-RBF, as shown in Table 3. The results show that SVM-linear and SVM-polynomial outperform SVM-RBF on both the training and test data. Consequently, SVM-linear and SVM-polynomial were the most effective and competent models for predicting cost overrun.
Figures 2, 3, and 4 show the model performance using scatter plots for the SVM-linear, SVM-RBF, and SVM-polynomial models. As seen in these figures, the models estimate the building cost precisely and offer estimates of prediction uncertainty.
After the 10-fold cross-validation, MSE, MAPE, RMSE, MAE, and R2 were obtained, as shown in Figures 5 and 6 for training and test data, respectively.

Findings
The findings indicate a considerable positive correlation between the features and completion cost. The models are validated using scatter plots of observed and predicted values; Figures 2, 3, and 4 present these validation plots comparing the actual and predicted values for SVM-Linear, SVM-RBF, and SVM-Polynomial, respectively.
• The diagrams show a strong correlation between the models' predictions and the actual values, suggesting a satisfactory fit.
• The SVM-linear, SVM-RBF, and SVM-polynomial models demonstrated good performance in terms of prediction accuracy, MSE, MAE, RMSE, and MAPE.
• Each model has low values for MSE, MAE, RMSE, and MAPE, indicating the good performance of each predictive model in forecasting construction costs.
• The highest adjusted R-square values are recorded for the SVM-Linear and SVM-polynomial models, both at 98.3%. It is worth noting that the adjusted R-square for SVM-RBF is also very high, at 98.2%.

Discussion
The study was conducted with 911 cases of construction projects in Ghana between 2011 and 2016. Of these, 728 were used to train the models and 183 were used for testing. To evaluate the accuracy of the SVM model in predicting construction cost, three different kernels (linear, RBF, and polynomial) were selected for comparison.
• The performance of the three models was compared using various evaluation metrics. Based on the values of MAE, MSE, RMSE, MAPE, and adjusted R-square, the best prediction models are SVM-linear and SVM-polynomial.
• The result is not surprising as it is in agreement with other findings.
• In a study conducted by Lin et al. (2019), the authors found the PCA-PSO-SVM model to have a high predictive accuracy compared to other models.
• Our results show slightly better prediction performance than that of Tijanić et al. (2020), who used ANN to estimate the cost of road construction in the Republic of Croatia.
• Also, comparing our results to those of Al-Zwainy and Aidan (2017), who forecast the cost of the structure of infrastructure projects using ANN, we obtained better model performance.

Conclusion
During the execution of building projects, a vast amount of data is produced. Much can be done with this information when it is properly documented and saved for reuse. When meaningful information is extracted from it for additional analysis, it can be valuable to all stakeholders, but this documentation must happen during project delivery in a way that makes the data simple to utilize again.
SVM was used to create and evaluate cost forecasting models for 911 building projects carried out between 2011 and 2016. The SVM models with linear and polynomial kernels produced the most reliable and superior outcomes. To show that SVM models employing linear and polynomial kernels produce the most accurate construction cost valuations, we compared the prediction precision of models using linear, RBF, and polynomial kernel functions. On the test data, the model with a linear kernel estimated construction costs with an R2 of 0.99, RMSE of 0.099, MAE of 0.05, MAPE of 0.278, and MSE of 0.01. The models' output, particularly that of the linear kernel, points to a bright future for cost estimation and demonstrates how better data storage could increase the accuracy of construction cost prediction. Where data are available and can be expanded, the approach taken in this work could be applied to other estimation techniques.

Figure 1. Correlation coefficient results.

Figure 2. Scatter plot for the SVM-linear model.

Table 1. Descriptive statistics of projects.

Table 3. Comparison of the best prediction performances of the models.

• All three models fit the data satisfactorily. The low values for MSE, MAE, RMSE, and MAPE and the high values for adjusted R-square suggest that the models' training and test predictions align with the actual values; the models capture the relationships and underlying patterns in the data.