Reverse designs of doubly reinforced concrete beams using Gaussian process regression models enhanced by sequence training/designing technique based on feature selection algorithms

ABSTRACT The present paper introduces a practical and convenient artificial intelligence-based design approach for doubly reinforced concrete (RC) beams. Completed designs are obtained automatically from regression models and back-substitution (BS) procedures, satisfying the preassigned flexural strength and curvature ductility and calculating the serviceability parameters. The regression algorithms are developed by training multiple Gaussian Process Regression models on structural data. Furthermore, feature selection and a Chained training scheme with Revised Sequence (CRS) technique are implemented to enhance the training accuracy, providing acceptable accuracies (errors below 0.7%) in 91 interpolation designs. The CRS procedure improves the regression accuracy by predicting outputs sequentially, using the predictions of predecessor steps as inputs for the successor ones; the preciseness of the models is thus improved as training continues. Appropriate inputs and reasonable output sequences for CRS are determined using a feature selection-based procedure for obtaining optimal training. This procedure implements three feature selection methods (F-test, Neighborhood Component Analysis (NCA), and RReliefF) in a greedy algorithm, evaluating the relations among the design parameters. In summary, a direct design approach for doubly reinforced concrete beams is presented, which enables engineers to control moment capacities and curvature ductility easily, replacing inefficient iteration-based conventional design procedures.


Literature review
During the last two decades, artificial intelligence (AI) has been implemented in various aspects of the construction industry, providing evolutionary design, management, and monitoring solutions (Sun, Burton, and Huang 2020). A common topic has been structural health monitoring, such as in Rafiei and Adeli (2017), Tibaduiza et al. (2018), Melville et al. (2018), Hoshyar et al. (2020), Al-Rahmani, Rasheed, and Najjar (2013), Kang and Li (2020), Kang et al. (2015), Kang, Li, and Dai (2019), and Hoshyar et al. (2017), where structural damage was detected and safety was examined using machine learning (ML)-based algorithms, providing convenient evaluation tools for engineers. In addition, pattern recognition-based investigations using the Support Vector Machine (SVM) were conducted to evaluate the residual capacities of damaged structures in Zhang and Burton (2019), Yan et al. (2014), Shyamala, Mondal, and Chakraborty (2016), and Kohiyama, Oka, and Yamashita (2020); their preciseness was assessed through the mean square errors (MSEs) of the testing datasets. For component design, the authors in Naeej et al. (2013) and Alacalı, Doran, and Akbas (2006) used experimental data to train and test regression models with the aim of calculating the lateral confinement coefficients of reinforced concrete (RC) columns. Another example is Olalusi and Awoyera (2021), where the shear capacities of slender RC beams strengthened by steel fibers were predicted using ML models trained on 326 experimental datasets. On the other hand, it has also been common to employ artificial intelligence technologies to discover the behaviors of materials; for example, the mechanical properties of concrete mixtures using waste foundry sand were simulated in Behnood and Golafshani (2020) using decision tree algorithms based on collected data. In these abovementioned studies, data collection and validation were the major difficulties encountered. Therefore, the authors in Mitropoulou and Papadrakakis (2011), Cai, Pan, and Fu (2020), Dehkordi et al. (2012), Naser (2018), and Hore et al. (2016) used artificial data to construct AI models. For example, the authors in Mitropoulou and Papadrakakis (2011) replaced the fragility analysis of structural systems with a trained Neural Network (NN), reducing the computational effort for complex buildings. Similarly, for investigating the postfire flexural capacities of RC beams, the authors in Cai, Pan, and Fu (2020) developed an NN optimized by a genetic algorithm as an alternative to FEM analysis.

Motivation and significance of this study
This study explored the practical application of AI to structural engineering, where the structural behaviors and designs of doubly RC beams were investigated using multiple regression models. These AI models were trained on artificial data generated using Autobeam, an analytical program developed in Nguyen and Hong (2019) based on strain compatibilities and transformed sections. Big data was obtained by analyzing multiple beams with randomly selected input parameters (b, h, ρ t , ρ c , L, f y , f c ', M D , M L ). Hence, reverse designs were directly obtained from the trained models instead of being obtained through iterative procedures. This approach provided mathematical optimization based on computational statistics; a mathematical model of the structural data was built using regression algorithms to predict structural behaviors and designs. In addition, useful reverse engineering scenarios were established using five Gaussian Process Regression (GPR) models with five different kernel functions: Rational Quadratic, Exponential, Squared Exponential, Matern 5/2, and Matern 3/2 (Mathworks 2011). Moreover, overfitting was prevented by applying the cross-validation schemes provided by MATLAB during training.
This study implemented five GPR models for beam designs, where the appropriate model for each output was selected based on the accuracy represented by the MSEs. The GPR models required regression options such as kernel functions, basis functions, and fit methods to be defined, whereas artificial neural networks require users to establish the numbers of layers, neurons, and epochs (Hong 2019). The training problems caused by insufficient inputs were addressed by applying a Chained training scheme with Revised Sequence (CRS) technique based on a feature selection-based greedy algorithm. The proposed algorithm determined the features that significantly affected the outputs for training the GPR models using feature selection functions. As a result, the training qualities of the AI models were enhanced, providing structural design based on reverse analysis for engineers. MATLAB provides 23 feature selection methods, among which this study explores the F-test, RReliefF, and Neighborhood Component Analysis (NCA), which aid the design of doubly RC beams. The considered design parameters are listed in Table 1.

Evaluation parameters for regression models
The present study adopted the Mean Square Error (MSE) and Coefficient of Determination (R 2 ) calculated from validation datasets, following the recommendations of MathWorks (2021a), to evaluate the accuracies of the regression models. According to Botchkarev (2019) and Naser and Alavi (2020), MSE and R 2 are calculated as (1) and (2).
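The equations referenced as (1) and (2) do not survive in this version of the text; in the standard form used by Botchkarev (2019), with $y_i$ the target value, $\hat{y}_i$ the model prediction, $\bar{y}$ the mean of the targets, and $n$ the number of validation points, they read:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad (1)

R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \qquad (2)
```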
Validation datasets are a portion of the big data used to examine the accuracies of fitting models as well as to prevent the overfitting phenomenon (MathWorks 2021b). In the present study, the cross-validation technique was adopted to validate the regressions, allowing the ML models to use all data for both training and testing. The procedure is conducted using the following steps:
Step 1: Dividing the big data into k folds of data named S 1 , S 2 , . . ., S k .
Step 2: Training ML models using data from k-1 folds S 1 , S 2 , . . ., S k-1 . This model is then tested on the data in fold S k .
Step 3: Training ML models using data from k-1 folds S 1 , S 2 , . . ., S k-2 , S k . This model is then tested on the data in fold S k-1 .
. . .
Step k + 1: Training ML models using data from k-1 folds S 2 , S 3 , . . ., S k-1 , S k . This model is then tested on the data in fold S 1 .
Step k + 2: Testing errors in k steps from Step 2 to Step k + 1 are averaged.
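The steps above can be sketched as follows. This is a minimal illustration using numpy on a hypothetical one-dimensional regression problem; a least-squares line stands in for the GPR models used in the paper, and the data and fold count are assumptions.

```python
import numpy as np

def k_fold_mse(X, y, fit, predict, k=5):
    """Average the k held-out-fold test MSEs (Steps 1 to k + 2 above)."""
    folds = np.array_split(np.arange(len(y)), k)   # Step 1: divide data into S_1..S_k
    errors = []
    for i in range(k):                             # Steps 2..k+1: hold out one fold
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errors.append(np.mean((y[test] - predict(model, X[test])) ** 2))
    return float(np.mean(errors))                  # Step k + 2: average the k errors

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 200)
y = 2.0 * X + 0.1 * rng.standard_normal(200)       # hypothetical toy data

# A simple least-squares line stands in for the GPR models of the paper.
fit = lambda X_, y_: np.polyfit(X_, y_, 1)
predict = lambda coef, X_: np.polyval(coef, X_)
cv_mse = k_fold_mse(X, y, fit, predict, k=5)
```

Because every point serves once as test data and k-1 times as training data, the averaged MSE uses the full dataset without letting any model score on points it was fitted to.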
Validation parameters (MSEs) shall be calculated using normalized data because of their scale-dependent characteristics (Naser and Alavi 2020). These values provide a direct quantification of the fitting degree, enabling users to assess training easily. However, it should be noted that the quality of a regression model shall also be evaluated based on the errors in practical designs. These errors represent the mismatches between the values of the preassigned parameters and those provided by the AI-based designs. Therefore, the errors need to be controlled strictly, ensuring the usability of the designs. Detailed evaluations of the AI-based models in designs can be found in Section 3.3.

Importance of feature selection
This study evaluated the effects of one parameter on another by using feature scores, thus determining the appropriate inputs for each output. Note that features that significantly affect the outputs should be used, whereas those that are less influential can be omitted without significantly degrading the training quality. Furthermore, unrelated parameters should be excluded from the inputs to avoid generating unnecessary noise in the mapping procedures. A simple example of the improvement obtained by selecting reasonable inputs is presented below.

Example:
Let us assume the following function, where x, y, z, t ∈ {1, 2, . . ., 100}. It can be observed that x and y are the most influential parameters in calculating f, whereas z has a negligible contribution. Furthermore, f and t are not related. A set of 500 datapoints generated using (3), normalized from 0 to 1 for training, is listed in Table 2(a). Table 2(b) illustrates four Matern 5/2 GPR models trained on the normalized data, mapping different input combinations to one output (f). The results obtained from the 1st and 2nd models indicate that the accuracy increases when a nonrelevant parameter (t) is removed from the inputs, showing that using more inputs does not guarantee better training. In contrast, the MSEs of the 3rd and 4th models indicate that the training accuracy degrades only slightly when a less influential feature (z) is omitted; however, removing an important index (x) considerably damages the fitting quality.
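Because the function (3) itself is not reproduced above, the following sketch uses a hypothetical stand-in with the same character (x and y dominate f, z contributes marginally, t is unrelated), trained with Matern 5/2 GPR models from scikit-learn rather than MATLAB. The stand-in function, sampling, and split are all assumptions for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(1)
n = 500
x, y, z, t = (rng.uniform(1, 100, n) for _ in range(4))
f = x**2 + y**2 + 0.01 * z                      # assumed stand-in for (3)

def mse_with_inputs(cols):
    """Train a Matern 5/2 GPR on normalized data; return the held-out MSE."""
    X = np.column_stack(cols)
    X = (X - X.min(0)) / (X.max(0) - X.min(0))  # normalize inputs to [0, 1]
    Y = (f - f.min()) / (f.max() - f.min())     # normalize the output
    gpr = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gpr.fit(X[:400], Y[:400])
    return float(np.mean((gpr.predict(X[400:]) - Y[400:]) ** 2))

mse_all  = mse_with_inputs([x, y, z, t])   # all four candidate features
mse_no_t = mse_with_inputs([x, y, z])      # drop the unrelated feature t
mse_no_x = mse_with_inputs([y, z, t])      # drop the dominant feature x
```

As in Table 2(b), dropping the unrelated feature leaves the fit essentially intact, whereas dropping a dominant feature damages it severely.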

F-test algorithm
The importance of features is weighted by the probability values (p-values) of the F-test statistics. The F-test examines two models, where the first model predicts the considered output Y from a constant and the second model maps that output from the same constant and one input X. Then, the effect of X on Y is evaluated from the difference between the MSEs of the two models. X plays a crucial role in calculating Y if the second model shows a significant improvement over the first one, and vice versa. The F-test can only capture linear relations between two parameters, ignoring higher-order correlations.
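A minimal sketch of this scoring, assuming scikit-learn's `f_regression` (which fits the same pair of univariate models described above and returns per-feature F statistics and p-values); the toy data are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (300, 3))                     # three hypothetical inputs
Y = 5.0 * X[:, 0] + 0.1 * rng.standard_normal(300)  # only the first input matters

F, p_values = f_regression(X, Y)   # per-feature F statistics and p-values
ranking = np.argsort(-F)           # most influential feature first
```

The first feature dominates the ranking; the unrelated columns receive small F statistics. Note the caveat above: a feature related to Y only nonlinearly (e.g. symmetrically) could still score poorly here.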

NCA algorithm
NCA is an evaluation method based on the K-Nearest Neighborhood (KNN) analysis. KNN predicts unseen data based on the closest points in the big data. Let us consider two datapoints, A (x 0 ; y 0 ; z 0 ) and B (x 1 ; y 1 ; z 1 ), where x and y are the inputs, whereas z is the output. Then, a simple distance function between A and B can be derived as (4).
In (4), the z feature is omitted because it is the output. The distance calculated by (4) treats all inputs equally. Therefore, weight factors are added in (5), emphasizing the importance of the key features.
The NCA feature selection algorithm adjusts the weight factors to maximize the probability that the outputs predicted using the closest points match the target values (Navot et al. 2005).
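The weighted distance in (5) and its effect on nearest-neighbor prediction can be sketched as follows. The weights shown are assumed values rather than ones fitted by the NCA optimizer, and the toy function and data are hypothetical.

```python
import numpy as np

def weighted_distance(a, b, w):
    """Distance over input features only; the output z is omitted, as in (4)."""
    return np.sqrt(np.sum(w * (a - b) ** 2))

def knn_predict(X_train, z_train, x_query, w):
    """Predict z for a query from its single closest training point."""
    d = [weighted_distance(xt, x_query, w) for xt in X_train]
    return z_train[int(np.argmin(d))]

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, (400, 2))      # inputs: x (relevant) and y (irrelevant)
z = np.sin(3 * X[:, 0])              # the output depends on x only

X_tr, z_tr, X_te, z_te = X[:300], z[:300], X[300:], z[300:]
w_nca     = np.array([1.0, 0.0])     # assumed NCA-style weights favouring x
w_uniform = np.array([1.0, 1.0])     # unweighted distance, as in (4)

err = lambda w: float(np.mean([abs(knn_predict(X_tr, z_tr, q, w) - zt)
                               for q, zt in zip(X_te, z_te)]))
err_weighted, err_uniform = err(w_nca), err(w_uniform)
```

Down-weighting the irrelevant input makes the nearest neighbor genuinely "nearest" in the feature that matters, which is exactly the quantity the NCA optimizer maximizes when it adjusts w.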

RReliefF
The RReliefF algorithm scores the features by investigating the probability of two different outputs resulting from two nearest inputs. The method is adapted from the ReliefF feature selection procedure for classification problems and is suitable for ranking features in distance-based supervised models (Mathworks 2011; Robnik-Šikonja and Kononenko 1997).

Feature scores of structural parameters
Tables 3 and 4 list the feature selection scores among the 12 design parameters based on the F-test and NCA methods, respectively. These scores determine the dominant features for predicting the design parameters, including the section dimensions and rebar ratios, based on the F-test, RReliefF, and NCA.
Figure 1 shows the selected features and their scores for the 12 parameters obtained from 20,000 structural datasets based on the NCA method, exploring the relations between the indices. For example, the beam dimensions (b and h) can be omitted in predicting both the tensile (ρ t ) and compressive (ρ c ) rebar ratios owing to their low feature selection scores (purple boxes in Figure 1). In contrast, the rebar ratios should be included as inputs in trainings on the section dimensions because of their high feature scores (blue boxes in Figure 1). A detailed investigation of the feature scores is presented in Section 3.
Figures 2-5 compare the feature scores for predicting the section dimensions and rebar ratios using 20,000 datasets based on the F-test, NCA, and RReliefF. Some differences are observed, requiring exploration of the most appropriate method for the considered structural datasets.
As shown in Figures 2-5, the most important features recommended by the F-test, NCA, and RReliefF methods are used to train Matern 5/2 GPR models on four outputs (b, h, ρ t , ρ c ). Various GPR models are trained on 5,000 structural data points simultaneously to select promising models interactively. The diagnostic measures of the training results (Figures 2-5) are used to identify the validated models. The results are compared to identify the most reliable feature selection method for training AI models on structural datasets for the design of doubly RC beams, indicating that NCA is better suited to training GPR models on structural data than the F-test and RReliefF methods in all training results (Figures 2-5). In addition, Figures 2-5 compare trainings performed using inputs based on NCA scores with those including all features. NCA captures the important indexes, yielding MSEs that approximate those obtained from trainings using all features. In this case, the training performed using all 11 features places all parameters as inputs to predict one output (11 inputs to one output), which is often impossible in reverse scenarios when there is more than one output. Furthermore, the NCA-based results demonstrate better accuracy in trainings on b, ρ t , and ρ c (Figures 2, 4 and 5) than those observed when all features are given as inputs.

Introducing CRS
The CRS concept is implemented in the present study with the aim of increasing the accuracy of AI-based models by adding features obtained from previous trainings to the next training. In doing so, this method overcomes the training difficulties incurred by insufficient inputs by using features determined in predecessor trainings as inputs for successor steps. An example is presented to demonstrate the enhancement of training obtained by implementing CRS. Example: Let us assume the following function, where x, y, z ∈ {1, 2, . . ., 100}. Problem: Given f = 100,000 and x = 50, find appropriate values of y and z.
Two Matern 5/2 GPR models are trained using the 500 normalized datasets presented in Table 5, where one model is built according to the PTM method and the other is built using the CRS procedure, as shown in Figure 6.
The given problem provides only two known variables (x 0 = 50 and f 0 = 100,000), whereas three knowns are required to determine the exact solutions. A PTM model, therefore, produces an error of 1.164% because its outputs (y 1 = 27.8725 and z 1 = 53.0767) are predicted using two preassigned inputs (x 0 = 50 and f 0 = 100,000). In other words, AI results in errors because of the lack of information. In contrast, a CRS model is developed in two steps. First, a value of y 2 = 27.8725 is obtained from the two knowns (x 0 = 50 and f 0 = 100,000). Then, a value of z 2 = 55.4034 is predicted from three knowns, including the two preassigned parameters (x 0 = 50 and f 0 = 100,000) and the one obtained in the first step (y 2 = 27.8725). A highly precise value of z is obtained, with a negligible error of 3.3E-4%, because the AI contains sufficient data (three knowns) to calculate z in the second step. In conclusion, CRS enhances the accuracy of AI-based designs by using the outputs of predecessor steps as inputs of successor steps, adding design constraints to the AI as training continues.

CRS in beam design problems
Table 6 defines the reverse design scenario considered in the present study. First, the AI determines four design parameters (b, h, ρ t , ρ c ) from eight preassigned inputs (ØM n , M u , µ Ø , L, f y , f c ', M D , M L ). Then, the parameters controlling the service criteria are calculated by Autobeam using the BS procedure, completing the practical design problem. This scenario enables engineers to directly control the capacities and ductility of beams, which is impractical using conventional methods based on iterated forward calculations.
Table 7 shows the ranges of the big data used for training the GPR models. Skewness of the structural datasets should be avoided to ensure the training and design accuracy of the regression algorithms. Training is easier if the datasets are extracted over ranges similar to those used for obtaining the feature scores.
Figure 7 shows the selected features and their scores for the 12 parameters obtained from 20,000 structural datasets based on the NCA feature selection method, exploring reasonable sequences for solving the reverse design scenario defined in Table 6. As shown in Figure 7(a), the section dimensions present low scores (purple boxes) in predicting the rebar ratios, whereas the high scores of the rebar ratios (blue boxes) indicate their significant effects on the predictions of the section dimensions. Therefore, the rebar ratios should be determined first and then employed as inputs for calculating the section dimensions. Similar investigations are presented in Figure 7(b,c), indicating that ρ t must be predicted before ρ c and h must be determined before b. Finally, the most appropriate sequence is ρ t , ρ c , h, b. In the present study, a greedy algorithm that automatically performs the abovementioned investigations for the sequence determination is developed. Furthermore, the best input combinations and suitable kernel functions for the GPR models are determined. The proposed method is applied in the following four steps.
Step 1: Appropriate input combinations and GPR kernel functions are determined by attempting multiple trainings. These trial trainings are conducted using 2,000 datasets to avoid unnecessarily long running times. The eight preassigned parameters (ØM n , M u , µ Ø , L, f y , f c ', M D , M L ) are scored based on the NCA algorithm (Table 8(a)). Eight input combinations are then defined, including one to eight inputs. In each combination, the higher-score indexes are selected; for example, the first combination uses only one parameter (µ Ø ) with the highest score (6.49), and the second combination adds the second-highest scored feature (ϕM n , 3.84) to the inputs. Finally, the eighth combination includes all eight parameters as inputs. Table 11 presents the trial training for determining the input combination when ρ c is the output of Step 1.
• Step 1(d): Steps 1(a)-1(c) are repeated with h, ρ t , and ρ c individually assumed as the output of Step 1. Trainings under these assumptions are presented in Tables 9-11. Based on the results, ρ t is selected as the output for Step 1, six inputs (µ Ø , f c ', f y , ØM n , M L , M u ) are selected, and the Squared Exponential function is applied.
Step 2: ρ t can be used as an input of Step 2, as it is predicted in Step 1. Thus, Step 2 can use nine parameters (ØM n , M u , µ Ø , L, f y , f c ', M D , M L , ρ t ) as inputs; the feature scores of the inputs, therefore, need to be calculated again using NCA. The procedures then individually assume b, h, and ρ c as the output of Step 2, testing nine input combinations based on feature selections with five different GPR kernel functions (Table 13). ρ c is selected as the output of Step 2, predicted from the nine inputs (ØM n , M u , µ Ø , L, f y , f c ', M D , M L , ρ t ) by the Squared Exponential model.
Step 3: This step uses 10 parameters (ØM n , M u , µ Ø , L, f y , f c ', M D , M L , ρ t , ρ c ) as inputs because ρ c is predicted in Step 2. The testing procedures in Steps 1 and 2 are repeated (Table 14), indicating that h is the output of Step 3; the most suitable inputs are ØM n , M u , µ Ø , L, f y , f c ', M D , M L , ρ t , ρ c ; and the Squared Exponential function must be applied.
Step 4: This step uses 11 parameters (ØM n , M u , µ Ø , L, f y , f c ', M D , M L , ρ t , ρ c , h) as inputs because h is predicted in Step 3. The testing procedures in Steps 1-3 are repeated for the remaining output b, suggesting that ten inputs (ØM n , M u , µ Ø , L, f y , f c ', M D , ρ t , ρ c , h) and the Squared Exponential function must be applied.
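Steps 1-4 amount to a greedy search over output orderings. The sketch below mirrors that loop on a hypothetical two-output toy problem, with `f_regression` standing in for the paper's NCA scores and a ridge model standing in for the GPR trial trainings; all names and data are assumptions.

```python
import numpy as np
from sklearn.feature_selection import f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n = 1000
known = {"g1": rng.uniform(0, 1, n), "g2": rng.uniform(0, 1, n)}
h = rng.uniform(0, 1, n)                        # hidden forward parameter
outputs = {
    "o1": known["g1"] + known["g2"] + 0.1 * h,  # nearly determined by the knowns
    "o2": h,                                    # recoverable only once o1 is known
}

def best_trial(inputs, target):
    """Rank inputs by F score, try nested combinations, return the best CV MSE."""
    names = list(inputs)
    X = np.column_stack([inputs[k] for k in names])
    order = np.argsort(-f_regression(X, target)[0])     # highest score first
    best = np.inf
    for m in range(1, len(names) + 1):                  # nested input combos
        mse = -cross_val_score(Ridge(), X[:, order[:m]], target,
                               scoring="neg_mean_squared_error", cv=5).mean()
        best = min(best, mse)
    return best

sequence, pool, remaining = [], dict(known), dict(outputs)
while remaining:                                 # one pass per CRS step
    trial_mse = {name: best_trial(pool, tgt) for name, tgt in remaining.items()}
    chosen = min(trial_mse, key=trial_mse.get)   # easiest output at this step
    sequence.append(chosen)
    pool[chosen] = remaining.pop(chosen)         # feed it forward as an input
```

Each iteration fixes the output that the currently available inputs predict best, then promotes it to an input, exactly as ρ t , ρ c , h, and b are fixed in turn above.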
The final training scheme is presented in Table 15 and Figure 8. The models for the designs are trained on 100,000 datasets using the determined training schemes (Table 16).

Design results
Table 17 shows four design cases that preassign a safety factor of 1 (M u = ϕM n ) and a ductility of 6. The proposed procedure satisfies the reverse designs without conducting iterations as in conventional methods. Figure 9 illustrates the beam designed in Table 17(a); it is noted that the arranged rebar ratios and section dimensions are rounded up to satisfy constructability, leading to slight increases in the design strength and curvature ductility. Finally, a beam design is produced using the preassigned design strength and ductility based on ACI 318-19 (Standard 2019). The reliability of the AI-based design is proved by a probabilistic investigation (Figure 10), in which 91 designs are conducted, showing that all errors are below 0.5%. In conclusion, accurate designs with reverse parameters can be obtained easily and rapidly.
In Table 18, significant errors are found in designs for a design moment capacity (ϕM n ) of 35,000 kN•m because the GPR models are not trained in this range. Figure 11 shows that the smallest errors are generated when the preassigned moment capacities are below 17,000 kN•m. The models start to lose their accuracy rapidly when inputs are placed near the boundaries of the big dataset, as shown in Figure 12 (design moment capacity (ϕM n ) of 15,000 kN•m). Finally, large errors occur if the inputs lie outside the datasets; considerable errors occur with preassigned values that are not available in the big data. It is, therefore, necessary to train ML models on extended ranges of the structural dataset. However, when inappropriately preassigned inputs are implemented and the cross-relations between the inputs are not known, the inputs need to be adjusted automatically and hierarchically (adjusting inputs to minimize design errors) to improve accuracy.
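One lightweight safeguard suggested by these observations is to screen preassigned inputs against the training ranges before attempting an AI-based design. The function name, margin, and numeric ranges below are assumptions for illustration (the 17,000 kN·m figure echoes Figure 11, not an exact training bound).

```python
# Sketch of a range guard: flag preassigned inputs that fall outside, or
# near the edge of, the ranges covered by the training big data.
def check_design_inputs(preassigned, training_ranges, margin=0.05):
    """Return a dict of parameters whose values risk large design errors."""
    warnings = {}
    for name, value in preassigned.items():
        lo, hi = training_ranges[name]
        band = margin * (hi - lo)            # edge band, e.g. 5% of the range
        if not (lo <= value <= hi):
            warnings[name] = "outside training range: expect large errors"
        elif value < lo + band or value > hi - band:
            warnings[name] = "near range boundary: accuracy may degrade"
    return warnings

# Hypothetical training ranges and a design request echoing Table 18.
ranges = {"phi_Mn_kNm": (500.0, 17000.0), "fc_MPa": (20.0, 50.0)}
flags = check_design_inputs({"phi_Mn_kNm": 35000.0, "fc_MPa": 30.0}, ranges)
```

A guard of this kind turns the silent extrapolation failure of Table 18 into an explicit warning before any prediction is made.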

Data updating procedures
Figure 12 clearly indicates that the performance of AI models relies heavily on the training data. Therefore, an AI-based design method shall provide sufficient flexibility for changes of design ranges and regulations, because it is impractical to generate big data covering all ranges of parameters and codes. In the proposed GPR-CRS training and designing procedures, all modifications in design requirements and parameters must be reflected in the training big data. The results in Figure 13 show that the introduced processes still provide reliable accuracies (less than 0.7% errors) when the design code is changed from ACI 318-19 (Standard 2019) to ACI 318-14 (Standard 2014).

Conclusion
In this study, comprehensive designs of doubly RC beams satisfying preassigned flexural capacities and ductility are provided by CRS-based GPR models and BS procedures. Feature selection algorithms are used in a greedy algorithm to determine reasonable input combinations and output sequences. (6) The proposed models should not be applied to extrapolation designs, as considerable errors are expected when out-of-range inputs are given. In addition, users should also be aware of input conflicts, which occur when multiple preassigned parameters violate the correlations between indexes.

Notes on contributors
Dr. Won-Kee Hong is a Professor of Architectural Engineering at Kyung Hee University. Dr. Hong received his Master's and Ph.D. degrees from UCLA, and he worked for Englekirk and Hart, Inc. (USA), Nihon Sekkei (Japan), and Samsung Engineering and Construction Company (Korea) before joining Kyung Hee University (Korea). He also has professional engineering licenses from both Korea and the USA. Dr. Hong has more than 30 years of professional experience in structural engineering. His research interests include new approaches to construction technologies based on value engineering with hybrid composite structures. He has provided many useful solutions to issues in current structural design and construction technologies as a result of his research combining structural engineering with construction technologies. He is the author of numerous papers and patents, both in Korea and the USA. Currently, Dr. Hong is developing new connections that can be used with various types of frames, including hybrid steel-concrete precast composite frames, precast frames, and steel frames. These connections would also contribute to the modular construction of heavy plant structures and buildings. He recently published a book titled "Hybrid Composite Precast Systems: Numerical Investigation to Construction" (Elsevier).
Tien Dat Pham received a Master's degree from Seoul National University of Science and Technology.He is currently enrolled as a Ph.D. student in the Department of Architectural Engineering at Kyung Hee University, Republic of Korea.His research interest includes structural analyses using numerical methods considering material plasticity.Furthermore, his studies also involve developing optimal designing methods based on machine learning and genetic algorithms.

Figure 1 .
Figure 1. Feature selection based on feature scores.

Figure 2 .
Figure 2. Training results based on different feature selection algorithms when b is output.

Figure 3 .
Figure 3. Training results based on different feature selection algorithms when h is output.

Figure 4 .
Figure 4. Training results based on different feature selection algorithms when ρ t is output.

Figure 5 .
Figure 5. Training results based on different feature selection algorithms when ρ c is the output.

Figure 7 .
Figure 7. CRS sequences based on feature scores.

Figure 10 .
Figure 10. Statistical results of multiple designs.

Figure 9 .
Figure 9. Illustration of the beam designed in Table 17(a).

Figure 11 .
Figure 11. Design errors of AI at various preassigned moment capacities.

Table 2 .
Example of feature selection. (a) 500 randomly generated datapoints.

Table 3 .
Features and their scores of the 12 parameters based on the F-test method.

Table 4 .
Features and their scores of the 12 parameters based on the NCA selection method.

Table 5 .
Data for CRS example.
• Step 1(a): Assuming that b is the output for Step 1, the eight available inputs (ØM n , M u , µ Ø , L, f y , f c ', M D , M L ) are scored. The finest results from the eight combinations of inputs are compared in Table 8(a), indicating the best training quality when b is assumed as the output of Step 1.

Table 8 .
Trial training for determining the input combination when b is the output of Step 1.

Table 9 .
Trial training for determining the input combination when h is the output of Step 1.

Table 10 .
Trial training for determining input combination when ρ t is the output of Step 1.

Table 12 .
Trial training for determining the output for Step 1 (summarizing the results of Tables 8-11).

Table 13 .
Trial training for determining output for Step 2.

Table 14 .
Trial training for determining the output for Step 3.

Table 15 .
Final training scheme for CRS.

Table 16 .
Training results of models for designs.


Table 18 .
Design using out-of-range moment values.
The CRS technique is implemented in the training and design processes, employing the outputs of one step as inputs for its successors. In doing so, more constraints are added after each prediction, enhancing the accuracy as training continues. (5) A greedy algorithm comprising multiple trial trainings based on feature selection scores is proposed, determining the most appropriate input combinations, sequences, and GPR kernel functions for the regression models.