Ranking structural analysis software applications using AHP and Shannon’s entropy

ABSTRACT This paper compares the rankings of ten structural analysis software applications in terms of six factors: Standardization, Reliability, Longevity, Usability, Price, and Functionality. The study surveyed structural design engineers from various countries around the world, collecting their opinions on the relative importance of the six factors. The respondents were also asked to score ten structural analysis programs on each of the six factors. The factor weights were derived using two methods: the Analytic Hierarchy Process (AHP) and a hybrid method that combines AHP with Shannon's entropy. The weighted average of the scores was then used to rank the programs by preference. The results indicate that the factors of most concern to users are Reliability and Functionality, while Price was of least concern. Significant differences in preferences were also found between certain groups based on location and years of experience. The programs can be classified into three groups: one program that is highly favored, a set of programs that are "Above Average", and a set of programs that are "Average".


Introduction
Computer-Aided Design (CAD) has become a common tool for engineering professionals. There are several programs available on the market for structural analysis and design, with various degrees of functionality and compatibility. Users may use more than one program for analysis, whether it is because of the type of structure, the available software provided by their employer, or personal preferences.
The typical licensing options for these programs tend to be prohibitively expensive for individual users, which may result in users having to communicate with the software developer through their Information Technology or Procurement departments. As such, a software package's market share might not be an accurate indicator of user preferences. The motivation for this study was to support structural analysis software procurement decisions that take the end-user into consideration, as well as to provide software developers with a global view of end-user needs and sentiments.
The objectives of this study were on two levels: identifying end-users' priorities in structural analysis software, and comparing the performance of current products on the market against those priorities. This would not only indicate program desirability among users but also shed light for developers on user priorities.
The authors selected ten structural analysis applications based on user feedback posted across the internet, journal articles, and reviews in civil engineering forums. The evaluation criteria were selected based on the factors that were commonly used in various areas of research to analyze the ranking of other software packages.

Literature review
The earliest historical evidence of structural engineering can be seen in the great pyramids of Egypt, around 2700 BC (Kirby 1957). Many scientists and architects such as Archimedes, Galileo, Hooke, and Newton carefully examined and improved this field through later historic periods. The mid-20th century saw the introduction of finite element analysis (Turner et al. 1956), a computationally intensive method of analyzing complex structures. With the advent of computers, finite element analysis became a practical method to analyze complex structures using software (MacNeal and McCormick 1971). Today, there are numerous software packages available for the analysis and design of various structural members, some more limited in scope than others. The authors compiled a "top ten" list of commercial software packages used in the industry (Table 1) from articles and discussions on websites dedicated to the topic (Structural Engineering Blog, n.d.; Topics | The Structural World, n.d.; Uihlein 2013).
While there is no specific literature on identifying evaluation criteria for structural analysis software, there have been similar studies on criteria selection for software applications (Benlian and Hess 2011), vendor selection (Weber, Current, and Benton 1991), and even foreign policy (Ministry of Foreign Affairs of Denmark 2016). One such study (Sullivan, Malave, and Cheekoti 2004) explored the use of microscopic traffic simulation models using the following criteria for the evaluation: "System Requirements", "Ease of Coding", "Data Requirements", "Relevance/Accuracy of Performance Measures Reported in the Output", and "Versatility/Expandability". Another paper (Ruiz 2009) evaluated BIM software using "Preconstruction Criteria", "Construction Criteria", "Post construction Criteria" and "General Criteria". The "General Criteria" consisted of several sub-criteria that could be generalized for software evaluation, such as "Necessary Upgrades to the Company's System", "Operates in Preferred Operating Environment", "Recovery Mechanism Ensures Data Integrity to the Business Function Level", "Cost of the Implementation", "Quality of Help and Supporting Documentation", and "Tutorials and Other Learning Resources".
From the above literature, this study identified six criteria relevant to structural analysis software:
• Standardization: The system uses standard equipment that is reliable, widely available, and applicable to a variety of uses (e.g. the hardware requirements for running the software), as well as compatibility and ease of transferring information to and from other commonly used design software.
• Reliability: The program is reliable in normal use.
• Longevity: The software is updated to remain compatible with operating systems, hardware, standards, and codes.
• Usability: Ease of use and training.
• Price: Cost of the software or its license.
• Functionality: The different types of analysis the software can run, as well as the variety of methods and standards available to the user.
Scoring for factors can be simplified using a Likert scale (Likert 1932). While applying parametric tests to an ordinal scale has been regarded as unsuitable (Jamieson 2004), there has been recent argument claiming the validity of this approach if the data is described properly (Sullivan and Artino 2013). As some criteria might be perceived to be more important than others, the criteria would need to be weighted differently to calculate a net score. One of the common approaches (Bertolini, Braglia, and Carmignani 2006; Davies 1994; Russo and Camanho 2015; Wang et al. 2018) is the Analytic Hierarchy Process (AHP) (Saaty 1990), where each criterion is weighted in pairwise relative importance with respect to the others in a matrix. The resulting weights would then be checked by comparing the Consistency Ratio (CR) to a threshold of 0.1. The pairwise weighting can use multiple evaluations through a group survey to minimize bias (Albayrak and Erensal 2004; Basak 2002; Zhang, Chen, and Chong 2004). Moreover, the decision and evaluation process itself (Andijani 1998; Melón, Aragonés Beltran, and Carmen González 2008; Podgórski 2015; Vidal, Marle, and Bocquet 2011) can lead to prioritization of given factors (Kumar and Dash 2014; Weber 1993) due to the subjective nature of assigning the weights.
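As an illustration of the AHP mechanics described above, the sketch below derives criterion weights from a pairwise comparison matrix via the principal eigenvector and checks the Consistency Ratio against Saaty's 0.1 threshold. The 3×3 matrix is a toy example, not the study's survey data:

```python
import numpy as np

# Illustrative 3x3 pairwise comparison matrix (toy values, not the paper's data).
# A[i][j] = how many times criterion i is more important than criterion j.
A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])

def ahp_weights(A):
    """Return (weights, consistency_ratio) from a pairwise comparison matrix."""
    vals, vecs = np.linalg.eig(A)
    k = np.argmax(vals.real)               # principal eigenvalue lambda_max
    w = np.abs(vecs[:, k].real)
    w /= w.sum()                           # normalize weights to sum to 1
    n = A.shape[0]
    ci = (vals[k].real - n) / (n - 1)      # Consistency Index
    # Saaty's Random Index for matrix sizes n = 1..10
    ri = [0, 0, 0.58, 0.90, 1.12, 1.24, 1.32, 1.41, 1.45, 1.49][n - 1]
    cr = ci / ri if ri else 0.0            # Consistency Ratio
    return w, cr

w, cr = ahp_weights(A)                     # cr < 0.1 means acceptable consistency
```

For a near-consistent matrix like the one above, the Consistency Ratio comes out well below 0.1, so the derived weights would be accepted.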
Another, earlier method of weighting criteria was developed by Claude Shannon (1948) and has the advantage of producing objective weights. However, this method requires the assignment of a probability distribution for each factor. Al-Aomar (2010) developed an AHP-Shannon entropy hybrid model that combines subjective and objective weights to produce an adjusted value. Current studies in multi-criteria decision-making include the use of fuzzy-AHP (Khan, Dweiri, and Chaabane 2016) and fuzzy-Shannon's entropy (Khan et al. 2018).
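A minimal sketch of entropy-based objective weighting and one common way to combine it with subjective AHP weights (a multiplicative normalization; the exact combination in Al-Aomar's model may differ). The score matrix and AHP weights are toy values, not the study's data:

```python
import numpy as np

# Toy score matrix: rows = alternatives, columns = criteria (all entries > 0).
X = np.array([[4.0, 3.0, 5.0],
              [3.0, 4.0, 2.0],
              [5.0, 2.0, 4.0]])

def entropy_weights(X):
    """Objective criterion weights from Shannon's entropy of each column."""
    P = X / X.sum(axis=0)                            # per-criterion distribution
    m = X.shape[0]
    E = -(P * np.log(P)).sum(axis=0) / np.log(m)     # normalized entropy, in [0, 1]
    d = 1 - E                                        # degree of diversification
    return d / d.sum()                               # normalize to sum to 1

def hybrid_weights(w_subjective, w_objective):
    """Combine subjective (AHP) and objective (entropy) weights multiplicatively."""
    w = w_subjective * w_objective
    return w / w.sum()

w_ent = entropy_weights(X)
w_adj = hybrid_weights(np.array([0.5, 0.3, 0.2]), w_ent)
```

Criteria on which the alternatives' scores differ more (lower entropy) receive larger objective weights, which is how the hybrid model shifts emphasis relative to pure AHP.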

Methodology
The aim of this study was twofold: identify users' priorities in structural analysis software across six factors (Standardization, Reliability, Longevity, Usability, Price, and Functionality), and survey how well each program (among the list of ten) performs on each factor. The study surveyed 147 professional structural designers from various countries around the world (Tables 2 and 3) through an online survey. The first part of the survey was a pairwise comparison between each of the six factors, asking the respondents to rate how the factors compared to each other in importance. The respondents were given a scale of nine options, the central being "both factors are equally important", and increasing in favor of one factor being more important than the other by two, three, four, or five times. The second part of the survey collected respondents' ratings of how well each of the ten software applications performed with respect to each of the factors. Each program was rated on a scale from poor to excellent (Poor, Below average, Average, Above average, Excellent, or N/A), with an "N/A" option implying unfamiliarity with the software.
Of the 147 complete responses collected, a random two-thirds (98 responses) were used for analysis, and the remaining third (49 responses) formed a validation set.
The study used the non-parametric Chi-squared test to identify whether there was any significant preference in the responses for each of the questions (using a p-value of 0.05 for significance). The aim was to avoid falsely centralizing a distribution of responses that had no clear preference (i.e. was similar to random). The study also analyzed the bimodality of responses through a chi-squared test that excluded the "Average" or central response.
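The preference test above amounts to comparing response counts against a uniform ("no preference") distribution. The sketch below computes the chi-squared statistic by hand on hypothetical 5-point response counts (not the survey data) and compares it to the tabulated 0.05 critical value for 4 degrees of freedom:

```python
import numpy as np

def chi2_uniform_stat(counts):
    """Chi-squared statistic of observed counts against a uniform expectation."""
    counts = np.asarray(counts, dtype=float)
    expected = counts.sum() / len(counts)            # same expected count per cell
    return ((counts - expected) ** 2 / expected).sum()

# Hypothetical counts for a 5-option question (heavily centralized responses).
obs = [5, 8, 30, 9, 6]
stat = chi2_uniform_stat(obs)

# Tabulated chi-squared critical value at p = 0.05 with df = 5 - 1 = 4.
crit_05_df4 = 9.488
significant = stat > crit_05_df4                     # True -> distribution differs from uniform
```

The bimodality check described above would simply drop the central count (`obs[2]`) and repeat the same test on the remaining four cells with 3 degrees of freedom.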
Two models were chosen in order to validate each other: AHP was chosen for its established popularity in research pertaining to the comparison between multiple criteria, while the hybrid Shannon's entropy model was used to provide validation for the AHP model through its objective derivation of criteria weights. Likewise, AHP was used as a cross-check for a novel and practical application of the hybrid Shannon's entropy model.
The averages of the responses comparing the relative importance of the six factors were used to develop the factor weights by means of AHP and the hybrid AHP-Shannon's entropy models. The average of the responses scoring each program on each of the factors was then used to calculate the weighted score of each program and rank them accordingly.
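The ranking step reduces to a weighted average: each program's total is the dot product of its per-factor mean scores with the factor weights. A sketch with toy numbers (three programs, three factors; not the study's data):

```python
import numpy as np

# Toy mean scores on a 1-5 scale: rows = programs, columns = factors.
scores = np.array([[4.2, 4.5, 3.9],
                   [3.8, 4.0, 3.5],
                   [3.5, 3.6, 3.4]])

# Toy factor weights (e.g. from AHP or the hybrid model); must sum to 1.
weights = np.array([0.5, 0.3, 0.2])

totals = scores @ weights        # weighted score per program
ranking = np.argsort(-totals)    # program indices, best first
```

Swapping in the hybrid AHP-entropy weights only changes the `weights` vector; the scoring and ranking machinery stays the same, which is what allows the two models to cross-validate each other.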

Analysis
A chi-squared test was used on the analysis set to test for significant preference and bimodality in the responses on the Average Relative Importance (ARI) between factors (Table 4). The test for preference significance compared the number of responses for each degree of importance against a uniform distribution. The bimodality test excluded the middle option ("both factors are equally important") and then checked for significant preference between the responses that indicated any degree of importance of one factor over the other. Table 4 indicates that all responses show a degree of preference, but some responses do not show any statistically significant preference when only considering responses from participants with more than ten years of experience. There is even less significance of preference from respondents with less than ten years of experience. When excluding the central option ("both factors are equally important") to account for central tendency and bimodality, the results also indicated some factors lacked significant preference.
Similarly, a chi-squared analysis was conducted on the responses evaluating the performance of each software package for each factor. All the factors for all the programs showed significant preference (i.e. the distribution of responses was different from a uniform distribution). However, to control for central bias, a chi-squared test was conducted for the significance of responses above and below the central option (to test whether the sum of the "Poor" and "Below average" responses was statistically different from the sum of the "Above average" and "Excellent" responses). Table 5 indicates that PROKON, RAM Structural, and RISA did not have any significant preference above or below average. A Mann-Whitney U-test was then conducted on the data set grouped by years of experience (<10 years vs ≥10 years). While there was no significance between the groups regarding the relative importance between factors, certain criteria for ETABS, SAFE, and RISA showed significance between the groups (Table 6). Further analysis indicated that respondents with less than 10 years of experience scored those criteria at least one degree of preference higher than respondents with more than 10 years of experience.
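The Mann-Whitney U statistic used for the experience-group comparison can be sketched by direct pair counting (each tie contributes one half). The score lists below are hypothetical, not the survey responses:

```python
def mann_whitney_u(x, y):
    """U statistic for group x vs group y by counting pairwise wins (ties = 0.5)."""
    return sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)

# Hypothetical ordinal scores (1-5 scale) from two experience groups.
younger = [3, 4, 5]
older = [1, 2, 3]
u1 = mann_whitney_u(younger, older)   # wins for the first group
u2 = mann_whitney_u(older, younger)   # wins for the second group; u1 + u2 = n1 * n2
```

In practice the statistic would be converted to a p-value (e.g. via a normal approximation or exact tables); that step is omitted here for brevity.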
The Kruskal-Wallis H test was then conducted on the analysis set to determine if there was any significance across responses grouped by location. Table 7 summarizes the significant factors and indicates that respondents from some locations had significantly different opinions on Robot Structural, RISA, S-Concrete, and RAM Structural, with Iceland, the UK, Germany, and the USA tending to provide higher scores. The average score for each program on each criterion was computed by assigning a numerical scale to the ratings (1 = Poor, 5 = Excellent). Table 8 presents the mean score for each program's evaluation criteria.
The authors used the Average Relative Importance (see Table 4) from the factor comparison responses to conduct the AHP analysis. The weights of the analysis set had a Consistency Ratio of 0.014, while the validation set resulted in similar weights with a Consistency Ratio of 0.025. Both Consistency Ratios were acceptable (less than the threshold of 0.1). The weights derived using the AHP-Shannon's entropy hybrid model were different, with noticeably less weight given to Reliability and Price, and more given to Functionality (Table 9).
Table 4. Chi-squared test p-values for a preference in the relative importance between factors (statistical significance of p < 0.05 in bold).

Results
The score for each software application was calculated as the sum of the products of the derived weights and the factor scores for each program. Table 10 shows the ranking of the programs using four different methods. The first three use the factor weights derived from the AHP method, while the last uses the factor weights derived from the AHP-Shannon hybrid method. The first method, "By Mean", uses the mean scores (Table 8) with the average-score AHP weights (Table 9). The second method, "By Median", is used to better represent the ordinal nature of the scores, with the AHP weights recalculated using the median values. The third method, "Excluding Average", uses the mean of the scores but excludes the central value (the "Average" choice on the scale of Poor, Below average, Average, Above average, and Excellent) in an attempt to adjust for central bias, and the AHP weights are recalculated using these values. While this would make the significance of non-central choices more pronounced, it would not prevent the centralization of the result in bimodal distributions. The fourth method uses the mean of the scores (Table 8) and the AHP-Shannon hybrid weights (Table 9).
In all methods, there was a noticeable pattern:
• SAP2000 was noticeably ahead of the other programs,
• ETABS, SAFE, ADAPT, and STAAD.Pro scored closely and above average,
• S-Concrete scored slightly above average,
• Robot Structural, RISA, PROKON, and RAM Structural scored close to the average.
The analysis was conducted on the validation data set and produced similar results (Table 11).

Discussion
The factor weights derived by AHP indicated respondents highly valued Functionality and Reliability, while Price was of least importance. The weights derived using the AHP-Shannon's entropy hybrid model were different, with more pronounced importance given to Functionality, somewhat more importance given to Longevity, and little importance given to Price (Table 9). Future studies should investigate the extent of involvement that structural design engineers have in their company's procurement decisions for structural analysis software, whether there is sensitivity to Price for other personnel involved in the procurement process, and to what extent this has an impact on the choice of software used. Another point for further study is that Standardization did not stand out as an important criterion, even though SAP2000, ETABS, and SAFE are all provided by the same company, and Robot Structural is provided by the company that offers the popular drafting program AutoCAD (Senagala 2004). ETABS garnered a noticeable difference in responses between respondents with less than 10 years of experience and those with more than 10 years of experience (Table 6), with the former group scoring the criteria at least one degree of preference higher than the latter group. This warrants further investigation to identify the reasons for this distinction. On the other hand, comparing the responses across groups by location did not show any significance in responses for ETABS. Instead, significant differences appeared in Robot Structural, S-Concrete, RISA, and RAM Structural. In particular, respondents in the USA scored Robot Structural significantly higher in comparison with respondents in Saudi Arabia, Turkey, and the UAE (Table 7). This should be investigated in future studies.
The ranking of the programs using the validation set closely matched those of the analysis set. All four methods of analysis produced similar rankings, despite the AHP-Shannon's hybrid method having some different weights. The ranking of the latter four programs in conjunction with the results of the non-central significance test (Table 5) indicates that there is a strong central bias for these programs. Moreover, less than 4% of respondents selected the "N/A" score option, leading the authors to suspect that respondents tended to score "Average" for a program they were unfamiliar with instead of scoring it as "N/A". Furthermore, none of the respondents left any comments in the survey, likely because the comments field was optional and placed at the end of the survey.

Conclusions
The criteria weights for evaluating structural analysis programs indicated that respondents valued Functionality the most and Price the least. Notable differences in responses between groups are that younger professionals tended to score ETABS one rank higher than more experienced structural engineers, and that respondents in the USA scored Robot Structural significantly higher than respondents in the Middle East. Ranking by the AHP and AHP-Shannon hybrid models produced similar results that matched the validation data set results. The ten programs can be ranked in three distinct categories: significantly above average (SAP2000), above average (ETABS, SAFE, ADAPT, STAAD.Pro, and S-Concrete), and average (Robot Structural, RISA, PROKON, and RAM Structural). The latter group had significant central tendencies and should be investigated in future studies that would better identify which programs respondents are unfamiliar with. The results of this study could provide structural analysis software developers and software procurement managers with a global view of end-user priorities, as well as of end-user perceptions of the current products on the market.

Data availability statement
All data, models, or code that support the findings of this study are available from the corresponding author upon reasonable request.

Disclosure statement
No potential conflict of interest was reported by the authors.