Theoretical and empirical validation of software trustworthiness measure based on the decomposition of attributes

From the perspective of attribute decomposition, there are a variety of software trustworthiness metric models. However, little attention has been paid to using more rigorous methods and to performing theoretical validation. Axiomatic methods formalise the empirical understanding of software attributes by defining ideal metric properties, and they offer precise terms for quantifying those attributes. We have previously used them to assess software trustworthiness on the basis of attribute decomposition, presented four properties, and constructed a software trustworthiness measure (STMBDA for short). In this paper, we extend the set of properties: we introduce two new properties, namely non-negativity and proportionality, and refine substitutivity and expectability. We verify the theoretical rationality of STMBDA by demonstrating that it conforms to the new property set, and its empirical validity by evaluating the trustworthiness of 23 pieces of spacecraft software. The validation results show that STMBDA can effectively assess spacecraft software trustworthiness and identify weaknesses in the development process.


Introduction
Software is increasingly ubiquitous in our everyday lives and plays an indispensable role in the functioning of our society. A system failure caused (directly or indirectly) by software can have very severe consequences, resulting not only in monetary, time, or property losses, but also in casualties (Wong et al., 2010, 2017). As a result, software trustworthiness has attracted widespread attention in recent years, and its measurement has become a hot topic among researchers (He et al., 2018; Maza & Megouas, 2021). Software trustworthiness can be represented through many attributes (Chen & Tao, 2019; Gupta et al., 2021; Steffen et al., 2006); in this paper, these attributes are referred to as trustworthy attributes. Trustworthy attributes are often at levels that cannot be measured directly, so they are further decomposed into sub-attributes. Many software trustworthiness metric models based on attribute decomposition have been established. However, few studies have focused on more rigorous methods for measuring software trustworthiness and on theoretically validating these measures. Rigorous measurement of software trustworthiness can assist with its assessment and improvement. Theoretical validation is a necessary activity for defining meaningful metric models and a required step before their empirical validation (Srinivasan & Devi, 2014). Theoretical validation methods can be divided into two categories (Srinivasan & Devi, 2014): one is based on measurement theory (Briand, Emam, et al., 1996; Zuse & Bollmann-Sdorra, 1991), and the other on axiomatic approaches. Axiomatic approaches, which formally describe the empirical understanding of a software attribute by defining desired metric properties, have been used to measure internal attributes such as size, inheritance and complexity (Meneely et al., 2012).
To make software trustworthiness measurement more rigorous, we previously used axiomatic approaches to assess software trustworthiness on the basis of attributes. We then extended this work to evaluate software trustworthiness from the viewpoint of attribute decomposition: we gave the expected properties of software trustworthiness measurement under attribute decomposition, including monotonicity, acceleration, sensitivity, substitutivity and expectability, and established a software trustworthiness measure (STMBDA) (Tao et al., 2015). In this paper, we complete the property set and introduce two new properties, namely non-negativity and proportionality. The reasons for introducing these two properties are as follows. The measurement result of software trustworthiness cannot be negative, and non-negativity describes this requirement. Software trustworthiness is the user's subjective recognition of the objective quality of software, so its quantification needs to truly reflect users' approval. For software to be trusted, no individual attribute (sub-attribute) value should be too low; proportionality characterises this situation. At the same time, substitutivity and expectability are improved. The improved substitutivity adds two constraints to better depict substitution. The first is that substitution between critical and non-critical attributes should be harder than both substitution between critical attributes and substitution between non-critical attributes. The second is that substitution between sub-attributes should be harder than that between attributes. The improved expectability adds user expectations related to sub-attributes. We verify STMBDA's theoretical rationality by proving that it satisfies the new property set, and its empirical validity by using it to measure the trustworthiness of 23 pieces of spacecraft software. Compared with some established software trustworthiness metric models, STMBDA is better able to evaluate software trustworthiness.
The rest of the paper is organised as follows. We present related work in Section 2. The expected measurement properties from the viewpoint of attribute decomposition are described in Section 3, including the properties presented in Tao et al. (2015), the newly introduced properties and the improved properties. STMBDA is introduced in Section 4, where we also carry out its theoretical validation. We give the STMBDA-based measurement process in Section 5 and conduct an empirical validation of STMBDA through a real case in Section 6. The comparative study is presented in Section 7. We end the paper with the conclusion and future work in the last section.
Next, we select some of the typical models mentioned above for a detailed introduction. Shi et al. (2008) first build a software dependability indicator system on the basis of the existing research results and then give a software dependability metric method by combining AHP and the fuzzy synthesised evaluation method. Ding et al. (2012) develop a software trustworthiness evaluation model on the basis of evidential reasoning. In this method, they also introduce two discounting factor estimation approaches to measure the reliability degree of evaluation result. Medeiros et al. (2020) utilise machine learning algorithms to obtain the knowledge related to vulnerabilities from software metrics extracted from source code of some representative software projects. Devi et al. (2019) analyse the effects of class imbalance and class overlap in traditional learning models, and their research results can be used to classify data sets of software trustworthiness assessment with class imbalance. Xu et al. propose a QoS prediction model. In this model, neural networks and matrix factorisation are combined to perform non-linear collaborative filtering on the latent feature vectors of users and services (Xu et al., 2021). Lian and Tang (2022) give an API recommendation method on the basis of neural graph collaborative filtering technique. The experimental results show that the performance of this method is better than the most advanced methods in API recommendation. Falcone and Castelfranchi (2002) study the relationship between trust and control. Muhammad et al. (2018) give the software trustworthiness rating strategy, which utilises the system test execution completion score to measure software trustworthiness. Yang et al. present a Social-to-Software software trustworthiness framework. 
This framework consists of a generalised index loss, the ability trustworthiness measurement solution, basic standard trustworthiness measurement solution and identity trustworthiness measurement solution (Yang et al., 2018). Wang et al. establish an updating model of software component trustworthiness. The component's trustworthy degree is calculated according to users' feedback, the updating weight is obtained based on the number of users, and the final trustworthy degree of the system is computed by the Euler distance (B. H. Wang et al., 2019). Xie et al. (2022) present an approach to cover specifications of nominal behaviour and security, in which combined validation is utilised to verify the SysML models' nominal behaviour and Fault Tree Analysis for security analysis.

Properties of software trustworthiness measures based on attribute decomposition
Trustworthy attributes are divided into critical and non-critical attributes. Critical attributes are the trustworthy attributes that the software must have, and the others are called non-critical attributes (Tao & Chen, 2009). Using the same notation as Tao and Chen (2009), denote the values of the critical attributes by $y_1, \ldots, y_m$ and the values of the non-critical attributes by $y_{m+1}, \ldots, y_{m+s}$. Assuming that $n$ sub-attributes form the trustworthy attributes, denote their values by $x_1, \ldots, x_n$, respectively. Let $y_i$ ($1 \le i \le m+s$) be attribute measure functions of $x_1, \ldots, x_n$, and let $T$ be the software trustworthiness measure function of $y_i$ ($1 \le i \le m+s$). The expected measure properties in the view of attribute decomposition are as follows. Among them, monotonicity, acceleration and sensitivity are properties that were proposed in Tao et al. (2015), non-negativity and proportionality are newly introduced properties, and substitutivity and expectability are improved properties.
(1) Non-negativity
Non-negativity implies that the evaluation result of software trustworthiness is non-negative.
(2) Proportionality
Proportionality refers to the assumption that there should be an appropriate proportion between attributes (sub-attributes). For example, supposing that the critical attributes of a certain type of software consist of resilience and survivability, high-confidence software of this type requires both good resilience and high survivability. Very good resilience with low survivability, or very high survivability with poor resilience, is not appropriate. Similar reasons apply to the proportional suitability of sub-attributes.
(3) Monotonicity (Tao et al., 2015)
Monotonicity means that an increase in an attribute value does not decrease the software's trustworthy degree, and an increase in a sub-attribute value does not lower the attribute value.
(4) Acceleration (Tao et al., 2015)
Acceleration characterises the rate of change of an attribute (sub-attribute). When only one attribute $y_i$ is increased and the other attributes $y_j$ ($j \neq i$) remain unchanged, the efficiency of using attribute $y_i$ decreases. Similar explanations hold for the sub-attributes.
(5) Sensitivity (Tao et al., 2015)
Here $w_i$ is the weight of the $i$th attribute, $w_{i,k}$ is the weight of the $k$th sub-attribute that constitutes the $i$th attribute, $f_1$ is a function of $y_i$ and $w_i$, and $f_2$ is a function of $x_k$ and $w_{i,k}$. Sensitivity represents the percentage change in software trustworthiness (attribute value) caused by a percentage change in the attribute value (sub-attribute value). The sensitivities should be non-negative and associated with the attributes (sub-attributes) and their weights. Furthermore, the software trustworthiness should be more sensitive to the smallest critical attribute relative to its weight. The reason is that improving this attribute can greatly increase the software trustworthiness; therefore, a percentage change in its value results in a relatively larger percentage change in the software trustworthiness.
(6) Substitutivity
$\sigma_{rt}$ is used to indicate the difficulty of substitution between attributes, with $0 \le \sigma_{rt} \le \infty$. The smaller $\sigma_{rt}$ is, the harder it is to substitute between the $r$th and $t$th attributes. $\sigma_{ikl}$ is used to represent the difficulty of substitution between the sub-attributes that constitute the $i$th attribute. Similarly, $\sigma_{ikl}$ satisfies $0 \le \sigma_{ikl} \le \infty$, and the smaller $\sigma_{ikl}$ is, the harder it is to substitute between the $k$th and $l$th sub-attributes. Property 1 and property 2 indicate that the attributes can be substituted for each other to a certain extent, and so can the sub-attributes. Property 3 states that substitution between critical and non-critical attributes should be harder than both substitution between critical attributes and substitution between non-critical attributes. Property 4 means that substitution between sub-attributes should be harder than that between attributes.
(7) Expectability
Here $x_0$ and $y_0$ are the user's minimum expected values for sub-attributes and attributes, respectively. Expectability implies that if all sub-attributes (attributes) meet the user's expectations, then the attribute trustworthiness (software trustworthiness) should also meet the user's expectations and be less than or equal to the maximum value of all sub-attributes (attributes).
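Several of the properties above can be spot-checked numerically for any candidate measure. The following is a minimal sketch, not the validation method of this paper: it uses finite differences at a single point, it reads acceleration as a non-positive second difference (our interpretation of "decreasing efficiency"), and it uses a weighted product of power functions only as a stand-in example. The names `weighted_geometric_mean`, `check_properties` and `expected_min` are hypothetical.

```python
import math
from typing import Callable, Dict, Sequence

Measure = Callable[[Sequence[float]], float]


def weighted_geometric_mean(weights: Sequence[float]) -> Measure:
    """Stand-in measure (product of power functions); this is NOT STMBDA."""
    def measure(values: Sequence[float]) -> float:
        return math.prod(v ** w for v, w in zip(values, weights))
    return measure


def check_properties(measure: Measure, values: Sequence[float],
                     expected_min: float, step: float = 1e-2,
                     tol: float = 1e-12) -> Dict[str, bool]:
    """Spot-check four properties at one point using finite differences."""
    base = measure(values)
    # Expectability only constrains the result when every input meets the
    # user's minimum expectation.
    premise = min(values) >= expected_min
    report = {
        "non_negativity": base >= 0.0,
        "expectability": (not premise)
        or (expected_min - tol <= base <= max(values) + tol),
        "monotonicity": True,
        "acceleration": True,
    }
    for i in range(len(values)):
        up = list(values)
        up[i] += step
        down = list(values)
        down[i] -= step
        f_up, f_down = measure(up), measure(down)
        # Monotonicity: raising one input must not lower the result.
        if f_up < base - tol or base < f_down - tol:
            report["monotonicity"] = False
        # Acceleration (assumed reading): the second difference is <= 0.
        if f_up - 2.0 * base + f_down > tol:
            report["acceleration"] = False
    return report


if __name__ == "__main__":
    measure = weighted_geometric_mean([0.5, 0.3, 0.2])
    print(check_properties(measure, [8.0, 7.0, 9.0], expected_min=7.0))
```

A check of this kind can only refute a property at the sampled point; it does not replace the proofs in Section 4.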

A software trustworthiness measure based on the decomposition of attributes and its theoretical validation
In this section, we introduce the software trustworthiness measure on the basis of the attribute decomposition constructed in Tao et al. (2015) and theoretically validate it by demonstrating that it conforms to the properties described in Section 3.
(5) $\alpha_i$ ($1 \le i \le m$) are the weight values of the critical attributes, satisfying $\sum_{i=1}^{m} \alpha_i = 1$ and $0 \le \alpha_i \le 1$;
(6) $\beta_j$ ($m+1 \le j \le m+s$) are the weight values of the non-critical attributes, with $\sum_{j=m+1}^{m+s} \beta_j = 1$ and $0 \le \beta_j \le 1$;
(7) $w_{i,k}$ ($1 \le i \le m+s$, $1 \le k \le n$) are the weight values of the sub-attributes making up the $i$th attribute, with $\sum_{k=1}^{n} w_{i,k} = 1$ and $0 \le w_{i,k} \le 1$; if the $k$th sub-attribute is not one of the sub-attributes forming the $i$th attribute, set $w_{i,k} = 0$;
(8) $\epsilon$ is used to control the effect of the smallest critical attribute on the software trustworthiness, with $0 \le \epsilon \le \min\left\{1 - \alpha_{\min}, \frac{\ln y_0 - \ln y_{\min}}{\ln y_{\min} - \ln 10}\right\}$, where $y_0$ is a value that all attributes must achieve and $\min$ is the $i$ with $\min_{1 \le i \le m}\{y_i\}$;
(9) $\rho$ is a parameter associated with the substitutivity between critical and non-critical attributes, with $0 < \rho$;
(10) $\rho_i$ ($1 \le i \le m+s$) are parameters related to the substitutivity between the sub-attributes that make up the $i$th attribute, satisfying $0 < \rho \le \rho_i$;
(11) $x_k$ ($1 \le k \le n$) are the values of the sub-attributes, with $1 \le \max\{x_0, y_0\} \le x_k \le 10$, where $x_0$ is a value that all sub-attributes must reach.
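The constraints in items (5) to (11) can be checked mechanically before a measurement is run. The sketch below is only an illustration: the function name `constraint_violations`, the argument layout, and the tolerance `tol` are our assumptions, and the parameter of item (8) is written `eps`. When the smallest critical attribute equals 10 the logarithmic bound has a zero denominator, and the sketch then applies only the bound $1 - \alpha_{\min}$, which is also an assumption.

```python
import math
from typing import List, Sequence


def constraint_violations(alpha: Sequence[float], beta: Sequence[float],
                          w: Sequence[Sequence[float]], x: Sequence[float],
                          rho: float, rho_sub: Sequence[float], eps: float,
                          x0: float, y0: float, y_crit: Sequence[float],
                          tol: float = 1e-3) -> List[str]:
    """Return the names of violated constraints (empty list: all hold)."""
    problems = []
    # Items (5) and (6): attribute weights lie in [0, 1] and sum to 1.
    if abs(sum(alpha) - 1.0) > tol or not all(0 <= a <= 1 for a in alpha):
        problems.append("critical weights")
    if abs(sum(beta) - 1.0) > tol or not all(0 <= b <= 1 for b in beta):
        problems.append("non-critical weights")
    # Item (7): each row of w holds the sub-attribute weights of one attribute.
    for i, row in enumerate(w, start=1):
        if abs(sum(row) - 1.0) > tol or not all(0 <= v <= 1 for v in row):
            problems.append(f"sub-attribute weights of attribute {i}")
    # Items (9) and (10): 0 < rho <= rho_i.
    if rho <= 0 or any(r < rho for r in rho_sub):
        problems.append("rho")
    # Item (11): 1 <= max{x0, y0} <= x_k <= 10.
    floor = max(x0, y0)
    if floor < 1 or not all(floor <= v <= 10 for v in x):
        problems.append("sub-attribute values")
    # Item (8): upper bound on eps from the smallest critical attribute.
    i_min = min(range(len(y_crit)), key=lambda i: y_crit[i])
    y_min = y_crit[i_min]
    bound = 1.0 - alpha[i_min]
    if y_min < 10:  # assumed handling of the zero denominator at y_min = 10
        ratio = (math.log(y0) - math.log(y_min)) / (math.log(y_min) - math.log(10))
        bound = min(bound, ratio)
    if not 0 <= eps <= bound:
        problems.append("eps")
    return problems
```

The default tolerance is deliberately loose because published weight vectors are usually rounded to four decimal places.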
For convenience, we denote the $i$ with $\min_{1 \le i \le m}\{y_i\}$ by min, the $i$ with $\max_{1 \le i \le m}\{y_i\}$ by max, the $i$ with $\min_{m+1 \le i \le m+s}\{y_i\}$ by min, the $i$ with $\max_{m+1 \le i \le m+s}\{y_i\}$ by max, and let
The proof of Proposition 4.2(2) has been given in Claim 1 of Tao et al. (2015). Here we only give the proof of Proposition 4.2(1).
Substituting $\sum_{k=1}^{n} w_{i,k} = 1$ into the above inequality, we can get
Raising the above inequality to the power $-\frac{1}{\rho_i}$, it follows that
From the definition of STMBDA, we know that
The proofs of Propositions 4.4, 4.5 and 4.6 can be found in Tao et al. (2015).

Proposition 4.7: T meets substitutivity.

Proof: By Equation (2), the substitutivity between the sub-attributes making up the $i$th ($1 \le i \le m+s$) attribute can be determined as follows:
Similarly, through Equation (1), for the substitutivity between critical attributes, it follows that
for the substitutivity between non-critical attributes, we can derive
and the substitutivity between critical and non-critical attributes can be obtained as
and $0 < \rho$,
Similarly, for $j = \min$, $m+1 \le i \le m+s$,
It can be seen from Equation (3) that the sub-attributes can be substituted for each other to a certain extent. According to Equations (4), (5), (6) and (7), attributes can be substituted for each other to a certain extent. By Equations (6) and (7), substitution between critical and non-critical attributes is harder than both substitution between critical attributes and substitution between non-critical attributes, and substitution between sub-attributes is harder than that between attributes. In summary, the proposition is proved.

Proposition 4.8: T satisfies expectability.
Proof: By the definition of STMBDA, we can obtain that
According to Proposition 4.2(1), it follows that
and from the proof of Proposition 4.2(1) we know that $1 \le y_i \le 10$, $1 \le i \le m$.

Measurement process based on STMBDA
The STMBDA-based measurement process is given in Figure 1. For a given software, Step 1 determines the sub-attribute values $x_1, \ldots, x_n$, the sub-attribute weight values $w_{i,k}$ ($1 \le i \le m+s$, $1 \le k \le n$) and the parameter values $\rho_1, \ldots, \rho_{m+s}$. The attribute values $y_i$ ($1 \le i \le m+s$) are then calculated by STMBDA in Step 2. Step 3 computes the critical attribute weight values $\alpha_1, \ldots, \alpha_m$, the non-critical attribute weight values $\beta_{m+1}, \ldots, \beta_{m+s}$, and the parameter values $\rho$ and $\epsilon$. The above weight values can be calculated using the method proposed in Tao and Chen (2012). In Step 4, STMBDA is used to obtain the degree of software trustworthiness from the results of Step 2 and Step 3.

Empirical validation
Empirical validation comprises case studies, surveys and experiments (Fenton & Bieman, 2015). In this section, an empirical validation of STMBDA is conducted through a real case. The trustworthiness of spacecraft software is one of the crucial factors in guaranteeing the success of space missions. However, its current evaluation is only qualitative. To make the measurement of spacecraft software trustworthiness more rigorous, we assess it with STMBDA, which is built on axiomatic approaches. The trustworthy attributes of spacecraft software comprise 9 attributes, subdivided into 28 sub-attributes (J. Wang et al., 2015). The attributes, sub-attributes and the corresponding weight values are described in Table 1 (J. Wang et al., 2015).
The trustworthiness of each sub-attribute is classified into four levels: A, B, C and D. To calculate the attribute values with STMBDA, the levels of sub-attributes are converted to specific values: Level A to 10, Level B to 9, Level C to 7, and Level D to 2. A panel of 10 experts was invited to classify the 28 sub-attributes of the 23 pieces of spacecraft software (J. Wang et al., 2015). Each expert grades the sub-attributes on the four levels A, B, C and D, combining subjective and objective principles. For the specific classification standards, please refer to Section 7.2 of Chen and Tao (2019). We chose 11 representative pieces of software as subjects, numbered 2, 4, 6, 7, 9, 18, 19, 20, 21, 22 and 23. Figure 2 shows the distributions of the sub-attribute values of these 11 pieces of software. In each subgraph of Figure 2, the horizontal axis displays the sub-attribute number and the vertical axis shows the sub-attribute value.
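The level-to-value conversion described above is a fixed lookup, which can be sketched as follows. The names `LEVEL_TO_VALUE` and `levels_to_values` are hypothetical, and the sketch does not cover how the grades of the 10 experts are combined into a single level, since that step is not described here.

```python
from typing import Iterable, List

# Conversion stated in the text: A -> 10, B -> 9, C -> 7, D -> 2.
LEVEL_TO_VALUE = {"A": 10, "B": 9, "C": 7, "D": 2}


def levels_to_values(levels: Iterable[str]) -> List[int]:
    """Convert sub-attribute grades (A to D) to numeric sub-attribute values."""
    values = []
    for level in levels:
        if level not in LEVEL_TO_VALUE:
            raise ValueError(f"unknown level: {level!r}")
        values.append(LEVEL_TO_VALUE[level])
    return values


if __name__ == "__main__":
    print(levels_to_values(["A", "C", "D", "B"]))  # prints [10, 7, 2, 9]
```

Note that the scale is deliberately uneven: the drop from C to D (7 to 2) is much larger than the drop from A to B.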
Given that substitution between the sub-attributes constituting critical attributes should be more difficult than that between the sub-attributes making up non-critical attributes, let $\rho_j < \rho_i$ ($1 \le i \le 4$, $5 \le j \le 9$). For simplicity, set $\rho_1 = \rho_2 = \rho_3 = \rho_4$ and $\rho_5 = \rho_6 = \rho_7 = \rho_8 = \rho_9$; the specific values used in this case are given in Figure 3. The distributions of the attribute values of these 11 pieces of software computed through STMBDA are presented in Figure 3. In each subgraph of Figure 3, the vertical axis represents the attribute number and the horizontal axis displays the attribute value. For comparison with the model given in J. Wang et al. (2015) (referred to as PBSTE3), PBSTE3 is also used to measure the attribute trustworthiness of these 11 representative pieces of software. In PBSTE3, both the trustworthiness measurement model and the attribute measurement model take the form of a product of power functions. Using the sub-attribute weight values given in Table 1, the attribute value distributions calculated by the attribute measurement model in PBSTE3 are shown as the yellow line in Figure 3.
It can be seen from Figures 2 and 3 that the attribute trustworthiness results obtained by STMBDA reflect the actual development of spacecraft software. For instance, since such software is developed according to the GJB5000A standard, software change control for these 11 pieces of software is generally good. Meanwhile, weaknesses in the software development process can be easily identified. For example, these 11 pieces of software generally lack special testing, because the testing and verification technology is not advanced owing to the design of dynamic timing, space, data use, control behaviour, etc. Furthermore, the attribute metric model presented herein is more general than the attribute metric model built in J. Wang et al. (2015): we can adjust software attribute trustworthiness through the parameters $\rho_i$ ($1 \le i \le 9$) in STMBDA. If we have higher trustworthiness requirements for an attribute, we raise the value of $\rho_i$, and vice versa. However, the distributions of sub-attribute values in Figure 2 demonstrate that some grading criteria are too high to be achieved and some are so low that they are too easily met; thus the grading standards of sub-attribute trustworthiness need to be improved in future applications.
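The effect of $\rho_i$ can be illustrated numerically. The attribute formula of STMBDA is not reproduced in this text, so the sketch below assumes a weighted power mean with exponent $-\rho_i$, a form suggested by the "raising to the $-\frac{1}{\rho_i}$ power" step in the proof of Proposition 4.2(1). Both that form and the helper name `attribute_value` are assumptions, and the equal weights are made up for the example.

```python
from typing import Sequence


def attribute_value(x: Sequence[float], w: Sequence[float], rho_i: float) -> float:
    """Assumed form: (sum_k w_k * x_k ** (-rho_i)) ** (-1 / rho_i).

    Sub-attributes with weight 0 do not belong to the attribute and are skipped.
    """
    if rho_i <= 0:
        raise ValueError("rho_i must be positive")
    total = sum(wk * xk ** (-rho_i) for xk, wk in zip(x, w) if wk > 0)
    return total ** (-1.0 / rho_i)


if __name__ == "__main__":
    values = [10, 9, 7, 2]            # one sub-attribute at each level A, B, C, D
    weights = [0.25, 0.25, 0.25, 0.25]  # hypothetical equal weights
    for rho_i in (3, 6):
        print(rho_i, attribute_value(values, weights, rho_i))
```

Under this assumed form, a larger $\rho_i$ pulls the result towards the weakest sub-attribute, which matches the statement that raising $\rho_i$ expresses a stricter trustworthiness requirement.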
For the given parameter values $(\rho_1, \rho_2, \rho_3, \rho_4, \rho_5, \rho_6, \rho_7, \rho_8, \rho_9) = (6, 6, 6, 6, 3, 3, 3, 3, 3)$, the distributions of the trustworthy degrees of these 11 pieces of software calculated through STMBDA are given in Figure 4. The yellow line in Figure 4 is obtained by the software trustworthiness metric model presented in J. Wang et al. (2015) using the weight values presented in Table 1. As can be seen from Figure 4, STMBDA does not greatly raise the trustworthy degree of the software because of a high value of an individual sub-attribute, whereas sub-attributes with lower values reduce the trustworthy degree of the software. That is, for the software to be trusted, each attribute must be trusted to a certain degree. The software trustworthiness measurement model built in J. Wang et al. (2015), by contrast, cannot reflect this situation well. In a similar way, we can adjust the software trustworthiness through the parameter $\rho$ as required. If we have a higher trustworthiness requirement for the software, we raise the value of $\rho$, and vice versa. At the same time, we can regulate the influence of the minimum critical attribute on software trustworthiness by modifying the parameter $\epsilon$. This case demonstrates that STMBDA is suited to measuring spacecraft software trustworthiness: it can effectively assess the trustworthiness of such software and accurately discover vulnerabilities in the development process, which is very important for improving the development level of this type of software.

Comparative study
In the remainder of this section, we compare STMBDA with PBSTE1 (Tao & Chen, 2009), PBSTE2 (Tao & Chen, 2012), PBSTE3 (J. Wang et al., 2015), the evidence theory-based software trustworthiness measure (ERBSTM) (Ding et al., 2012) and the fuzzy theory-based software trustworthiness measure (FTBSTM) (Shi et al., 2008) with respect to the properties described in Section 3. The comparison results are given in Table 2, where × indicates that the measure does not meet the corresponding property and √ indicates that the measure conforms to it.
It is easy to prove that all the metric models comply with non-negativity. PBSTE1, PBSTE2, PBSTE3, ERBSTM and FTBSTM do not take proportionality into account, so none of them meets it. Ding et al. (2012) and Shi et al. (2008) demonstrate that ERBSTM and FTBSTM, respectively, satisfy expectability. J. Wang et al. (2015) show that PBSTE3 conforms to monotonicity, acceleration, sensitivity, substitutivity and expectability.
Next, we give a counter-example to show that neither PBSTE1 nor PBSTE2 satisfies expectability. For a given software, suppose the number of its critical attributes $m$ is 3 and the number of its non-critical attributes $s$ is 2. The weight vector of the critical attributes is $(\alpha_1, \alpha_2, \alpha_3) = (0.5278, 0.3325, 0.1396)$, and that of the non-critical attributes is $(\beta_4, \beta_5) = (0.6667, 0.3333)$. Let $\epsilon = 0.01$ and $(y_1, y_2, y_3, y_4, y_5) = (8, 8, 7, 8, 8)$. Then the trustworthy degree of this software is 6.69 as calculated by PBSTE1 and 6.54 as computed by PBSTE2, both of which are less than the minimum value of all the trustworthy attributes. Thus, neither PBSTE1 nor PBSTE2 complies with expectability.
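The conclusion of the counter-example can be checked directly from the reported numbers. The sketch below does not recompute the two trustworthy degrees, because the formulas of the compared models are not given here; it only tests the reported values 6.69 and 6.54 against the expectability condition. The function name `meets_expectability` is hypothetical, and taking the user's minimum expected value to be 7 (the smallest attribute value) is our assumption.

```python
from typing import Sequence


def meets_expectability(trust: float, attributes: Sequence[float],
                        expected_min: float) -> bool:
    """If all attributes reach expected_min, require expected_min <= trust <= max."""
    if min(attributes) < expected_min:
        return True  # premise not met, so the property imposes no constraint
    return expected_min <= trust <= max(attributes)


if __name__ == "__main__":
    y = (8, 8, 7, 8, 8)
    y0 = 7  # assumed user expectation: every attribute just meets it
    for name, degree in (("PBSTE1", 6.69), ("PBSTE2", 6.54)):
        print(name, meets_expectability(degree, y, y0))  # prints False for both
```

Any expected minimum between 6.69 and 7 leads to the same verdict for both reported values, so the choice of 7 is not essential to the argument.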
The rest of the comparison results can be obtained from the comparative study of Tao and Zhao (2018).
Finally, it can be concluded from Table 2 that STMBDA is superior to all five methods from the perspective of the properties introduced in Section 3.

Conclusion and future work
In this paper, we complete the set of expected properties of software trustworthiness measurement based on attribute decomposition, give two new properties, namely non-negativity and proportionality, improve substitutivity and expectability, and introduce the STMBDA measure given in Tao et al. (2015). We verify the theoretical rationality of STMBDA by showing that it complies with the new property set, and its empirical validity by measuring the trustworthiness of 23 pieces of spacecraft software. We also present a comparative study, which shows that STMBDA outperforms the other five models in terms of the new property set. It should be noted that we give the expected property set only from the perspective of experience; it is reasonable but not complete. In addition, in the empirical validation we specify the parameter values $\epsilon$, $\rho$ and $\rho_i$ in advance and do not give specific algorithms for determining them.
Several issues are worthy of further investigation. Firstly, we believe that the software trustworthiness measure properties given in this paper are necessary, but not sufficient. We are interested in extending and perfecting this set of properties. Secondly, attribute decomposition-based software trustworthiness measures that do not meet the properties described in Section 3 cannot be taken as legitimate measures. However, metrics satisfying these properties should be used only as candidate metrics, and they still need further checking. We have applied STMBDA to spacecraft software trustworthiness measurement, and in the future we will use STMBDA to measure the trustworthiness of other types of software in order to conduct a comprehensive empirical validation. Thirdly, we do not give a way to calculate the parameter values $\epsilon$, $\rho$ and $\rho_i$ in STMBDA; how to determine these parameter values is also important future work.