Estimating Software Reliability Using Size-biased Modelling

Software reliability estimation is one of the most active areas of research in software testing. Since time between failures (TBF) has often been challenging to record, software testing data are commonly recorded test-case-wise in a discrete set-up. We have developed a Bayesian generalised linear mixed model (GLMM) based on software testing detection data and a size-biased strategy, which not only estimates the software reliability but also estimates the total number of bugs present in the software. Our approach provides a flexible, unified modelling framework and can be adapted to various real-life situations. We have assessed the performance of our model via a simulation study and found that each of the key parameters could be estimated with a satisfactory level of accuracy. We have also applied our model to two empirical software testing data sets. While there can be other fields of application for our model (e.g., hydrocarbon exploration), we anticipate that our novel modelling approach to estimating software reliability could be very useful to practitioners and can potentially be a key tool in the field of software reliability estimation.


Introduction
Software is the fuel of the current world. Economy, technology, transport, communication, medical treatment: all of these essential components of our daily lives depend critically on the successful execution of software. Most modern-day devices may not function properly if the software they run carries bugs. Thus, it is not surprising that estimation of software reliability remains a cornerstone in the field of software development and testing [13,15]. Determination of the optimum time for software release remains an interesting field of research [3]. It has also been proposed that software testing data need to be collected more efficiently than in the traditional process [11]. Time between failures (TBF) data have become difficult to collect as the complexity of software development and testing increases. In most cases, the information logged during software testing is test-case specific and consequently discrete in nature. Estimation of the optimum duration of software testing under a discrete set-up has also received considerable attention [2,1,7,5]. In this literature, optimum testing strategies have been developed based on the number of bugs remaining in the software [3,8]. However, if the remaining bugs lie on paths/locations (of the software) that will rarely be traversed by the inputs supplied by users, software failure will also be rare; the number of remaining bugs alone therefore cannot establish that the software is unreliable. Although this phenomenon is plausible and close to reality, it has not yet been systematically studied in the literature.
In order to account for the probability that a remaining bug causes failure of a software, we introduce a latent variable, the 'eventual size of a bug' [1]. The eventual size of a bug is defined as the number of inputs that may eventually pass through the bug during the entire lifetime of the software, irrespective of whether the bug is detected during the testing phase. Occasionally, the eventual size of a bug is referred to simply as 'the size of the bug'. A software system can be considered a collection of several paths, and each input to the software is expected to follow a particular path. In particular, if the same input is used several times, it can only check whether that particular path has any bugs; it cannot check for bugs in other paths, since that input does not traverse them. A software system therefore requires different inputs to check for bugs in different paths. We may assume that an input can identify at most one bug, namely one lying on the path that the input traverses. This size-biased approach was first introduced in software reliability by [1], although the concept had also been applied in a few other fields of investigation [12].

Some terminologies in software testing
Differential sizes of the bugs in paths and sub-paths of a software. It is quite natural that a path in the software branches into several sub-paths at a later stage. All these sub-paths share a common initial segment of the path. Now suppose that one bug is present on the common path and another bug is on one of the several sub-paths associated with it. The size of the bug on the common path is naturally much higher than that of the bug on a sub-path, since every input passing through any of the sub-paths must traverse the common path before entering a sub-path. The size of a bug may thus also indicate how quickly the bug could be identified. If a bug of large size is not detected, it may be a potential threat to the functioning of the software. It is therefore natural that the probability of detecting a bug depends on its size: the larger the size of the bug, the larger the probability of detecting it, as indicated in [2]. In fact, the authors of [2] have also shown that similar concepts apply to discovering fields with rich hydrocarbon content in oil and natural gas production.
Software reliability and its dependence on the location of a bug. A bug that exists on a path rarely traversed by any input is likely to be harmless as far as the running of the software is concerned. Thus, the reliability of the software depends not only on the number of bugs remaining in the software, but also on the positioning of those bugs: in particular, on which paths they lie and whether those paths are frequently traversed by random inputs from the users [10]. Hence, for a better model of software reliability, our attention is on the total size of the remaining bugs and not just their number.
Software testing and different testing phases.
In a discrete software testing framework, when an input is tested, it results in either a success (i.e., finding an error) or a failure (i.e., not finding an error). Testing of software is carried out in many phases, where in each phase a series of inputs is tested and the result of each test is recorded as either a success or a failure [7]. The bugs identified during a phase are debugged at the end of that phase. This process of debugging is known as periodic debugging or interval debugging [4].
Hence, detecting a bug during software testing can be thought of as probabilistic sampling, where the chance of a bug being detected is an increasing function of the size of the bug. This is analogous to the size-biased modelling of species identification by [12].

Motivation
The present work was motivated by a key idea from one of the authors in the early nineties: the problem of the optimum time to stop software testing and that of the optimum time to stop drilling in hydrocarbon exploration were found to be analogous [2,1]. It is logical that bigger fields, in terms of the amount of oil and natural gas that can be obtained by drilling, are expected to be drilled much earlier than the others. Thus, size (in terms of the value of the oil and natural gas in the field) plays an important role in determining the chronological order of the drilling areas. Ideally, this strategy would minimize the overall drilling cost. Similarly, once the size of a bug is appropriately defined (as in the earlier paragraphs), the size-biased nature of the problem can be used to model the detection probability of a bug in software testing data. However, the eventual size of a bug remains unknown, which is the major challenge in the present problem.
The article is organized as follows. In Section 2, we develop a Bayesian generalized linear mixed model, and in Section 3 we describe the model fitting procedure and the model performance measures used in this study. In Section 4, we present a simulation study to assess the performance of our model, using relative bias, coefficient of variation, and coverage probability. We then apply the model to two empirical software testing data sets: Section 5 illustrates an application to a commercial software testing data set, and Section 6 shows an application to highly critical space mission software testing data. The article ends with a discussion and conclusion in Section 7.

General approach
We utilize the hierarchical modelling philosophy to formulate a statistical model for estimating the total number of bugs in the presence of imperfect detection during the software testing process. The developed model can also be used to estimate the remaining eventual size of the bugs present in the software, as well as the software reliability. We also provide a new method to predict the stopping phase such that the estimated remaining bug size at that phase stays below a preassigned threshold. Later, we extend the model to accommodate possible groups of bugs that share the same bug size.

Model description
The model is composed of two hierarchical parts: a state process that describes the latent dynamics of the bugs within the software, and an observation model that describes the probabilistic structure of the observed software testing data.

State process
Consider N distinct and independent bugs present in a particular software, with the size of each bug denoted by S_i, i = 1, 2, ..., N. The eventual size of a bug (or, in short, the size of a bug) is a latent variable in the model and needs to be estimated. Let S denote the vector of latent variables S_1, S_2, ..., S_N giving the sizes of the N (unknown) bugs under study. For ease of computation and other technical advantages (described later), we define N ~ Binomial(M, ψ), where M represents the maximum possible number of bugs present in the software and ψ denotes the inclusion probability, i.e., the proportion of M that represents the real population of bugs.
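As an illustration, the state process above can be sketched in a few lines of Python. This is a minimal simulation, not the paper's implementation (which uses NIMBLE in R); the values of M and ψ are assumptions chosen for the example.

```python
import random

random.seed(1)
M = 400    # augmentation bound on the number of bugs (assumed value)
psi = 0.5  # inclusion probability (assumed value)

# z_i ~ Bernoulli(psi); the number of real bugs is N = sum(z_i),
# which is therefore N ~ Binomial(M, psi).
z = [1 if random.random() < psi else 0 for _ in range(M)]
N = sum(z)
```

The latent indicators z_i are exactly the data-augmentation variables introduced in the observation process below: summing them recovers N.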

Observation process
We suppose that T_j inputs are used in the j-th testing phase, j = 1, 2, ..., Q. We consider the situation where a present bug can get detected by any of the T_j inputs in the j-th phase.
Let y_ij denote the binomial detection outcome for bug i over the T_j inputs in phase j. If y_ij > 0, then y_il = 0 for l = 1, 2, ..., (j − 1). Note that after a bug is detected in the j-th phase, it is eliminated from the pool of bugs during the debugging at the end of phase j. For example, if bug 1 is detected in phase j = 4, we would have y_11 = y_12 = y_13 = 0 and y_14 > 0.
We use a data augmentation approach to model the number N of bugs in the software by choosing a large integer M to bound N and introducing a vector of M latent binary variables z = (z_1, z_2, ..., z_M) such that z_i = 1 if bug i is a member of the population and z_i = 0 otherwise. We assume that each z_i is a realisation of a Bernoulli trial with parameter ψ, the inclusion probability.
A binomial model, conditional on z_i, is assumed for each observation y_ij: y_ij | z_i ~ Binomial(T_j, z_i p_i), where p_i denotes the detection probability of the i-th bug in a phase. The detection probability p_i is modelled as an increasing function of the bug size S_i, since detectability depends directly on the size of a bug: the larger the bug size, the higher the detectability.

Model for detection probability
From the definition of bug size, S_i is higher if the i-th bug lies on a common path near the origin with a number of sub-paths following it. If r denotes the probability of detecting the bug by any one of the inputs that will pass through the i-th bug, then the probability of detecting the i-th bug with one input is p_i = 1 − (1 − r)^{S_i}. The parameter r is shared across all the bugs and is critical for the dependence structure of the nodes in our joint probability model. In addition, the above formulation of p_i comes naturally from our definition of bug size and accounts for individual-level heterogeneity in the detection probability of the bugs [12]. Note that p_i is a monotonically increasing function of S_i, and when S_i = 0 we have p_i = 0.
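The size-biased detection probability can be written as a one-line function. A minimal Python sketch (the parameter values are illustrative, in the same order of magnitude as those used later in the simulation study):

```python
def detection_prob(r, S):
    # p_i = 1 - (1 - r)^{S_i}: the probability that at least one of the
    # S_i eventual inputs passing through the bug detects it.
    return 1.0 - (1.0 - r) ** S

# Monotone in S: a larger bug is easier to detect.
p_small = detection_prob(1.5e-5, 100)
p_large = detection_prob(1.5e-5, 10000)
```

Note the two boundary properties stated in the text: p_i = 0 when S_i = 0, and p_i increases monotonically towards 1 as S_i grows.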

Model for N
We assume that n bugs get detected over the Q testing phases; n is expected to be less than the total number of bugs N due to imperfect detection during testing. Consequently, as part of the data augmentation approach, the detection data set {y_ij}_{i,j} is supplemented with Y_rem, an array of "all-zero" detection histories of dimension (M − n) × Q. We label the zero-augmented complete detection data set Y.

Estimating the remaining eventual bug size and the stopping phase
In software testing, certain decisions are critical: for example, when should we stop testing, and what should be the criteria for stopping the testing process? If some bugs remain in the software after the testing and debugging phases, they may cause improper functioning of the software even after the market release. Therefore, a decision that optimizes software testing and debugging time is an important part of the software development process.
The above model is well suited to estimating the number of bugs N, the detection probabilities p_i, and the bug sizes S_i. To estimate the remaining eventual bug size at a later, untested phase, we proceed as follows.
We denote by f the model for the detection observations for a bug with T_j inputs, j = 1, 2, ..., J, where J > Q, and by ỹ a future observation, i.e., an alternative detection outcome that could be obtained in a later phase. Since the stopping phase (at which the remaining eventual total size of the bugs falls below a threshold, say ε) is unknown to the software tester, we assign a sufficiently large value to J, considering the available RAM of the computing device and the computing time. The posterior predictive density for new detection data ỹ_i for the i-th bug is then f(ỹ_i | Y) = ∫ f(ỹ_i | θ) π(θ | Y) dθ, where θ denotes the vector of all the parameters r, S, z, ψ, and f(ỹ_i | Y) is the predictive density for ỹ_i induced by the posterior distribution π(θ | Y).
In practice, we obtain a single posterior replicate ỹ_i^(l) by drawing from the model f(ỹ_i | θ^(l)), where {θ^(l) : l = 1, 2, ..., L} is a set of MCMC draws from the posterior distribution of θ.
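This composition-sampling step can be sketched as follows. The sketch is an illustration under the model of Section 2.2 (one future phase with T test cases per posterior draw); the tuple layout of the posterior draws is an assumption made for the example, not the paper's NIMBLE code.

```python
import random

def predictive_replicates(theta_draws, T, rng=random):
    # For each posterior draw theta^(l) = (r, S_i, z_i) for one bug,
    # simulate a future detection count y~_i ~ Binomial(T, z_i * p_i),
    # where p_i = 1 - (1 - r)^{S_i}.
    reps = []
    for (r, S, z) in theta_draws:
        p = z * (1.0 - (1.0 - r) ** S)
        y = sum(1 for _ in range(T) if rng.random() < p)
        reps.append(y)
    return reps
```

A "pseudo-bug" with z_i = 0 never produces detections, which is exactly how the data augmentation separates real from augmented individuals.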
We define deterministic binary variables u_ij, taking the value 1 if the i-th bug is detected on or before the j-th phase and 0 otherwise. The total size of the bugs detected up to the j-th phase is then A_j = Σ_{i=1}^{M} z_i u_ij S_i, and consequently the total eventual remaining size of the bugs not detected up to the j-th phase is B_j = Σ_{i=1}^{M} z_i (1 − u_ij) S_i, j = 1, 2, ..., J. We obtain the stopping phase, denoted k, as the first phase such that B_k < ε (where ε is a preassigned threshold). We compute B_j for each replicated data set {ỹ_i^(l) : i = 1, 2, ..., M}, l = 1, 2, ..., L, thus enabling us to obtain an MCMC sample of both k and {B_j : j = 1, 2, ..., J}.
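For one set of bug sizes and detection phases, the remaining-size curve B_j and the stopping rule can be computed as below. This is a minimal sketch on toy inputs (in the model, the sizes and detection phases would come from a posterior replicate):

```python
def remaining_size(detect_phase, sizes, J):
    # detect_phase[i]: phase at which bug i was detected (None if never);
    # returns B_j = total size of bugs not detected up to phase j, j = 1..J.
    B = []
    for j in range(1, J + 1):
        B.append(sum(s for d, s in zip(detect_phase, sizes)
                     if d is None or d > j))
    return B

def stopping_phase(B, eps):
    # Smallest phase k with B_k < eps; None if never reached within J phases.
    for j, b in enumerate(B, start=1):
        if b < eps:
            return j
    return None
```

For example, three bugs of sizes 50, 30, 20 detected in phases 1, 2, and never give B = [50, 20, 20]; with threshold ε = 25 the stopping phase is k = 2.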

Software reliability
For a software testing detection data set, we define the software reliability at testing phase j as the posterior probability that the total eventual remaining size of the bugs B_j (not detected up to the j-th phase) is less than or equal to a prefixed small quantity ε, given the observed detection data Y: γ_j(ε) = P(B_j ≤ ε | Y). Consequently, reliability is a non-decreasing function of the threshold ε and of the testing phase j. For a fixed j, (i) as ε → 0, γ_j(ε) → 0, and (ii) as ε → ∞, γ_j(ε) → 1. Similarly, for a fixed ε, if we conduct a very large number of testing phases (i.e., j becomes large), the reliability γ_j(ε) will be very close to 1. Of course, the rate of convergence will also depend on the number of testing inputs T_j in each phase.
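Given an MCMC sample of B_j, the reliability γ_j(ε) is simply the fraction of posterior draws at or below the threshold. A minimal sketch with made-up draws:

```python
def reliability(B_draws, eps):
    # gamma_j(eps) = P(B_j <= eps | Y), estimated as the proportion of
    # posterior draws of B_j that fall at or below the threshold eps.
    return sum(1 for b in B_draws if b <= eps) / len(B_draws)
```

With draws [10, 20, 30, 40], the estimate is 0.0 at ε = 5, 0.5 at ε = 25, and 1.0 at ε = 100, illustrating that γ_j(ε) is non-decreasing in ε.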

Modelling for grouped bugs
Often we come across situations where a few bugs are collocated on the same path or the same part of the software in such a way that we can assume, without loss of generality, that each of them has the same bug size. For computational and notational simplicity, we transform the data set ((y_ij)) to (y*_g), where y*_g represents the number of bugs from the g-th group that are detected. Consequently, y*_g ~ Binomial(T_j(g), p*_g), where p*_g denotes the probability of detecting a bug belonging to the g-th bug-group with a single test case and j(g) denotes the phase corresponding to the g-th group.
Here, we consider N_G distinct groups of bugs present in a software, where each bug in the g-th group has size S*_g. Each group comprises at least one bug. Following Section 2.2.1, we define N_G ~ Binomial(M_G, ψ), where M_G is a large positive integer that bounds N_G. The link between p*_g and the size S*_g remains as in Section 2.2.3: p*_g = 1 − (1 − r*)^{S*_g}. We use the data augmentation approach to model the number of bug-groups (as discussed in Section 2.2.4). The total number of bugs N* is then N* = n + Σ_g a_g, where n denotes the number of bugs detected during the testing period and a_g denotes the number of bugs in the g-th group that went undetected. We utilize the posterior predictive distribution of new detection data ỹ* with density f(ỹ*_g | Y*) to estimate a_g. To compute the remaining eventual size, we introduce binary variables ((u_gQ)), g = 1, 2, ..., M_G, where u_gQ takes the value 1 if the g-th bug-group is detected on or before the Q-th phase and 0 otherwise. The remaining eventual size is calculated as Σ_g d_g (1 − u_gQ) S*_g, where d_g denotes the number of bugs in the g-th bug-group.

Prior assignment
Bug sizes S_i are latent and unobservable. We assign a Poisson-Gamma mixture prior to S_i to capture the required level of variability in this latent variable: each S_i is assumed to follow a Poisson distribution with mean λ_i, where λ_i is a random draw from a Gamma distribution with shape a_s and rate b_s. We assign bounded Uniform(0, 1) priors to the detection parameter r and the inclusion probability ψ. These proper prior specifications ensure propriety of the posteriors.
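A draw from the Poisson-Gamma mixture prior can be sketched as below. The hyperparameter values a_s = 2, b_s = 0.5 are assumptions for illustration only; note that Python's `random.gammavariate` takes a scale parameter, so the rate b_s enters as 1/b_s.

```python
import math
import random

def draw_bug_size(a_s, b_s, rng=random):
    # lambda_i ~ Gamma(shape a_s, rate b_s); S_i | lambda_i ~ Poisson(lambda_i).
    lam = rng.gammavariate(a_s, 1.0 / b_s)  # gammavariate uses scale = 1/rate
    # Poisson draw by inversion (adequate for the small means used here).
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(7)
sizes = [draw_bug_size(2.0, 0.5, rng) for _ in range(500)]
```

Marginally, the S_i follow a negative binomial distribution with mean a_s/b_s (here 4), so the mixture inflates the variance relative to a plain Poisson prior, which is the "required level of variability" the text refers to.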

Model fitting
We fitted the models using Markov chain Monte Carlo (MCMC) simulation; in particular, we used Gibbs sampling to simulate the parameters from the posterior distribution. The full conditional of each z_i is a Bernoulli distribution, whereas the full conditionals of the other parameters and latent variables (e.g., ψ, r, the S_i) are of non-standard form. We used a slice sampler for the S_i and random walk Metropolis-Hastings samplers for the other parameters (e.g., ψ, r). We implemented the MCMC computations using NIMBLE [6] in R [14]. We ran three chains of 10000 iterations each, including an initial burn-in phase of 5000 iterations. MCMC convergence and mixing of each model parameter were monitored using the Gelman-Rubin convergence diagnostic R̂ [9, with upper threshold 1.1] and MCMC traceplots.

Model performance measures
We used relative bias, coefficient of variation, and coverage probability to evaluate the accuracy and precision of the estimators of the model parameters. Suppose {θ^(r) : r = 1, 2, ..., R} denotes a set of MCMC draws from the posterior distribution of a scalar parameter θ.
Relative bias. Relative bias (RB) is calculated as RB = (θ̄ − θ_0)/θ_0, where θ̄ denotes the posterior mean (1/R) Σ_{r=1}^{R} θ^(r) and θ_0 is the true value.
Coefficient of variation. Precision was measured by the coefficient of variation, CV = s_θ/θ̄, where s_θ = sqrt((1/R) Σ_{r=1}^{R} (θ^(r) − θ̄)^2) is the posterior standard deviation of θ.
Coverage probability. Coverage probability was computed as the proportion of model fits for which the estimated 95% credible interval (CI) contained the true value of θ.
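The three measures above can be computed from one chain of posterior draws as follows. This is a minimal sketch; the empirical-quantile credible interval uses crude index-based quantiles rather than interpolation.

```python
def summarize(draws, theta0):
    # Relative bias, coefficient of variation, and whether the (crude)
    # 95% credible interval covers the true value theta0.
    R = len(draws)
    mean = sum(draws) / R                                  # posterior mean
    sd = (sum((x - mean) ** 2 for x in draws) / R) ** 0.5  # posterior sd
    rb = (mean - theta0) / theta0                          # relative bias
    cv = sd / mean                                         # coeff. of variation
    s = sorted(draws)
    lo, hi = s[int(0.025 * R)], s[int(0.975 * R) - 1]      # crude 95% CI
    return rb, cv, (lo <= theta0 <= hi)
```

Across repeated simulated data sets, the coverage probability is then simply the fraction of fits for which the third return value is True.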
Simulation study

Description of simulated data and simulation scenarios
For a complex high-dimensional model such as that described in Section 2.2, it is instructive to assess model performance over different ranges of the model parameters. We simulated software testing detection data sets for two values of the detection parameter r, viz., 0.75 × 10^−5 and 1.5 × 10^−5, and two values of the number of inputs per phase T_j, viz., 1000 and 2000. In total, we have four different simulation scenarios (Sets 1-4), and we simulated 200 data sets (50 under each scenario). In each scenario, we assumed a fixed number of bugs N = 200 when simulating the detection data, and the software testing was carried out over Q = 5 phases. Key details of the simulated data sets are given in Table 1. The number of detected bugs (and also the total number of detections) is higher on average in Set 2 (mean 132, 2000 inputs) than in Set 1 (mean 106, 1000 inputs); the detection parameter r is 0.75 × 10^−5 in both sets. The same phenomenon can be observed for Set 3 (1000 inputs) and Set 4 (2000 inputs), where r = 1.5 × 10^−5 (see Figure 1a,c). For estimating the remaining eventual bug size and the stopping phase, the posterior predictive simulations are carried out for 25 additional phases, implying J = Q + 25 = 30 (see Section 2.2.5).

Results from Simulation study
We fitted our Bayesian size-biased model to each of the 200 simulated data sets using MCMC, with M set to 400 for each model fit. All MCMC samples of the parameters of interest (e.g., population size N, detection parameter r) were obtained after ensuring proper mixing and convergence, with R̂ values below 1.1. The posterior estimates of the different parameters were obtained from the MCMC chains. Posterior summaries of the total number of bugs N and the detection parameter r for the simulation study are provided in Table 2 and portrayed in Figure 1.
We estimated the reliability at the end of each phase and also at different possible future phases (assuming a pre-specified number of test cases in each phase). It is important to mention that the estimate of reliability depends heavily on the pre-specified threshold and on the number of test cases used during the future phases (those conducted after the first 5 phases). Here we assumed the number of test cases in each future phase to be the same as the number of inputs in the respective scenario.
The reliability (i.e., the posterior probability of the remaining size lying below a threshold) is a non-decreasing function of the testing phase index, since the remaining bug size is reduced as more bugs are detected in subsequent testing phases. We found the phase at which the reliability estimate attains the targeted 95% level (with threshold 100) to vary across simulation scenarios (Figure 1). For instance, the reliability estimate attained the 95% level (with threshold 100) at phase 30 in Set 1, implying that the developer would need to continue software testing for 25 more phases (after the 5 testing phases already conducted) to attain the targeted software reliability level. Hence, the stopping phase was estimated as 30. For the other sets, the stopping phases were estimated as phase 24 (Set 2), phase 14 (Set 3), and phase 10 (Set 4).

Application to commercial software testing empirical data

Data description
The data set consists of a total of 8757 test inputs, detailed with build number, case id, severity, cycle, result of test, defect id, etc. In these data, the severity of a path is broadly divided into three categories, namely simple, medium, and complex, depending on the effect of the bug if it is not debugged before marketing the software. The data have four cycles, namely Cycle 1, Cycle 2, Cycle 3, and Cycle 4, equivalent to the different phases of testing referred to in Section 2. After each cycle, the bugs identified during that cycle are debugged, as mentioned in Section 2.

Results from commercial software testing data analysis
The posterior estimates of the main parameters N, ψ, r, and B_4 are provided in Table 3 and portrayed in Figure 2. The posterior mean estimate of the total number of bugs was 348, with a 95% credible interval of (317, 382). The posterior mean of the inclusion probability ψ was 0.696, with a 95% credible interval of (0.618, 0.774). The estimate of ψ also confirms that the upper bound M = 500 we set was large enough not to influence the estimation of N. Although the posterior mean of the size-biased detection parameter r was of very small magnitude, 8.761 × 10^−6, we coded the parameter with a logistic transformation to retain accuracy in estimation and MCMC mixing. The remaining eventual bug size after the 4 testing phases was estimated as 703, with a 95% credible interval of (457, 1006). Here we assumed the number of test cases in each future phase to be 3000, to resemble the observed data set.
We found the reliability to attain the target 95% level at phase 16 if testing had continued with 3000 test cases in each phase, implying that the developer would need to continue software testing for 12 more phases (after the 4 testing phases already conducted) to attain the targeted software reliability level. Hence, the stopping phase was estimated as 16. The reliability took much longer (40 phases) to reach the targeted 95% level with 1000 test cases per phase, and took only 12 phases with 5000 test cases per phase (these results are provided in the appendix). This also reveals that approximately 36000 future test cases are needed to attain the targeted reliability of 95%.
There were also different numbers of test cases for each mission, in each software, and in each phase. For our analysis we consider the testing data from MT and the seven phases of ST (i.e., Q = 8 testing phases in total) as the observed data set. We use the detections during CI as a deterministic constant because of the lack of probabilistic structure in this testing phase.

Results from ISRO mission data analysis
We applied the grouped version of our size-biased model (Section 2.3) to the ISRO mission data set, which is well suited to this model. The different missions, the different softwares used in those missions, and the different phases all contribute to the variation in groups and in the number of bugs per group. In the observed data set, any change in mission, software, or phase was considered to form a different group. Here it is not possible to extend the number of phases; hence, instead of finding a stopping phase, we obtain the number of future test cases required to bring the remaining bug size below a pre-specified threshold. These future test cases can be run before a future mission or after a software update.
The posterior mean of the number of groups of bugs was estimated at 84, with a 95% credible interval of (80, 89) (see Table 4). The posterior mean estimate of ψ is 0.257, with a 95% credible interval of (0.195, 0.323). This also confirms that our specified upper bound M_G = 200 for the number of groups is appropriate. The size-biased detection parameter is estimated as 1.102 × 10^−3, with a 95% credible interval of (6.439 × 10^−4, 1.807 × 10^−3). The total number of bugs present was estimated as 94, with a highly precise 95% credible interval of (94, 95).
The reliability of the softwares is estimated as 0.995 after the 8 testing phases (including module testing and the seven phases of simulation testing) with threshold ε = 25. Since the testing phases managed to detect almost all the bugs present in the softwares, the reliability is very high. We also show that reliability increases with the number of future test cases (Figure 3).

Discussion
We have described a Bayesian generalized linear mixed model that can be applied to a software testing detection data set to explicitly model and estimate the population size, the detection probability, and the latent sizes of the bugs. The model also allows estimation of software reliability for any given threshold (Section 2.2). Consequently, we can obtain an estimate of the stopping phase, i.e., the number of additional phases of testing required to achieve an optimum reliability level (say, 0.95).
We showed via a simulation study that the parameters of interest (e.g., N, r, reliability) can be accurately estimated by our model. The number of inputs plays a key role in software testing in general, as a higher number of inputs boosts the probability of detecting bugs (Table 1). This also led to more accurate estimation of the model parameters, as can be seen in the lower CV estimates of N and r with a higher number of inputs (Table 2). Further, in such scenarios, the threshold reliability level was attained comparatively more quickly than in the scenarios with a lower number of inputs (Figure 1e). The size-biased model fitted to the empirical software testing data yielded satisfactory estimates of the key parameters. However, we noticed that the software testing conducted was rather inefficient, since the estimated software reliability was approximately zero after the first four phases of testing (Figure 2). We anticipate that some major bugs (of moderately large size) were still present. We recommend continuing testing for at least 36000-40000 more test cases (which could be broken into multiple phases) to attain the desired software reliability level of 95%.
In contrast, the software reliability estimates for the ISRO mission softwares were found to be extremely high (i.e., 0.998) after the first 8 testing phases, demonstrating the advantage of efficient software testing. Our finding that the number of detected bugs was almost equal to the number of bugs available to be detected also supports this.
The developed model can also be used for similar problems in other fields. For instance, in hydrocarbon exploration, drilling a field can be considered analogous to testing a software with different inputs, the outcome of which is either a success (implying sufficient hydrocarbon has been found after drilling) or a failure (implying that the drilling did not yield a viable amount of hydrocarbon).
Given the enormous interest in software testing in the technology sector, our size-biased model could be very useful in providing accurate estimates of the number of bugs present as well as of the software reliability. Our model uses the Bayesian paradigm, which adds the flexibility required to estimate a large number of model parameters. Although we found the parameter estimates to be moderately robust, we recommend conducting a prior sensitivity study before applying the size-biased model.

Figure 1 :
Figure 1: Details of simulated data summary and parameter estimates across the 4 simulation scenarios. Panels (a) and (c): violins of the number of detected bugs (panel a) and the total number of detections across all the bugs (panel c).

Figure 2 :
Figure 2: Details of commercial software testing data summary and parameter estimates. Panels (a) and (b): comparison of the number of detected bugs (panel a) and the total number of detections across all the bugs (panel b).

Figure 3 :
Figure 3: Details of ISRO mission software testing data summary and parameter estimates. Panel (a): number of detected bugs in each phase of the ISRO mission data set. Panels (b) and (c): posterior density violins of the population size estimator N (panel b) and the detection parameter r (panel c). Panel (d): estimates of posterior reliability for the thresholds 25, 50, 75, 100, 150, 200. The horizontal dotted line represents the reliability estimate after the first 8 testing phases. The bars in each barplot correspond to different numbers of future test cases: 25, 50, 75, 100, 150, 200, 250, 300. Each barplot in panel (d) corresponds to a distinct threshold (given along the x-axis).

Table 1 :
Number of detected bugs and number of total detections (mean, median, 2.5% and 97.5% quantiles) in the simulated software testing data sets across 50 repetitions for each simulation scenario.

Table 2 :
Relative bias (mean, median, 2.5% and 97.5% quantiles), coefficient of variation (mean, median, 2.5% and 97.5% quantiles) and coverage probability of the 95% credible interval for population size N and detection probability r across 50 repetitions for each simulation scenario.

Table 3 :
Estimates of different parameters in the data analysis of commercial software testing data (Section 5).

Table 4 :
Estimates of different parameters in the ISRO mission software testing data analysis (Section 6). n_CI = 33 bugs were detected during the CI stage, n_MT = 27 bugs were detected during MT, and n_ST = 34 bugs were detected during ST (where the phase-specific segregation is as the following:

It is hereby declared that the authors do not have any conflict of interest.