Never underestimate biodiversity: how undersampling affects Bray–Curtis similarity estimates and a possible countermeasure

Abstract
The Bray–Curtis dissimilarity is widely used to calculate β diversity on abundance data. However, the effect of undersampling on this index has received limited attention, and only a few studies have addressed this topic. This paper aimed to investigate the error introduced by undersampling and its correlation with the similarity of the complete datasets, and to propose a possible countermeasure based on the addition of dummy species. To evaluate the performance of this approach, we applied a meta-analytic technique based on repeated random subsamples of 16 published datasets of biological assemblages. We estimated the effect of undersampling on the resulting similarities and compared the results with the adjusted version of the index, obtained by adding extra species, also called dummy species, to the original abundance dataset. Undersampling generally resulted in poor accuracy and led to underestimates of assemblage similarities. The addition of dummy species reduced the severity of underestimations. To reach an accuracy >80% in similarity, more than 300 individuals needed to be randomly sampled, and the under- and over-estimation rates decreased consistently with the addition of dummy species. Additionally, we found that the more similar two assemblages were, the more likely similarities were to be underestimated, and this tendency was more severe at low sample sizes. Our simulation indicated that datasets containing more than 300 individuals provided reliable estimates of similarities and that the addition of one to three dummy species to the abundance matrices was a good choice to reduce underestimates and increase accuracy.


Introduction
Biodiversity is the variety of life, in all of its many manifestations. It encompasses all forms, levels and combinations of natural variation (Gaston & Spicer 2004). This term, first coined in 1986, is today at the forefront of public and scientific attention. Measuring biodiversity has received much attention in past decades (e.g. Colwell & Coddington 1994; Magurran 2004), and biological diversity is commonly divided into α diversity, β diversity and γ diversity. The study of β diversity, defined as variation in the identities of species among sites, is genuinely at the heart of community ecology: what makes assemblages of species more or less similar to one another (Anderson et al. 2011). Ecologists often want to know how similar two or more assemblages are, for example to investigate the effect of different management regimes on species composition, to compare assemblages from habitats at varying distances, or to follow changes in community composition over time. Similarity is most often estimated with quantitative indices, and today ecologists can choose among a multitude of dissimilarity indices (Jost et al. 2011; Beck et al. 2013), but there is no perfect function capable of summarizing all aspects of biological dissimilarity. By reducing the structure of a multidimensional set, such as a biological community, into a single number, information is necessarily lost (Ricotta & Podani 2017). Here we refer to β diversity as a concept that takes species distinction into account and compares the similarity of sites, which has been termed "differentiation diversity" (Jurasinski et al. 2009).
An important conceptual distinction is between β diversity metrics that use presence-absence data and metrics that also include abundance information (Anderson et al. 2011). Estimates of incidence-based indices do not perform well and are generally biased downward, and interpretation becomes especially difficult when comparing assemblages that contain numerous rare species (Chao et al. 2006).
The Bray-Curtis (dis)similarity (Bray & Curtis 1957) is one of the most frequently employed statistics when calculating β diversity on abundance data (Jost et al. 2011; Bacaro et al. 2012). Many textbooks and papers suggest using this index (e.g. Clarke & Warwick 2001; Kindt & Coe 2005; Schroeder & Jenkins 2018), particularly when computing dissimilarity matrices and when analyzing multivariate community data (Minchin 1987; Oksanen et al. 2019). This choice is based on sound statistical reasons; for example, Clarke and Warwick (2001), Clarke et al. (2006), and Legendre and De Cáceres (2013) listed a number of desirable properties that the Bray-Curtis index possesses, and this β diversity metric was also among those that performed best when tested against 18 desirable properties (Barwell et al. 2015). Additionally, Bloom (1981) showed that the Bray-Curtis similarity index accurately reflects the true resemblance along its entire 0 to 1 range. This index has neither metric nor Euclidean properties (Gower & Legendre 1986), but it does not suffer from the Orloci paradox (Legendre & Gallagher 2001), displays a consistent and linear behavior in response to changes in abundance (Ricotta & Podani 2017) and is one of the few indices not sensitive to nestedness (Barwell et al. 2015). However, some authors also caution against using the Bray-Curtis index, as it makes sense only if the sampling fractions are known to be equal (Chao et al. 2006; Jost et al. 2011). Another measure which might be appropriate for calculating β diversity is the Chord-Normalized Expected Species Shared (CNESS) distance, but only recently have scripts been provided to calculate it conveniently (Zou & Axmacher 2019).
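For reference, the abundance-based Bray-Curtis similarity and dissimilarity are usually written as shown here. The notation is introduced only for illustration and is an assumption of this sketch: $x_{i1}$ and $x_{i2}$ denote the abundance of species $i$ in the two assemblages being compared, and $S$ is the number of species recorded in either assemblage.

```latex
\[
BC_{\mathrm{sim}} = \frac{2\sum_{i=1}^{S}\min(x_{i1}, x_{i2})}{\sum_{i=1}^{S}\left(x_{i1} + x_{i2}\right)},
\qquad
BC_{\mathrm{dis}} = 1 - BC_{\mathrm{sim}}
= \frac{\sum_{i=1}^{S}\lvert x_{i1} - x_{i2}\rvert}{\sum_{i=1}^{S}\left(x_{i1} + x_{i2}\right)}
\]
```

Both quantities range from 0 to 1. The two forms of the dissimilarity agree because $x_{i1} + x_{i2} - 2\min(x_{i1}, x_{i2}) = \lvert x_{i1} - x_{i2}\rvert$ for every species.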
Undersampling, i.e. the incompleteness of species inventories due to limited field sampling, is a common problem in biodiversity studies (e.g. Chao et al. 2005; Cardoso et al. 2009; Beck & Schwanghart 2010; Beck et al. 2013). This means that rare or rarely detected species in a community will be less well represented in a small sample, both with regard to their occurrence (i.e. they may often not be found) and their relative abundance (random variation in individual numbers has a larger effect on their relative abundance in the sample) (Melo 2021), while common species will often be adequately represented (Beck et al. 2013). Estimates of incidence-based indices are generally biased downward and the bias increases when sample sizes are small (Chao et al. 2006), while for abundance data the effects of undersampling are thought to be less of a concern (Beck et al. 2013), and Schroeder and Jenkins (2018) reported that the Bray-Curtis index is relatively robust to undersampling. However, it has been shown for a few datasets that undersampling of biological assemblages leads to underestimates of the true Bray-Curtis similarity (Cao et al. 2001; Chao et al. 2006; Schneck & Melo 2010; Barwell et al. 2015; Hardersen et al. 2017), an undesirable property, as a good β diversity metric should remain constant as the sample size decreases (Barwell et al. 2015). Nevertheless, the effect of undersampling on the Bray-Curtis index has received only limited attention (Cao et al. 2001; Schmera & Eros 2006; Schneck & Melo 2010), and currently very few indications are available on what constitutes an adequate sample size for assemblage data gathered with the intent to calculate Bray-Curtis similarities (e.g. Schneck & Melo 2010; Hardersen et al. 2017).
One general problem generated by undersampling is that it greatly increases random variation in estimates, i.e. it leads to low precision of estimates (Beck et al. 2013), and the Bray-Curtis coefficient shows an increasingly erratic behaviour as values within samples become vanishingly sparse (Clarke et al. 2006). For such denuded samples, Clarke et al. (2006) suggested modifying the Bray-Curtis coefficient so that it is less erratic for samples with few individuals and is defined for samples with complete absences. The solution proposed by the authors is to add a "dummy species" to the original abundance matrix, with value 1 for all samples. The result of this addition is that the dissimilarity between two samples tends smoothly to zero as the samples become vanishingly sparse (Clarke et al. 2006). Adding a dummy species to each assemblage also makes the assemblages more similar and thus might alleviate the problem that undersampling of biological assemblages leads to underestimates of the true Bray-Curtis similarity (Chao et al. 2006; Hardersen et al. 2017).
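A minimal sketch of the zero-adjusted index described above, assuming that each dummy species has abundance 1 in both samples, as suggested by Clarke et al. (2006). The function name `bray_curtis_similarity` and the parameter `n_dummy` are illustrative only and do not come from the cited works or from any package.

```python
from typing import Sequence


def bray_curtis_similarity(
    x: Sequence[float], y: Sequence[float], n_dummy: int = 0
) -> float:
    """Bray-Curtis similarity between two abundance vectors.

    x and y list the abundances of the same species in the same order.
    n_dummy dummy species, each with abundance 1 in both samples, are
    added before the index is computed (assumption: value 1 per sample).
    """
    if len(x) != len(y):
        raise ValueError("abundance vectors must cover the same species")
    if n_dummy < 0:
        raise ValueError("n_dummy must be non-negative")
    # Each dummy species contributes min(1, 1) = 1 to the shared abundance
    # and 1 + 1 = 2 to the total abundance of the two samples.
    shared = sum(min(a, b) for a, b in zip(x, y)) + n_dummy
    total = sum(x) + sum(y) + 2 * n_dummy
    if total == 0:
        raise ValueError("index is undefined for two empty samples")
    return 2 * shared / total
```

For example, `bray_curtis_similarity([3, 0, 1], [1, 2, 0])` evaluates to 2/7 (about 0.29), while the same call with `n_dummy=1` evaluates to 4/9 (about 0.44). Two empty samples with one dummy species give a similarity of 1.0, i.e. a dissimilarity of zero, which is the limiting behaviour described by Clarke et al. (2006).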
The aim of this paper is to (i) use published datasets of biological assemblages to test for the effect of undersampling on the resulting Bray-Curtis similarities; (ii) investigate whether adding dummy species to the assemblages makes the Bray-Curtis index less susceptible to undersampling; (iii) investigate how the error introduced by undersampling is correlated to the similarity of assemblages, to the number of sampled individuals and to the number of dummy species; (iv) define an optimal number or range of dummy species to add to the assemblages; and (v) discuss the consequences of our findings for some commonly applied statistical procedures involving the Bray-Curtis index.

Material and methods
We selected studies that report biological assemblage data, applying three criteria: (1) the study reports abundance data for at least two assemblages that can be compared; (2) the total number of individuals reported for the two assemblages exceeds 1200; and (3) the studies span a wide range of animal assemblages, from invertebrates to vertebrates. Based on these criteria we chose a total of 16 studies which report on the following assemblages: dung beetles (Andresen 2003), ground-dwelling invertebrates (Bonham et al. 2002), ground-dwelling ants (Boulton et al. 2005), stoneflies (Collier et al. 1997), demersal fish (Gristina et al. 2006), forest beetles (Hardersen et al. 2014), estuarine macrofauna (Heck et al. 1995), Lepidoptera (Horváth et al. 2013), birds (Jokimäki & Kaisanlahti-Jokimäki 2003), butterflies (Maes et al. 2016), groundfishes (Mueter & Norcross 2000), mosquito species (Muturi et al. 2006), carabid beetles (Niemelä et al. 2002), beetles living in farmland (Shah et al. 2003), soft-sediment assemblages (Stark et al. 2003), and saproxylic beetles (Thorn et al. 2014) (Table I). The selected studies were characterized by a mean species richness of 55 (min = 13; max = 203), with a ratio between abundance and species richness ranging from a minimum of 11.6 to a maximum of 10,880.9. The number of shared species ranged from 10 to 138, with an average value of 41.
The data were analyzed using a meta-analytic technique to compare abundance data of the above datasets. We resampled the smaller assemblage of each dataset 1,000 times without replacement by randomly drawing 10, 30, 100, 300, and where appropriate 900, 3000, 6000, and 30,000 individuals. At the same time, the larger assemblage of each dataset was randomly resampled using the same sampling fraction of individuals as in the smaller assemblage. In other words, when we drew n individuals from the smaller assemblage, we calculated the proportion of the total that they represented and applied the same proportion to the larger assemblage, as the Bray-Curtis index becomes meaningless if unequal sampling fractions are considered (Chao et al. 2006). This resampling protocol mimics different degrees of undersampling under the theoretical assumption of equal detectability for all species in the dataset. For each resample, we calculated both the Bray-Curtis index value (hereafter called the true Bray-Curtis value) and the adjusted values obtained by adding one, three or five dummy species. In order to estimate the expected proportion of species included in each resample, we calculated the Good-Turing sample completeness (Chao et al. 2015), which expresses the percentage coverage of (un)detected species.
Clarke et al. (2006) suggested adding one "dummy species" to the original abundance matrix, with value 1 for all samples. We also added three and five dummy species to investigate the effect on the resulting Bray-Curtis value, as we hypothesized that this might counter the underestimation of the true Bray-Curtis similarity due to undersampling. Variances in Bray-Curtis values, estimated by adding zero, one, three or five dummy species, were compared using Levene's test separately for each group of resampled individuals. The coverage probability within and outside the range of ±10% around the true Bray-Curtis value was used to evaluate the proportion of cases accurately estimated and those falling into under- and over-estimation for each resample (Walther & Moore 2005). These data were combined for all datasets and reported as mean values (±sd) in accuracy plots. Afterwards, we measured bias by comparing the Bray-Curtis values derived from the sub-samples to the true values of the whole community, and we used the mean error (ME) and the standard deviation (sd) as measures of the systematic error and precision of the sample (Walther & Moore 2005). Furthermore, a weighted regression analysis was performed in order to analyze the relationship between the number of dummy species and the mean error, using the true Bray-Curtis similarity as the basis for comparison.
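The resampling protocol described above can be sketched as follows. This is an illustration under stated assumptions, not the script used for the analyses: the function names are hypothetical, assemblages are represented as dictionaries mapping species names to counts of individuals, and the number of individuals drawn from the larger assemblage is rounded to the nearest integer.

```python
import random
from collections import Counter
from typing import Dict, Optional, Tuple


def draw_individuals(
    abund: Dict[str, int], n: int, rng: random.Random
) -> Dict[str, int]:
    """Draw n individuals without replacement from one assemblage."""
    # Expand the counts into one entry per individual, then sample.
    pool = [species for species, count in abund.items() for _ in range(count)]
    return dict(Counter(rng.sample(pool, n)))


def resample_pair(
    smaller: Dict[str, int],
    larger: Dict[str, int],
    n: int,
    rng: Optional[random.Random] = None,
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Resample both assemblages with the same sampling fraction.

    n individuals are drawn from the smaller assemblage; the larger one
    is sampled with the same proportion of its total abundance
    (assumption: rounded to the nearest whole individual).
    """
    rng = rng or random.Random()
    fraction = n / sum(smaller.values())
    n_larger = round(fraction * sum(larger.values()))
    return draw_individuals(smaller, n, rng), draw_individuals(larger, n_larger, rng)
```

Repeating `resample_pair` 1,000 times for each value of n and computing the similarity on each pair of subsamples would reproduce the structure of the simulation. Species that are not drawn are simply absent from the returned dictionaries and should be treated as zero counts.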
We conducted statistical analyses using the R statistical software version 4.2.2 (R Core Team 2022) with the ecodist (Goslee & Urban 2007) and mosaic (Pruim et al. 2017) packages, and plotted graphs using the ggplot2 library (Wickham 2016). All datasets are presented in Online Resource 1.
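The accuracy and bias measures described above can be illustrated with a short sketch. It assumes that the ±10% range is taken relative to the similarity of the complete dataset; the function name `accuracy_summary` and the keys of the returned dictionary are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def accuracy_summary(
    estimates: Sequence[float], true_value: float, tolerance: float = 0.10
) -> Dict[str, float]:
    """Summarize resampled similarities against the full-data value.

    Estimates within +/- tolerance (relative, an assumption) of the
    true value count as accurate; the rest are under- or overestimates.
    """
    lower = true_value * (1 - tolerance)
    upper = true_value * (1 + tolerance)
    n = len(estimates)
    under = sum(e < lower for e in estimates) / n
    over = sum(e > upper for e in estimates) / n
    # Signed errors: a negative mean error indicates underestimation.
    errors = [e - true_value for e in estimates]
    return {
        "accurate": 1 - under - over,
        "under": under,
        "over": over,
        "mean_error": mean(errors),
        "sd": stdev(errors),
    }
```

The mean error captures systematic bias, while the standard deviation of the errors captures precision, in the sense of Walther & Moore (2005).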

Results
In all datasets tested, undersampling generally led to underestimates of assemblage similarities, and adding dummy species reduced the severity of underestimations, with larger numbers of dummy species resulting in less severe underestimates. In Figure 1 we provide the dataset of Muturi et al. (2006) as an example. The other cases are reported in Online Resource 2.
When considering all datasets, dummy species also increased precision and reduced the erratic behaviour of the estimates. This effect was more pronounced at small sample sizes, for which Levene's test for equality of variances revealed significant heterogeneity (P < 0.05) (Figure 2, Table II).
As expected, the simulations resulted in mean values of the relative Good-Turing coverage of detected species which increased with larger numbers of resampled individuals, ranging from 77% ± 18% for sub-samples with 10 individuals, corresponding to a mean coverage deficit of 23%, to 100% ± 0.00% for those with 6000 individuals (coverage deficit 0%) (Table III).
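As an illustration of how sample coverage can be estimated from a subsample, the sketch below implements the simplest Good-Turing form, one minus the proportion of individuals that belong to species seen exactly once. This is an assumption made for illustration: the text does not specify which variant of the estimator from Chao et al. (2015) was used, and the function name `turing_coverage` is hypothetical.

```python
from typing import Iterable


def turing_coverage(counts: Iterable[int]) -> float:
    """Simple Good-Turing estimate of sample coverage.

    counts holds the number of individuals per species in the sample.
    Coverage is estimated as 1 - f1 / n, where f1 is the number of
    singleton species and n the number of individuals sampled.
    """
    observed = [c for c in counts if c > 0]
    n = sum(observed)
    if n == 0:
        raise ValueError("coverage is undefined for an empty sample")
    singletons = sum(1 for c in observed if c == 1)
    return 1 - singletons / n
```

The coverage deficit is then one minus this value. For a sample with counts 5, 3, 1 and 1, two of the ten individuals are singletons, so the estimated coverage is 0.8 and the deficit is 0.2.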
Combining the results from all selected studies showed that with a sample of 10 individuals an underestimation occurred in 80% of the cases, while the values were close to the true Bray-Curtis value (±10%) in only 13.2% of cases. Adding dummy species to the assemblages led to less frequent underestimations, with 67.3% of estimates falling below the 10% limit with one dummy species and 48.5% with three dummy species. With five dummy species overestimates became more frequent than underestimates, and the latter were observed in 35% of cases. Increasing the number of sampled individuals resulted in higher accuracy of the estimated similarities and in fewer underestimates. For example, drawing 100 individuals from the complete dataset resulted in 47% of the values falling within ±10% of the similarity value of the complete dataset. The addition of three or five dummy species increased these values to 52.4% and 54.3%, respectively, and the frequency of underestimates was reduced to 36.3% and 30.8%, respectively. However, to reach an accuracy above 95%, more than 900 individuals needed to be sampled, and these returned Bray-Curtis values that on average resulted in under- and over-estimation rates below 2%.
It is important to stress that small sample sizes did not only result in poor accuracy but resulted overwhelmingly in the underestimation of the true value. For 10 individuals drawn, the underestimation rate was 80.3% and only 6.7% of cases resulted in an overestimation. For 30 individuals sampled, underestimation occurred in 69.5% of cases and overestimation in 7.3%. Introducing dummy species evened out the number of over- and under-estimations, and at five dummy species both curves started to be comparable (Figure 3).
The correlation and regression analyses between the similarity of the assemblages and the mean error are presented in Figure 4. A low number of individuals sampled (e.g. 10) resulted in the highest mean error rates and these were always negative. Interestingly, a clear and statistically significant negative correlation was observed (P < 0.05) in all cases. For 10 individuals drawn, the highest error was observed in assemblages that were most similar, and for these the underestimates (negative errors) were more marked (Figure 4). Increasing the sample size drastically reduced the error and at 900 individuals it was almost zero. Increasing the number of dummy species also reduced the overall error. With 10 sampled individuals and one dummy species, negative errors still prevailed and a positive error was only observed for assemblages characterized by a very low similarity. Adding three dummy species to the original dataset resulted in an error which was distributed around zero, with small positive deviations for similarities less than 0.4 and negative values for higher similarities. With five dummy species, the error increased for assemblages with low similarities (i.e. <0.4) and was more evenly distributed around zero. Increasing the sample size to 100, 300 and 900 individuals resulted in a decrease of mean error rates, and the effect of adding dummy species decreased. However, even at a sample size of 900, adding three or five dummy species resulted in increased error rates for extreme similarity values (Figure 4).
When examining the influence of the number of dummy species on the mean error (Figure 5), we found that without any dummy species the mean error was negative and that it increased when more dummy species were added. This increase depended on the number of sampled individuals, and the relationship was significant within the range of 10 to 100 individuals. A small number of draws (i.e. 10) resulted in a pronounced positive slope of the regression line, which crossed zero at two dummy species. With 100 individuals the slope of the regression was much smaller, and the line crossed zero at five dummy species. However, at 300 individuals the slope was not significantly different from 0 (R = 0.16; P = 0.129).

Discussion
For such denuded samples, Clarke et al. (2006) suggested adding a "dummy species" to the original abundance matrix to improve the behaviour of the Bray-Curtis index, and we found that this approach alleviated, to some degree, the effect of undersampling on the estimates of similarity. For example, adding one dummy species to samples of 10 individuals led to 13% fewer underestimates. This result was consistent even for 300 individuals, a sample size considered sufficient for reliable estimates of similarity by Hardersen et al. (2017), where accuracy increased from 80% (no dummy species) to 80.7% (one dummy species). Adding further dummy species to the datasets resulted in less severe underestimates, and at five dummy species under- and overestimates were of similar magnitude. However, for small sample sizes, this high number of dummy species resulted in consistent overestimates at low similarity values (Figure 3) and the mean error increased above zero with increasing numbers of dummy species; the trend shown in Figure 5 also continued for seven and nine dummy species (data not shown). Given that the mean error generated by sub-samples should ideally be zero, we suggest that the most appropriate number of dummy species is between 1 and 3 (Figure 5).

Small sample sizes (e.g. 10 individuals) also resulted in highly variable estimates of similarities (Figure 1 and Table II), which is in accordance with the finding by Beck et al. (2013) that undersampling greatly increases random variation in estimates. This is an undesirable property, as a good estimator should show little variation (Walther & Moore 2005). However, this behaviour is to be expected, as generally the precision of an estimator increases with the square root of the sampling effort (Marriott 1990; Logan 2010). In this case too, adding dummy species improved the behaviour of the estimates by decreasing their variability, as also shown by Clarke et al. (2006). This improvement was particularly evident for small sample sizes, but the effect was still apparent for 30 and 100 individuals (Figure 4).
The error introduced by undersampling was negatively correlated to the similarity of assemblages. Thus, the more similar two assemblages were, the more likely similarities were to be underestimated, and this tendency was more severe at low sample sizes. We are not aware that this correlation has been reported before. This could have significant consequences in ecological research, where inadequate understanding of how bias, accuracy, and precision influence the estimation of diversity can result in inappropriate selection of estimators, inconsistent estimation outcomes, and suboptimal decision-making. Because this behaviour of the Bray-Curtis similarity index is undesirable, it would be important to test it with other approaches as well. The addition of dummy species resulted in the mean error being more evenly distributed around zero, but also resulted in steeper regressions of error against similarity.
The Bray-Curtis index is widely used in ecological packages in R (e.g. abdiv: Bittinger 2020; ecodist: Goslee & Urban 2007; vegan: Oksanen 2016; wiqid: Meredith 2020) and has been used in at least 1300 scientific papers since 2000 (based on both Scopus and Web of Science abstracts). Moreover, this similarity index is not only applied to directly compare assemblages but is also commonly used to build dissimilarity matrices, which are the basis for further statistics, such as Principal Coordinate Analysis, Permutational Multivariate Analysis of Variance or Nonmetric Multidimensional Scaling (Kindt & Coe 2005; Borcard et al. 2011; Anderson 2017). The fact that similarities are often underestimated and of low precision is likely to compromise the results of these analyses, and this becomes more important when only small sample sizes are available. Additional problems stem from datasets comprising high and low similarity values, as similarity is more likely to be underestimated when assemblages are more similar. A further statistical approach that commonly uses Bray-Curtis similarities is the distance-decay plot (Anderson et al. 2011; Wetzel et al. 2012; Filloy et al. 2015; Hardersen et al. 2017). In this case, the undesirable property of the Bray-Curtis similarity index described above leads to more severe underestimates for assemblages that are more similar, resulting in regression lines with a decreased slope. More importantly, more severe underestimates are to be expected when sample sizes are smaller, and this tendency is likely to seriously bias the resulting distance-decay regression, especially if small and large sample sizes are contained in a single analysis.

S. Hardersen and G. La Porta
Based on our findings, calculating Bray-Curtis similarities with sample sizes of fewer than 300 individuals should be avoided, as these are likely to result in underestimates and offer low precision. Reliable estimates can only be computed with datasets which contain more than 300 or 900 individuals and for which the Good-Turing sample completeness is approximately 98%; the choice of this minimum sample size depends on what is deemed an acceptable level of precision. This minimum sample size is of the same order of magnitude as the 750-1550 individuals indicated by Schneck and Melo (2010) as necessary to adequately estimate resemblance, using the Bray-Curtis index, for macroinvertebrate assemblages in tropical streams.
It is also important to avoid unequal sampling fractions, which lead this index to perform erratically (Chao et al. 2006). We also suggest always adding dummy species to the abundance matrix (Clarke et al. 2006), because this results in better and more stable estimates of Bray-Curtis similarities, as underestimates become less likely and precision is increased. As already pointed out by Clarke et al. (2006), differences between the zero-adjusted measure, with a dummy value of 1, and the original Bray-Curtis dissimilarities are slight for samples of counts which contain at least a modest number of individuals. With increasing numbers of individuals sampled, these differences become ever smaller, as is evident from our numerical simulations. It is important to point out that the addition of dummy species alters the resulting Bray-Curtis similarity slightly, but if the same dummy species are added to all abundance matrices, the resulting similarities can validly be compared. The relative values of dissimilarity matter more than their absolute values (Clarke et al. 2006), as most ecologists would be satisfied if the dissimilarities of sites in the sample matrix in relation to each other correspond with their relative position derived from complete data (Beck et al. 2013). Similarly, it is often recommended to downweight the contributions of the dominant species by transformation prior to calculating Bray-Curtis similarities (Kindt & Coe 2005; Clarke et al. 2006; Borcard et al. 2011), which also alters the resulting Bray-Curtis similarity.
As shown above, the addition of more dummy species results in less severe underestimates and an increase in precision of the resulting similarity values. At the same time, it strengthens the negative correlation of the similarity with the mean error, and the addition of five dummy species resulted in consistent overestimates at low similarity values. Overestimation of beta diversity is as undesirable as underestimation. By including an excessive number of dummy species, estimates may be based on artificial and unrealistic conditions, which can lead to poor results in ecological research and misinform strategic planning and biodiversity management. Our data seem to indicate that the addition of one to three dummy species to abundance matrices is a good choice, as underestimates of similarities are reduced, precision is increased, and error is more evenly distributed around zero.

Conclusions
In summary, our study shows the importance of considering the error introduced by undersampling on the performance of the Bray-Curtis index and its correlation to the similarity observed when the complete datasets are considered. The results show that undersampling generally resulted in underestimates of assemblage similarities and that a possible countermeasure to reduce the severity of this directional bias is the addition of dummy species to the original abundance dataset. This measure is especially important for samples with fewer than 300 individuals. The Bray-Curtis similarity is frequently employed (Jost et al. 2011; Bacaro et al. 2012), and many textbooks and papers suggest using this index (e.g. Clarke & Warwick 2001; Kindt & Coe 2005; Schroeder & Jenkins 2018). However, caution is required when using the Bray-Curtis similarity (Chao et al. 2006; Jost et al. 2011), and by employing sufficiently large sample sizes and adding dummy species, more accurate estimates of beta diversity can be achieved.

Figure 3. Accuracy plots, which indicate the proportion of similarity values, calculated for increasing numbers of individuals drawn from the 16 datasets. Black point-ranges represent mean values (±sd) that fall within the range of ±10% around the similarity value calculated with the complete dataset; red dashed lines represent the mean rate of underestimation and blue dashed lines the mean rate of overestimation.

Figure 4. Mean errors on 16 datasets with an increasing number of (i) resampled individuals (left to right) and (ii) dummy species (top to bottom). All data are means of 1000 random draws and unbiased estimates have a mean error of zero (indicated by a red dotted line). Correlation coefficients r are reported in labels.

Figure 5. Boxplot of errors with an increasing number of dummy species and regression lines. All data are means of 1000 random draws. Correlation coefficients and P values are reported in labels.

Table I. Original abundance (species richness), percentage abundance difference for the tested dataset, number of shared species, and Bray-Curtis similarity values.

Table II. P-values of Levene's test obtained by comparing variances estimated adding zero, one, three or five dummy species at different numbers of individuals drawn from the original dataset.
with no species; Battaglia et al. (2005): Many sites with fewer than 10 individuals, and some none at all; Stevens and Connolly (2005): Sites with only one or two taxa and very low densities; Luja et al. (2008): Assemblages on average consist of 7 individuals and

Table III. Mean values (±sd) of Good-Turing sample completeness at the increasing number of individuals resampled from the species of the original dataset.