Test–retest variability in stereoacuity measurements

ABSTRACT Background: A clinician’s choice of stereotest is influenced by the robustness of the measurement, in terms of sensitivity, specificity and test–retest variability. There are limited data on the test–retest variability of newer stereotests and on how they compare to the more commonly used stereotests. Therefore, the aim of the study was to determine the test–retest variability of four different measures of stereoacuity (TNO, Frisby, Lang Stereopad and Asteroid (Accurate STEReotest On a mobIle Device)) and to compare the stereoacuity measurements between the tests in an adult population. Methods: Stereoacuity was measured twice using TNO, Frisby, Lang Stereopad and Asteroid. Inclusion criteria were age 18 years or older, no known ophthalmic condition and VA (Visual Acuity) equal to or better than 0.3 logMAR (Logarithm of the Minimum Angle of Resolution) with an interocular difference of less than 0.2 logMAR. Bland–Altman analysis was used to assess agreement within and between stereotests. Differences in stereo thresholds were compared using Wilcoxon signed-rank tests. Results: Fifty-four adults (male: 23 and female: 31) with VA equal to or better than 0.3 logMAR in either eye and an interocular difference of less than 0.2 logMAR were assessed (mean age: 38 years, SD: 12.7, range: 18–72). The test–retest variability of all the clinical stereotests, with the exception of the Lang Stereopad (p = .03, Wilcoxon signed-rank test), was clinically insignificant as the mean bias was equal to or less than 0.06 log seconds of arc (equivalent to 1.15 seconds of arc). While the Asteroid test had the smallest variation between repeated measures (mean bias: −0.01 log seconds of arc), the Frisby and Lang Stereopad tests had the narrowest and widest limits of agreement respectively.
When comparing results between tests, the biggest mean bias was between Frisby and Lang Stereopad (−0.62 log seconds of arc), and 64.8% and 31.5% of differences were in the medium (21–100” of arc) and large (>100” of arc) ranges respectively. Conclusion: The TNO and Frisby tests have good reliability but measure stereoacuity over a narrower range than the Asteroid, which shows less variation on repeated testing and has a larger testing range. The data reported here show varying degrees of agreement in a cohort of visually normal participants, and further investigation is required to determine whether there is further variability when stereoacuity is reduced.


Introduction
Stereopsis, the highest grade of binocular vision, is a measure of a person's ability to detect depth through the visual cortex's processing of disparate retinal images. Assessment of stereoacuity is an integral component of the orthoptic investigation, with a range of clinical tests available. These tests vary significantly in many aspects of the test design, including the range of disparities measured, the presence of monocular cues and the method of presenting the disparate images, utilizing polarizing, anaglyph, lenticular or physical/"real" depth stimuli. 1 A clinician's choice of test can be influenced by the weaknesses/limitations present in some stereotests. These include the presence of monocular cues, the ability to guess a correct answer and a limited range of disparities measurable, both in terms of the overall range and fixed options within the upper and lower limits. In addition to variations between tests, there are also variations within tests; for example, the more recent version of the TNO test yields lower stereoacuity than the previous version. 2 To limit the impact of these factors, modifications have been proposed for some tests. For example, with the Wirt fly test, the largest disparity of 3000" can easily be passed with monocular viewing, but the introduction of additional glasses that provide a monocular view requires patients to consistently provide a positive response only when disparity is present. 3 In addition, there are new tests available that aim to address these limitations. The Lang Stereopad minimizes the potential guess rate with up to six stereo cards presented at once, 4 whereas the Asteroid (Accurate STEReotest On a mobIle Device) test uses a glasses-free 3D tablet to present the stimuli within a computer game designed to be more engaging, and uses a dynamic random dot pattern that eliminates monocular cues. It utilizes the camera to detect test distance, which results in automatic calibration of the disparity size, and an adaptive staircase is utilized to determine the stereoacuity threshold. 5
The purpose of assessing stereoacuity is to evaluate a patient's current state of binocular vision to determine whether it is normal and whether it has changed in relation to the previous findings. [7][8] Given the impact of the results, it is essential that the assessment and interpretation are accurate. Test accuracy can be defined by different measures such as testability (the proportion of people able to successfully complete the test), test-retest repeatability within subject and between testers and the sensitivity/specificity in detecting deficient stereopsis. To be able to interpret an individual test response, normative data are required across all age groups, and to determine whether there has been a change in stereoacuity, the test-retest variation (TRV) must be known, as these values vary between tests. Due to changes in the visual system and cognitive development, a normal value for stereopsis improves and variability reduces with increasing age during childhood, which requires normative data over the life span to ensure accuracy in the interpretation of results. [11] As shown, there is evidence relating to the normative data in the adult population; however, it is limited in relation to the TRV. Identical scores on repeated testing have been found in 25-73% of children, [12][13][14][15] with the validity of an abnormal result being considered questionable for some tests due to the considerable variation, 16 but the contribution of children's variable cognitive abilities to this finding is not known. For adults, only two reports were found, which assessed the Randot circles, Asteroid, Frisby and Titmus, 17,18 with none found for the Lang Stereopad or TNO.
As test choice is influenced by the patient's cognitive ability or clinician preference, comparison between tests may also be of interest if different stereoacuity tests are used throughout a patient's care. Reports have shown a moderate-to-good level of correlation between a range of tests, 17 as anticipated given that they are all measures of the same visual function, but variability does exist, in particular between varying methods of stimulus presentation (e.g. real depth compared to Randot test 19 or contour-based circles and random dot presentation). 20 As the Asteroid test is relatively new, the only comparisons found did not include the commonly used Frisby test, thus warranting further investigation. As the Frisby and TNO tests are the most commonly used tests in the UK, 21 including them is important to facilitate comparison with standard clinical tests. Therefore, the aims of this study were to determine the test-retest variability of four different measures of stereoacuity (TNO, Frisby, Lang Stereopad and Asteroid) and to compare the stereoacuity values between the tests in an adult population.

Participants
Adult participants were recruited from the friends and family of the investigators and the University of Liverpool network. Inclusion criteria were age 18 years or older and no known ophthalmic conditions. Participants were also required to have VA (Visual Acuity) equal to or better than 0.3 logMAR (Logarithm of the Minimum Angle of Resolution) in either eye and an interocular difference of less than 0.2 logMAR. All participants were required to provide consent prior to participation. This study was approved by the University of Liverpool ethics committee and followed the tenets of the Declaration of Helsinki.

Test procedures
Testing was performed under standard illumination (500 lux). Participants with distance or near refractive errors wore their habitual correction during testing. Uniocular visual acuity was measured (per letter) using the ETDRS (Early Treatment of Diabetic Retinopathy Study) chart for near and distance. Twenty-four different permutations for the order of testing TNO, Frisby, Lang Stereopad and Asteroid (Version 1.0.42) were used to minimize any bias. The tests were repeated for each participant, resulting in two measurements per stereotest. For the Lang Stereopad and Frisby, the stereoacuity threshold was recorded as the smallest disparity correctly identified on three presentations. The Lang Stereopad was measured as a four-alternative forced choice test, and Frisby was measured to threshold at 10 cm increments. If a participant gave an incorrect response, the previous distance was repeated to ensure that threshold was reached at that level. For TNO, participants were required to identify both targets (plates V-VII) at each threshold.
Participants were asked to view each of the stereotests while placing their head on a chin rest to avoid parallax and use of monocular cues. Test distances were marked on a table, where the tests were placed on a box at eye level to ensure accuracy. For the Asteroid test, a distance tracker sticker was placed on the participant's forehead, and the participant was instructed to hold the pad.
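The counterbalancing of test order described above can be sketched as follows. This is a minimal illustration rather than the study's actual allocation procedure: four tests give 4! = 24 possible orders, and the rule used here to allocate an order to each participant (cycling through the list with the hypothetical function `order_for`) is an assumption, as the text does not state how orders were assigned.

```python
from itertools import permutations

TESTS = ("TNO", "Frisby", "Lang Stereopad", "Asteroid")

# Every possible presentation order of the four stereotests (4! = 24).
ORDERS = list(permutations(TESTS))


def order_for(participant_index):
    """Hypothetical allocation rule: cycle through the 24 orders in turn."""
    return ORDERS[participant_index % len(ORDERS)]


if __name__ == "__main__":
    print(len(ORDERS))   # number of distinct test orders
    print(order_for(0))  # order allocated to the first participant
```

With 54 participants, a cycling rule of this kind would use each order at least twice.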

Statistical methods
Each test had a different range of measurable stereoacuity in seconds of arc (TNO: 480-15", Frisby: 600-20", Lang Stereopad: 800-50" and Asteroid: 1000-1.25"). Therefore, an arbitrary high value of 3000" was used to indicate that there was no measurable stereopsis. All stereo data were log-transformed to allow statistical analysis as the log thresholds are closer to a normal distribution. Normality tests were conducted using the Kolmogorov-Smirnov test, and despite log transformation, the stereo thresholds were not normally distributed. Therefore, non-parametric tests were used during the analysis. Differences in thresholds between the stereotests were examined with the Wilcoxon signed-rank test. A Bonferroni correction for multiple comparisons was used; therefore, significance was adjusted to p < .008. Bland-Altman analysis was employed for assessing agreement within and between the stereotests. The upper and lower limits of agreement were defined as the mean bias ± 1.96 standard deviations of the differences.
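The log transformation and Bland-Altman calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the function names are hypothetical, the example thresholds are invented for demonstration only, the sample standard deviation is assumed, and the adjusted significance level is assumed to come from dividing .05 by the six pairwise comparisons possible among four tests.

```python
import math
import statistics

NO_STEREO_ARCSEC = 3000.0    # arbitrary value for "no measurable stereopsis" (from the text)
BONFERRONI_ALPHA = 0.05 / 6  # assumption: six pairwise comparisons among four tests


def log_threshold(arcsec=None):
    """Return log10 of a threshold in seconds of arc; None means no measurable stereopsis."""
    value = NO_STEREO_ARCSEC if arcsec is None else arcsec
    return math.log10(value)


def bland_altman(first, second):
    """Return (mean bias, lower limit, upper limit) for paired log thresholds.

    The bias is first minus second, so a positive bias means the first
    measurement gave the higher score.
    """
    diffs = [a - b for a, b in zip(first, second)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the paired differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


if __name__ == "__main__":
    # Invented example thresholds in seconds of arc (not study data).
    test1 = [log_threshold(x) for x in (60, 120, 30, None, 60)]
    test2 = [log_threshold(x) for x in (60, 60, 30, 480, 120)]
    print(bland_altman(test1, test2))
    print(round(BONFERRONI_ALPHA, 3))
```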

Test-retest variability
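Table 1 illustrates the stereo threshold data for each stereotest measured during tests 1 and 2. Stereo data are presented as median seconds of arc and the equivalent log as the data were not normally distributed. The closest agreement (smallest mean bias) between tests 1 and 2 is for the Asteroid test, followed by the Frisby and TNO, with the largest variability being for the Lang Stereopad test. No significant differences were found between stereo thresholds measured during tests 1 and 2 for all tests except the Lang Stereopad (p = .03, Wilcoxon signed-rank test). For TNO, the difference in stereoacuity between tests 1 and 2 did not reach statistical significance (p = .05, Wilcoxon signed-rank test), although the interquartile range (IQR) was slightly larger on the second measurement, indicating that there is some variability. The IQR is lower on the second attempt (test 1 = 150" of arc and test 2 = 350" of arc) for the Lang Stereopad, and more participants (N = 19) had a better stereo threshold during test 2, suggesting a possible practice effect. There is very little change in the IQR for Asteroid (p = .99, Wilcoxon signed-rank test) and Frisby (p = .14, Wilcoxon signed-rank test) between tests 1 (IQR = 72.75" of arc) and 2 (IQR = 108.50" of arc), indicating that there is less variability (Table 1). Figure 1 compares the results from tests 1 and 2 for each of the stereotests. A number to the left of a point indicates how many data points overlap there, and a point without a number represents a single data point. There are no overlapping points on the Asteroid test-retest figure. Frisby has the least amount of variability between tests 1 and 2, and while there are no overlapping points for the Asteroid in Figure 1, the points are closely clustered, and the limits of agreement are narrower than those of the Lang Stereopad (Table 1).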

Between test comparisons
Paired comparisons across stereotests were conducted using the value obtained on test 2 to minimize any practice effects. Most pairwise comparisons were significant (p < .008, Wilcoxon signed-rank test with Bonferroni correction); the exceptions were TNO vs Asteroid and Lang Stereopad vs Asteroid, which may be due to the spread of values for the Asteroid test (Table 2). The median thresholds measured for TNO and Asteroid were very similar, with a mean difference of −0.07 log arcsec (Table 2). However, while stereoacuity measured with Lang Stereopad was similar to the Asteroid (mean bias: 0.16), the measurements with Lang Stereopad (2.00 log = 100" of arc) were slightly worse than the Asteroid (1.91 log = 81.50" of arc), a difference that is significant at the conventional level of p < .05 though not after the Bonferroni correction (Tables 1 and 2).
As there are overlapping data points on the scatterplots (Figure 2), Figure 3 summarizes the test differences into small (<21"), medium (21-100") and large (>100"). When comparing the TNO and Frisby, two-thirds of the differences were either small or medium (Figure 3), whereas large differences (41%) were seen between Lang Stereopad and Asteroid (Figure 3) despite the non-significant comparison between them (Table 2). Similarly, despite there being no significant difference between TNO and Asteroid on pairwise testing, 59% and 37% of differences between the two tests are medium and large, respectively (Figure 3).
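The grouping of between-test differences into small, medium and large can be sketched as follows. This is a minimal illustration, not the study's code: it assumes the difference is taken as the absolute difference in seconds of arc between the two tests for each participant, the function names are hypothetical, and the example pairs are invented for demonstration only.

```python
from collections import Counter

CATEGORIES = ("small", "medium", "large")


def difference_category(arcsec_a, arcsec_b):
    """Classify the absolute difference (seconds of arc) between two thresholds."""
    diff = abs(arcsec_a - arcsec_b)
    if diff < 21:
        return "small"   # < 21" of arc
    if diff <= 100:
        return "medium"  # 21-100" of arc
    return "large"       # > 100" of arc


def category_percentages(pairs):
    """Return the percentage of participant pairs falling in each category."""
    counts = Counter(difference_category(a, b) for a, b in pairs)
    return {name: 100.0 * counts[name] / len(pairs) for name in CATEGORIES}


if __name__ == "__main__":
    # Invented example thresholds (test A, test B) in seconds of arc.
    example = [(60, 60), (120, 60), (480, 60), (30, 40)]
    print(category_percentages(example))
```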

Discussion
In this cohort of visually normal adult participants, all stereotests, except the Lang Stereopad, had minimal (≤0.06 log seconds of arc, equivalent to 1.15 seconds of arc) test-retest variability. However, when comparing results between tests, stereoacuity measurements differed significantly between most pairs of tests.
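The equivalence quoted above between a bias in log units and seconds of arc can be reproduced by back-transforming with a power of 10. The sketch below is illustrative only, and the function name is hypothetical. Because a difference of log values corresponds to a ratio of the raw thresholds, the back-transformed figure may also be read as a multiplicative factor between the two measurements.

```python
def log_bias_to_arcsec(log_bias):
    """Back-transform a bias expressed in log10 seconds of arc."""
    return 10 ** log_bias


if __name__ == "__main__":
    # 0.06 log seconds of arc, the bound quoted in the text.
    print(round(log_bias_to_arcsec(0.06), 2))  # 1.15
```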
On test-retest analysis, the Asteroid test had the smallest mean bias followed by the Frisby and TNO, indicating that it is the least variable on repeated testing. The wider limits of agreement for Lang and Asteroid may be due to the wider range over which stereoacuity can be measured with these tests, with the range of the Lang being 750" and Asteroid 999", compared to the 465" of TNO and 580" of Frisby. The repeatability for Frisby (mean bias: 1" of arc) agrees with a study of young adults with normal binocular vision that reported good repeatability for Frisby (mean bias: 2" of arc), in which 89% of participants achieved the lowest disparity (20" of arc). 18 The limits of agreement for Asteroid are slightly wider, but the mean bias in our study was smaller than that reported by McCaslin et al. 17 Their study had 39 participants, comprising children and adults younger than 50 years of age (mean bias: 0.058 log arcsec, 95% limits of agreement: ±0.370). While there was less than 0.1 log seconds (1.25" of arc) mean bias for TNO, Frisby and Asteroid, there was a statistically significant difference in test-retest for the Lang Stereopad (mean bias = 0.14 log seconds equivalent to 1.39" of arc, p = .03), which is contrary to Rowe et al.'s 4 analysis on a subset of their participants (N = 36, p = .425). However, the mean bias (1.39") is not clinically significant and may be explained by the practice effect as the interquartile range decreased on the second attempt.
When comparing measurements between tests, TNO and Frisby appear the closest, with the largest proportion of results showing a small or medium difference (Figure 3), despite the different methods of presentation. Our data (Figure 3) support the typical finding of TNO measuring higher (worse) thresholds of stereoacuity in observers with binocular vision, 22 and a possible explanation is the use of anaglyph 3D glasses, as this has been shown to produce artifacts when testing binocular vision due to the potential interocular contrast differences 23 and reduction in binocular motor fusion. 22 The red-green glasses are also dissociative due to the color mismatch, which has been shown to reduce stereopsis. 24 Red-green glasses have also been reported to allow significant cross talk, whereby part of the image that is presented to one eye passes through the filter of the other eye and reduces stereoacuity. 25 Frisby, on the other hand, is described as measuring "real depth", 19 as it does not involve the use of polarizing filters or anaglyph glasses to appreciate depth. Serrano-Pedraza et al. 25 described Frisby as using physical depth, whereby motion parallax can be used to detect the target without using stereopsis. However, in our study, a chin rest was used to minimize the impact of motion parallax on the stereo threshold. Hence, the differences are due to inherent modes of presentation of the stereotests, for example, anaglyph versus "real depth" measurements, rather than an artifact of the methodology used in this study. While there was no significant difference between TNO and Asteroid with a very small mean bias, a large proportion of individuals (59%) had medium differences (21-100" of arc) in stereo measurements between TNO and Asteroid (Figure 3). This may be explained by the difference in the range measured by each of these tests, where TNO and Frisby have narrower testing ranges compared to the Lang Stereopad and Asteroid (TNO: 480-15" and Asteroid: 1000-1.25"), as well as the different modes of presentation, for example, dot size or static vs dynamic random dot. Asteroid uses a dynamic random dot stereogram to eliminate monocular cues and will produce erroneous results if it is held too close or tilted to detect the target using motion parallax. 5 The higher thresholds obtained with the Asteroid have been explained by the differences in dot size compared to other tests based on random dot stereograms and by the fact that it is dynamic as opposed to static in the other tests. 5

Study limitations
Testing visually normal participants does allow us to evaluate the test efficacy, but it is known that reduced stereoacuity can affect the variability. 1 Hence, further evaluation is required in a wider clinical cohort of varying abilities and to determine how these measures would be affected by different ophthalmic conditions, for example, in patients with amblyopia and/or strabismus. There is evidence to suggest that stereoacuity declines with age, 10,11,26,27 but as comparisons are within participants, this does not affect the conclusions of this study. A further potential source of variability is that stereoacuity was measured by two students studying for a Nuffield Science Project, and while they were trained and had a strict protocol to adhere to, there is potential for variation from an experienced orthoptist.

Summary
The standard clinical tests, TNO and Frisby, have good reliability but cannot be used interchangeably. Therefore, stereotest selection for patients should remain constant between visits. The Asteroid had good reliability and compared well with the TNO in adults with good visual acuity. However, further evaluation of these tests is required to determine reliability in larger cohorts, in patients with impaired stereoacuity and in older adults with normal visual acuity and binocular vision.

Figure 1. Scatterplot of test 1 vs test 2 for each stereotest. Numbers to the left of a point indicate the number of overlapping points, and points without a number indicate a single value.

Figure 2. Scatterplots comparing each pair of stereotests. Numbers to the left of a point indicate the number of overlapping points, and points without a number indicate a single value.

Table 2. Paired analysis between stereotests (Wilcoxon signed-rank test) (N = 54). *Significant at p < .008 (Bonferroni correction for multiple comparisons). A positive mean bias means the first test had a higher stereoacuity score.