Test-retest reliability on the Cambridge Neuropsychological Test Automated Battery: Comment on Karlsen et al. (2020)

Abstract

Test-retest reliability is essential to the development and validation of psychometric tools. Here we respond to the article by Karlsen et al. (Applied Neuropsychology: Adult, 2020), reporting test-retest reliability on the Cambridge Neuropsychological Test Automated Battery (CANTAB), with results that are in keeping with prior research on CANTAB and the broader cognitive assessment literature. However, after adopting a high threshold for adequate test-retest reliability, the authors report inadequate reliability for many measures. In this commentary we provide examples of stable, trait-like constructs which we would expect to remain highly consistent across longer time periods, and contrast these with measures which show acute within-subject change in response to contextual or psychological factors. Measures characterized by greater true within-subject variability typically have lower test-retest reliability, requiring adequate powering in research examining group differences and longitudinal change. However, these measures remain sensitive to important clinical and functional outcomes. Setting arbitrarily elevated test-retest reliability thresholds for test adoption in cognitive research limits the pool of available tools and precludes the adoption of many well-established tests showing consistent contextual, diagnostic, and treatment sensitivity. Overall, test-retest reliability must be balanced with other theoretical and practical considerations in study design, including test relevance and sensitivity.


KEYWORDS
Cognition; CANTAB; measurement/statistics; neuropsychology; tests; test-retest reliability

Comment

Karlsen et al. (2020) recently published a paper in the journal "Applied Neuropsychology: Adult" on the test-retest reliability of the Cambridge Neuropsychological Test Automated Battery (CANTAB) in 75 healthy individuals assessed twice over a period of three months. The authors define outcome measures achieving acceptable test-retest reliability as those meeting correlation coefficients between testing occasions of at least (Pearson's) r = .75. Karlsen et al. (2020) report that three of fourteen CANTAB outcome measures reach this threshold (21%) and describe inadequate reliability for the remainder of outcome measures (Pearson's r range .39-.73). In the current response to this article, we would like to introduce a broader debate around the interpretation of these findings and discuss the application of reliability measurements in cognitive assessments for clinical research.
The test-retest reliability coefficients documented by Karlsen et al. (2020) are broadly in keeping with reliability levels previously reported for CANTAB (Barnett et al., 2010; Cacciamani et al., 2018; Feinkohl et al., 2020; Fowler et al., 1995; Gonçalves et al., 2016; Lowe & Rabbitt, 1998). As noted by Karlsen et al. (2020) themselves, test-retest reliability below r = .75 is not unique to CANTAB but is commonly reported for many traditional neuropsychological measures. This includes well-established traditional neuropsychological tests across a range of cognitive domains, including planning, inhibition and memory (Calamia et al., 2013; Köstering et al., 2015; Soveri et al., 2018), as well as cognitive tests assessed using other computerized test batteries (Cole et al., 2013). In a meta-analysis by Calamia et al. (2013), the average test-retest reliability of many common neuropsychological tests on immediate retest was estimated at r = .71, and only around a third of test outcomes reached thresholds of r = .75 and above.
Test-retest reliability is essential to the development of psychometric tools and refers to the reproducibility of two or more measurements using the same tool for the same person under the same conditions, where we would not expect the individual to have changed on the outcome (Aldridge et al., 2017). Good test-retest reliability gives us confidence that the tool accurately represents the individual characteristics of a person at a given time, allowing us to use tools to support diagnostics, differentiate participant cohorts, and identify genuine change (Aldridge et al., 2017). Given the greater difficulty of reliably detecting thresholds of impairment and change over time with assessments that have lower test-retest reliability, and the need for larger samples to detect group differences for such measures (Charter & Feldt, 2001; Chelune et al., 1993; Soveri et al., 2018), scientists are reasonably inclined to seek out assessments with a high degree of test-retest reliability for their research.
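The sample size implication can be sketched under classical test theory, where an observed standardized group difference shrinks by the square root of the measure's reliability. The following is an illustrative calculation, not an analysis from this commentary; the true effect size, alpha, and power values are hypothetical and serve only to show the scaling:

```python
# Back-of-the-envelope sample size calculation for a two-group comparison.
# Under classical test theory, the observed effect size is attenuated by
# the square root of the measure's reliability. The effect size, alpha
# and power values below are hypothetical choices for illustration.
from math import ceil, sqrt

Z_ALPHA, Z_BETA = 1.96, 0.84  # two-sided alpha = .05, power = .80
d_true = 0.5                  # hypothetical true standardized difference

n_required = {}
for reliability in (0.90, 0.75, 0.50):
    d_obs = d_true * sqrt(reliability)  # attenuated observed effect size
    # Standard normal-approximation formula for n per group in a t-test
    n_required[reliability] = ceil(2 * (Z_ALPHA + Z_BETA) ** 2 / d_obs ** 2)
    print(f"reliability = {reliability:.2f}: "
          f"n per group = {n_required[reliability]}")
```

Since the required n per group scales with the reciprocal of reliability, halving the reliability doubles the sample needed to detect the same true difference, which is the sense in which lower test-retest reliability can be absorbed by larger samples.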
Thresholds for adequate reliability diagnostics vary from textbook to textbook, and paper to paper. The threshold adopted by Karlsen et al. (2020) is one of a range of shorthand rules of thumb (also see Cicchetti, 1994;Hopkins et al., 1990), which each provide differing thresholds for defining acceptable reliability. These are often provided as a matter of opinion without further justification or qualification, and with little or no theoretical basis (Charter & Feldt, 2001). However, when specifying the minimally acceptable threshold for test-retest correlations, it is important to consider the underlying properties of the construct that a given assessment measures. Henry (1959) describes three sources of variance within reliability coefficients: (1) "true" individual difference variance, (2) variation within the individual as well as variation in response to the test situation, which ordinarily cannot be separated, and (3) experimental error. Although reliability coefficients are commonly interpreted as the degree to which a test is free from measurement error, they also incorporate true within-person variability. This is reflected in the decline in test-retest reliability with longer duration between tests (Calamia et al., 2013), likely reflecting an increase in true change over time.
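Henry's three variance sources can be made concrete with a small simulation. This is an illustrative sketch rather than an analysis from the commentary: all standard deviations are hypothetical values chosen to show how larger state (within-person) and error components pull the test-retest correlation down even when the underlying trait is perfectly stable.

```python
# Illustrative simulation of Henry's (1959) three variance sources.
# All standard deviations are hypothetical values chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # simulated participants

trait = rng.normal(0.0, 1.0, n)  # (1) "true" individual difference variance
rs = []
for sd_state, sd_error in [(0.2, 0.2), (0.8, 0.5)]:
    # (2) within-person state variation and (3) experimental error are
    # redrawn independently on each testing occasion.
    t1 = trait + rng.normal(0.0, sd_state, n) + rng.normal(0.0, sd_error, n)
    t2 = trait + rng.normal(0.0, sd_state, n) + rng.normal(0.0, sd_error, n)
    r = float(np.corrcoef(t1, t2)[0, 1])
    rs.append(r)
    print(f"state SD={sd_state}, error SD={sd_error}: test-retest r = {r:.2f}")
```

With the larger state and error components, the correlation falls well below .75 even though the trait is identical on both occasions: a low coefficient can reflect true within-person variability as much as measurement error.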
With stable trait-like constructs, test-retest assessments using appropriately sensitive measures are likely to yield higher test-retest reliabilities, since within-individual variance is minimized. Taking an example from the physical health domain, a person's Body Mass Index (BMI) is modifiable in the longer term through exercise, dietary changes and growth, but with accurate measurement is likely to show only modest change over time. As a result, BMI typically shows high test-retest reliability on repeated short-term and longer-term assessment (intra-class correlations of .95-.96; Brisson et al., 2018; Leatherdale & Laxer, 2013). By contrast, blood pressure varies acutely depending on short-term changes in exercise, stress and anxiety, caffeine, nicotine and alcohol consumption, and body position (standing, seated, lying down), as well as longer-term changes in exercise habits, diet and growth. As a result, test-retest reliability of blood pressure measures is typically much lower (r = .41-.76; Schechter & Adler, 1988; Stergiou et al., 2002). However, despite this variability in measurements, blood pressure remains an integral part of routine clinical assessments for gauging patient health, and is used as an outcome measure in research and clinical trials, where increased test-retest variability is typically overcome by boosting sample size (Golomb et al., 2008).
Likewise, some cognitive measures are more resistant to variation over time and are more likely to meet high thresholds of test-retest reliability. In the domain of task-assessed cognition, measures of crystallized intelligence (e.g. vocabulary, semantic knowledge) tend to be more stable. In a meta-analysis of cognitive assessments, the highest test-retest correlations were obtained for the Vocabulary and Information tests of the Wechsler Adult Intelligence Scale (r = .90-.91), both tests of semantic knowledge (Calamia et al., 2013).
However, for more fluid cognitive domains, such as memory, attention, and processing and response speed, research suggests that cognitive function can vary acutely and systematically in relation to a variety of within-subject changes in response to contextual factors. For example, CANTAB tests tapping into these cognitive domains have been shown to be sensitive to the consumption of caffeinated drinks (Durlach, 1998), sleep-wake cycles (Oosterman et al., 2009), ambient temperature and humidity (Trezza et al., 2015) and acute anxiety induction (Savulich et al., 2019). Whilst these effects provide evidence for the sensitivity of these measures to changes in internal, external and psychological factors, they also represent a challenge when examining test-retest reliability. In test-retest studies, contextual control is therefore important (for example, completing repeated testing at similar times of day, or maintaining consistency across test occasions in the consumption of stimulating substances (caffeine, nicotine) prior to research participation (Barnett et al., 2010)).
Any psychometric tool should aim to capture the construct of interest with the greatest fidelity, and as little measurement error as possible. However, for many cognitive constructs we may need to be realistic and accept that heightened sensitivity to within-subject factors is likely to increase measurement variability, which comes at a cost to test-retest reliability. As discussed by Guyatt et al. (1987), the usefulness of instruments to detect change in individuals over time relies not only on the test-retest reliability of these instruments, but also on their responsiveness, or their ability to detect differences and change. This distinction is of paramount importance when deploying cognitive assessments in clinical trials, where the fidelity of a particular metric must be optimized so that it is sensitive enough to detect performance changes due to therapeutic intervention.
Test-retest reliability matters since it affects the power of a clinical trial to detect a significant treatment effect, and of group comparison studies to detect significant group differences. However, this can be overcome by designing studies with larger samples that absorb increased outcome variability, whilst maintaining sensitivity to group differences or longitudinal change. Test-retest reliability is important, but it is not everything (Barnett et al., 2010), and should be balanced with other theoretical and practical considerations in study design, including test relevance and sensitivity. Just as for measures of physical health, setting arbitrarily elevated test-retest reliability thresholds for test adoption in cognitive research will limit the pool of available tools, and preclude the adoption of many well-established tests showing consistent contextual, diagnostic, and treatment sensitivity.