Temporal Stability of the Dutch Version of the Wechsler Memory Scale—Fourth Edition (WMS-IV-NL)

Objective: The Wechsler Memory Scale—Fourth Edition (WMS-IV) is one of the most widely used memory batteries. We examined the test–retest reliability, practice effects, and standardized regression-based (SRB) change norms for the Dutch version of the WMS-IV (WMS-IV-NL) after both short and long retest intervals. Method: The WMS-IV-NL was administered twice after either a short (M = 8.48 weeks, SD = 3.40 weeks, range = 3–16) or a long (M = 17.87 months, SD = 3.48, range = 12–24) retest interval in a sample of 234 healthy participants (M = 59.55 years, range = 16–90; 118 completed the Adult Battery; and 116 completed the Older Adult Battery). Results: The test–retest reliability estimates varied across indexes. They were adequate to good after a short retest interval (ranging from .74 to .86), with the exception of the Visual Working Memory Index (r = .59), yet generally lower after a long retest interval (ranging from .56 to .77). Practice effects were only observed after a short retest interval (overall group mean gains up to 11 points), whereas no significant change in performance was found after a long retest interval. Furthermore, practice effect-adjusted SRB change norms were calculated for all WMS-IV-NL index scores. Conclusions: Overall, this study shows that the test–retest reliability of the WMS-IV-NL varied across indexes. Practice effects were observed after a short retest interval, but no evidence was found for practice effects after a long retest interval from one to two years. Finally, the SRB change norms were provided for the WMS-IV-NL.


INTRODUCTION
In clinical practice, repeated neuropsychological assessments across time are often necessary to monitor patients with a variety of neurological and psychiatric disorders (Heilbronner et al., 2010; Lezak, Howieson, Bigler, & Tranel, 2012). For instance, repeated assessments are used to monitor the efficacy of cognitive rehabilitation, pharmacological, or neurosurgical treatments (Chelune, 2003; Chelune, Naugle, Lüders, Sedlak, & Awad, 1993; Schoenberg et al., 2012). Also, serial assessments provide insight into the course of cognitive decline in patients with neurodegenerative disorders, such as dementia (Duff, Chelune, & Dennett, 2012). As memory problems are the most prevalent cognitive deficits in a variety of clinical pathologies, reliable repeated assessment of memory functioning plays a crucial role in neuropsychological evaluations. Therefore, there is a demand for evaluation of test-retest reliability and practice effects of memory tests.
Previous research has shown that there are differences in test-retest reliability among different cognitive domains. The test-retest reliability of memory tests is generally found to be poorer than that of tests assessing other cognitive functions (Calamia, Markon, & Tranel, 2013; Ivnik et al., 1995; McCaffrey et al., 1993; McCaffrey & Westervelt, 1995), especially when the retest interval is longer (Domino & Domino, 2006). It has been suggested that normal human memory performance is variable and that caution needs to be exercised when interpreting repeated memory assessments, as poor reliability estimates may cause problems such as failing to detect actual changes in research or in clinical practice.
In addition to test-retest reliability from a pure psychometric perspective, practice effects may further complicate the interpretation of repeated memory assessment. Specifically, practice effects are improvements in test performance on re-evaluations that do not reflect genuine improvement in the underlying construct, but may be related to other processes such as recollection of specific items, learned test-taking strategies, or familiarity with the test occasion (Calamia, Markon, & Tranel, 2012). Notably, if all participants' scores increase or decrease by the same amount, the reliability is still high. Therefore, it is possible that a test has a high reliability, but at the same time reveals large practice effects.
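The distinction between reliability and practice effects can be illustrated with a minimal sketch. The scores below are hypothetical and are not taken from the study; the helper `pearson_r` is written out only so that the example is self-contained.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical index scores for six participants (not study data).
baseline = [85, 92, 100, 104, 111, 118]
# Assumption: every participant gains exactly 10 points at retest.
retest = [score + 10 for score in baseline]

r = pearson_r(baseline, retest)
mean_gain = sum(retest) / len(retest) - sum(baseline) / len(baseline)
print(f"r = {r:.2f}, mean gain = {mean_gain:.1f}")
```

In this constructed case the rank order and spacing of the participants are unchanged, so the correlation stays at its maximum even though the group mean has shifted by 10 points.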
By definition, memory tests are especially susceptible to practice effects, because repeated assessment will enhance retrieval of specific items (Lezak et al., 2012; McCaffrey et al., 1993). However, findings on practice effects in the literature are difficult to compare, since practice effects are influenced by many different factors which vary across studies. For example, population-specific effects such as younger age and higher intellectual ability are related to larger practice effects (McCaffrey & Westervelt, 1995; Mitrushina & Satz, 1991; Rabbitt, Diggle, Smith, Holland, & McInnes, 2001; Rapport et al., 1997). Moreover, an increasing number of re-administrations (Collie, Maruff, McStephen, & Darby, 2003; Ferrer, Salthouse, Stewart, & Schwartz, 2004; Theisen et al., 1998) and shorter lengths of the test-retest interval are associated with larger practice effects (Calamia et al., 2012).
Until now, only two studies have addressed retest effects of the latest edition of the WMS, the Wechsler Memory Scale-Fourth Edition (WMS-IV), which are reported in the test's Technical Manual (Holdnack & Drozdick, 2009) and in the Advanced Clinical Interpretation publication (Holdnack et al., 2013). These studies used the test-retest sample of the US WMS-IV, which consists of 244 participants (173 completed the Adult Battery and 71 the Older Adult Battery). The WMS-IV was administered twice, with retest intervals of 2 to 12 weeks (M = 23 days). For the Adult Battery, the test-retest reliability coefficients for the index scores ranged between .81 and .83, and average increases ranged from 4.3 points (Visual Working Memory Index) to 13.7 points (Delayed Memory Index). For the Older Adult Battery, the test-retest reliability coefficients for the index scores ranged from .80 to .87, and average increases ranged from 10.6 points (Auditory Memory Index) to 12.4 points (Immediate Memory Index). Based on these results, considerable increments in the WMS-IV index scores seem to occur after short time intervals of several weeks in a healthy sample. Accordingly, Holdnack and colleagues (2013) provided regression equations for all WMS-IV indexes and subtests that can be used to predict reliable change in repeated assessments.
However, it is still unclear whether these performance increments persist after longer retest intervals. It has been suggested that practice effects diminish as time passes, but several studies have shown that practice effects may persist over longer time intervals up to 7-13 years (Basso, Carona, Lowery, & Axelrod, 2002; Salthouse, 2010; Salthouse, Schroeder, & Ferrer, 2004; Salthouse & Tucker-Drob, 2008). With respect to previous versions of the WMS, few studies have reported practice effects after long test-retest intervals in healthy participants (Dikmen et al., 1999; Lo et al., 2012; Mitrushina & Satz, 1991). Overall, these studies found that the magnitude of practice effects after longer retest intervals varies per subtest (e.g., long-lasting practice effects were commonly found on the subtests Logical Memory and Verbal Paired Associates), and is also influenced by demographic variables such as age and intelligence (e.g., long-lasting practice effects were generally seen in younger adults and higher educated participants).
The present study examined the test-retest reliability and practice effects of the Dutch version of the WMS-IV (WMS-IV-NL: Hendriks, Bouman, Kessels, & Aldenkamp, 2014; Wechsler, 2009). Two different test-retest intervals were studied: one group of healthy participants was re-examined after a short retest interval (3-16 weeks) and another group after a long retest interval (12-24 months). It is expected that the test-retest reliability and practice effects after a short retest interval will be comparable to those found in the US WMS-IV test-retest sample (Holdnack & Drozdick, 2009; Holdnack et al., 2013). In addition, it is expected that the test-retest reliability may be somewhat lower after a long retest interval compared to a short retest interval. Moreover, we expect long-lasting learning effects on the verbal and visual episodic memory tasks (i.e., Auditory Memory Index, Visual Memory Index, Immediate Memory Index, and Delayed Memory Index), but not on the visual working memory tasks (i.e., Visual Working Memory Index). Furthermore, we generated standardized regression-based (SRB) change norms to provide statistical directions for detecting significant changes at an individual level for use in clinical practice.

Participants
The sample consisted of 234 healthy persons (age range 16-90 years, mean age = 59.55, SD = 21.36; 100 males) from the WMS-IV-NL standardization sample (Hendriks et al., 2014). Of these participants, 118 completed the WMS-IV-NL Adult Battery, and 116 completed the WMS-IV-NL Older Adult Battery. Participants from different age groups and with different educational levels were recruited by trained assessors through their network, via advertisement, and via a database of Pearson Assessment, the Netherlands. The sample selection was based on the Dutch population according to census results from the Central Office for Statistics of the Netherlands (CBS, 2011). The sample was stratified according to age, sex, education level, and ethnicity, and participants were only included if they met the inclusion criteria: primary language is Dutch; no significant hearing or visual impairment; no psychiatric or neurologic disorder; no substance abuse affecting cognitive functioning; and no use of medicines affecting cognitive functioning.
Of these participants, 134 (49.3% Adult Battery and 50.7% Older Adult Battery) were reassessed after a short interval of approximately 8.48 weeks (SD = 3.40, range 3-16 weeks), and 100 (52% Adult Battery and 48% Older Adult Battery) were reassessed after a longer interval of approximately 17.87 months (SD = 3.48, range 12-24 months). With respect to the frequency distribution, neither the short nor the long-interval data were skewed (skewness and kurtosis coefficients >−1 and <1). Participant characteristics are summarized in Table 1. The WMS-IV-NL standardization study was approved by the Institutional Review Board of Radboud University, Nijmegen and written informed consent was obtained.

Neuropsychological tests
All participants were administered the WMS-IV-NL. This memory battery is divided into an Adult Battery for use in participants aged 16-69, and an Older Adult Battery for use in participants aged 65-90. The WMS-IV-NL Adult Battery comprises one optional subtest, the Brief Cognitive Status Exam (BCSE), and a total of six primary subtests, of which two are visual working memory tests, Spatial Addition (SA) and Symbol Span (SSP), and four are subtests with immediate and delayed recall conditions: Logical Memory I and II (LM), Verbal Paired Associates I and II (VPA), Visual Reproduction I and II (VR), and Designs I and II (DE). These six primary subtests contribute to five index scores: Auditory Memory (AMI), Visual Memory (VMI), Visual Working Memory (VWMI), Immediate Memory (IMI), and Delayed Memory (DMI). The WMS-IV-NL Older Adult Battery comprises the BCSE and a selection of four primary subtests (SSP, LM, VPA, and VR) and four index scores (AMI, VMI, IMI, and DMI).
The Dutch version of the WMS-IV was developed to be equivalent to the original published US version of this test battery. The nonverbal visual stimuli are identical, and the instructions, auditory stimuli, and scoring criteria were translated and adapted to the Dutch language. Pilot studies (first pilot study n = 60; second pilot study n = 120) were performed to check and improve the Dutch language adaptation of the WMS-IV. Moreover, an expert group consisting of clinical neuropsychologists from the Netherlands and Belgium checked the Dutch adaptation after both pilot studies (see also Hendriks et al., 2014, for a detailed description of the development of the WMS-IV-NL). In addition, a previous study revealed that the factor structures of the Dutch and US WMS-IV standardization samples were invariant, which strengthens the case for equivalence (Bouman, Hendriks, Kerkmeer, Kessels, & Aldenkamp, 2015).
Also, the Dutch version of the National Adult Reading Test (NART) was used as an estimate of intelligence (IQ) (Nelson & Willison, 1991; Schmand, Lindeboom, & Van Harskamp, 1992).

Procedures
Administration of the WMS-IV-NL was performed in accordance with the test manual (Hendriks et al., 2014; Wechsler, 2009). The trained assessors were neuropsychologists or research assistants who completed an interactive training about the WMS-IV-NL administration and scoring. Moreover, their performance was monitored and evaluated before and at multiple times during the study. The standardization study of the WMS-IV-NL was accomplished by 93 independent trained assessors, the short-term retest assessment was performed by 31 assessors, and the long-term retest assessment was performed by 9 assessors. All participants in the short-term group were tested by the same assessor twice. However, due to logistic constraints, only 21 participants in the long-term retest group were tested by the same assessor twice.

Statistical analyses
Demographic variables were compared for the groups who were re-evaluated after a short or a long retest interval using analyses of variance (age and NART IQ) and chi-squared tests (sex and education level). For all analyses we used the scaled scores to enhance the comparability to the test-retest studies reported for the US WMS-IV (Holdnack & Drozdick, 2009; Holdnack et al., 2013). Also, these age-corrected scores are more informative for clinicians. All analyses were performed using the Statistical Package for Social Sciences (SPSS) version 19.0, and all effects are reported as significant at p < .05.
Test-retest reliability of the WMS-IV-NL was assessed by Pearson correlation coefficients between the scores from the baseline and re-evaluations (i.e., short- and long-term intervals). Moreover, the reliability coefficients of the short- and long-term retest groups were compared. Also, the reliability coefficients of the short-term retest group were compared to the test-retest reliabilities reported in the US WMS-IV Technical Manual (Holdnack & Drozdick, 2009) using Fisher r-to-z transformation.
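A comparison of two reliability coefficients from independent groups with the Fisher r-to-z transformation could be sketched as follows. This is an illustration under stated assumptions, not the SPSS procedure used in the study: the function name is hypothetical, and the group sizes in the example (66 and 52) are assumed for illustration only.

```python
import math


def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for two correlations from independent samples.

    Returns the z statistic and the two-tailed p value.
    """
    z1 = math.atanh(r1)  # Fisher transformation of r1
    z2 = math.atanh(r2)  # Fisher transformation of r2
    se_diff = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se_diff
    # Two-tailed p from the standard normal: erfc(|z| / sqrt(2)) = 2 * (1 - Phi(|z|)).
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_tailed


# The r values are the VWMI coefficients reported for the Adult Battery;
# the group sizes (66 and 52) are assumptions for illustration only.
z, p = compare_independent_correlations(0.59, 66, 0.56, 52)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With coefficients this close together the difference falls well short of significance, which is in line with the statement that the short- and long-term coefficients were not statistically different.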
To examine the practice effects, we conducted a number of steps. First, we compared the baseline measures of the short- and long-term retest groups using a one-way between-groups multivariate analysis of variance (MANOVA) with Interval (2 levels: short vs. long) as between-subject factor and baseline measures of the index scores (5 levels: AMI, VMI, VWMI, IMI, and DMI) as dependent variables. Next, in order to study whether the short- and long-term retest groups showed different performance increments across time, we conducted a mixed between-within subjects MANOVA with Interval (2 levels: short vs. long) as the between-subject factor, Time (2 levels: baseline vs. re-evaluation) as the within-subject factor, and Index Scores (5 levels: AMI, VMI, VWMI, IMI, and DMI) as dependent variables. Subsequently, we performed paired samples t-tests for the separate groups (i.e., short- and long-term retest) with Time (2 levels: baseline vs. re-evaluation) as within-subject factor and each of the five WMS-IV-NL index scores as dependent variable (corrected alpha of .01 to reduce the risk of Type I errors).
In addition, we used a multivariate SRB approach to determine reliable change on the WMS-IV-NL index scores after short- and long-term retest intervals. According to the procedure described by McSweeny, Naugle, Chelune, and Lüders (1993), multiple regression analyses were employed to derive equations for predicting WMS-IV-NL index scores at re-evaluation from baseline test performance and other predictors. Specifically, hierarchical regression models were performed with the retest WMS-IV-NL index score as dependent variable, and baseline scores, test-retest interval (days), and demographic variables (age, education level, and sex) as predictors. Education level (low, average, and high) was classified according to the Central Office for Statistics of the Netherlands (CBS, 2011), which is based on the International Standard Classification of Education (ISCED; United Nations Educational, Scientific and Cultural Organization Institute for Statistics [UNESCO-UIS], 2011). Only predictor variables that were significant at the .05 level were retained in the model. Next, the intercept and regression coefficients from these models were used to estimate the predicted retest index scores for all participants, and the 90 and 95% confidence intervals were applied to all individuals to determine base rates of significant improvements, declines, and stability on the WMS-IV-NL scores (see also a procedure utilized by .
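The core of the SRB approach, predicting the retest score from the baseline score, can be sketched in its simplest form. The example below is a simplification under stated assumptions: it uses only the baseline score as predictor (the study also considered interval, age, education level, and sex), the function name is hypothetical, and the data are invented for illustration.

```python
import math


def fit_srb_equation(baseline, retest):
    """Least-squares fit of retest = intercept + slope * baseline.

    Returns (intercept, slope, se_est), where se_est is the standard
    error of the estimate with n - 2 degrees of freedom.
    """
    n = len(baseline)
    mean_x = sum(baseline) / n
    mean_y = sum(retest) / n
    sxx = sum((x - mean_x) ** 2 for x in baseline)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(baseline, retest))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(baseline, retest)
    )
    se_est = math.sqrt(ss_res / (n - 2))
    return intercept, slope, se_est


# Hypothetical baseline and retest index scores (not study data).
baseline_scores = [80, 90, 95, 100, 105, 110, 120, 125]
retest_scores = [88, 101, 99, 112, 113, 117, 131, 130]

intercept, slope, se_est = fit_srb_equation(baseline_scores, retest_scores)
predicted_retest = intercept + slope * 100  # predicted retest for baseline 100
print(f"retest = {intercept:.2f} + {slope:.2f} * baseline (SE_est = {se_est:.2f})")
```

Because the intercept and slope are estimated from retested healthy participants, the predicted score already incorporates the average practice effect and regression to the mean for a given baseline level.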

RESULTS
The groups did not significantly differ with respect to age, sex distribution, education level, and NART IQ score (all p > .05) (see Table 1).

Test-retest reliability
The mean WMS-IV-NL index scores across time for both the short- and long-term retest groups are presented in Table 2. Additionally, the mean WMS-IV-NL subtest scores across time for the short- and long-term retest groups can be found in Supplementary Table 1. The correlations between the scores from the baseline and re-evaluations were all significant (p < .001). Specifically, for the short retest interval, we found adequate to good correlations for the WMS-IV-NL index scores (ranging from r = .74 for VMI Adult Battery to r = .86 for AMI Older Adult Battery), with the exception of the VWMI in the Adult Battery (r = .59). With the exception of the VWMI in the Adult Battery, all reliability coefficients are comparable to the test-retest reliabilities reported in the US WMS-IV Technical Manual (p > .05) (Holdnack & Drozdick, 2009). For the long retest interval, we found somewhat lower (but not statistically different) correlations on most of the index scores than after a short retest interval (ranging from r = .56 for VWMI Adult Battery to r = .77 for VMI Adult Battery) (see Table 2).

Practice effect
Adult Battery. Table 2 shows the mean WMS-IV-NL Adult Battery index scores across time for both groups (i.e., short- and long-term intervals). The groups (Interval: short vs. long) did not differ significantly at the baseline measures.
The mixed factor MANOVA revealed a significant main effect for Interval (F(1, 112) = 9.13, p < .003, ηp² = .08), with participants who were re-evaluated after the short retest interval performing significantly higher than participants who were re-evaluated after the long retest interval (p < .001). In addition, a main effect of Time (F(1, 112) = 48.48, p < .001, ηp² = .30) was observed. Contrast analyses revealed that overall, participants performed better at the re-evaluation than at the baseline measure (p < .001). No significant main effect was found for Index Scores (F(2.092, 234.345) = 2.42, ηp² = .02). Additionally, the interaction effect of Interval × Index Scores was not significant, F(2.092, 268.207) = 3.15, ηp² = .03. Interval and Time showed a significant interaction, F(1, 112) = 35.74, p < .001, ηp² = .24. That is, participants who were re-evaluated after the short retest interval performed better the second time (p < .001), whereas the performance of participants who were re-evaluated after the long retest interval did not differ between the two assessments. Also, the interaction between Time and Index Scores was significant, F(2.395, 268.207) = 3.15, p < .036, ηp² = .03. The three-way interaction Interval × Time × Index Scores was not significant, F(2.395, 268.207) = 2.22, p = .101, ηp² = .02. Subsequently, Table 2 shows that with the exception of the VWMI, performance on all index scores increased significantly from baseline to re-evaluation after a short interval. Specifically, the overall group showed a mean increase of approximately 10, 9, 4, 11, and 10 points on AMI, VMI, VWMI, IMI, and DMI, respectively. In contrast, the performance on all WMS-IV-NL index scores did not differ between the baseline and re-evaluation assessment after a long interval (mean overall group difference between −1 and 1 points). Table 3 shows the percentages of participants gaining 5, 10, 15, 20, 25, and 30 points per Index Score.
Older Adult Battery. Table 2 shows the mean WMS-IV-NL Older Adult Battery index scores across time for both groups (i.e., short- and long-term intervals). A significant main effect of Interval (short vs. long) was found for the baseline measures, F(4, 111) = 3.24, p < .015, ηp² = .10, with the long-term retest group performing significantly higher at the baseline measures of AMI, IMI, and DMI.
The mixed factor MANOVA revealed a significant main effect for Time (F(1, 110) = 17.69, p < .001, ηp² = .14). Contrasts revealed that participants performed better at the re-evaluation than at the baseline measure (p < .001). No significant main effects were found for either Interval, F(1, 110) = .08, ηp² < .001, or Index Scores, F(1.368, 150.533) = .43, ηp² < .004. Additionally, the interaction effect of Interval × Index Scores was not significant, F(1.368, 150.533) = 1.55, ηp² = .01. Interval and Time showed a significant interaction, F(1, 110) = 27.73, p < .001, ηp² = .18. That is, participants who were re-evaluated after the short retest interval performed better the second time (p < .001), whereas the performance of participants who were re-evaluated after the long retest interval did not differ between the two assessments. Also, the interaction between Time and Index Scores was significant, F(1.662, 182.782) = 5.48, p < .008, ηp² = .05. The three-way interaction Interval × Time × Index Scores was not significant, F(1.662, 182.782) = 1.21, ηp² = .01. Furthermore, Table 2 shows that the performance on all index scores increased significantly from baseline to re-evaluation after a short interval. Specifically, the participants showed a mean overall group increase of approximately 6, 9, 8, and 8 points on AMI, VMI, IMI, and DMI, respectively. In contrast, the performance on all index scores did not differ between the baseline and re-evaluation assessment after a long interval (overall group mean differences between −4 and 2 points). Table 3 shows the percentages of participants gaining 5, 10, 15, 20, 25, and 30 points per Index Score.

Regression-based measures for change
Results of the regression analyses for predicting retest scores on the WMS-IV-NL index scores are provided in Table 4. In addition, regression-based results for the WMS-IV-NL subtest scores after a short- and a long-term retest interval can be found in Supplementary Tables 2 and 3. Baseline performance was a significant predictor of the retest score for all WMS-IV-NL index scores in both batteries after both short-term and long-term retest intervals. The other predictors were only entered in a selection of equations (see Table 4): age accounted for <5% of the statistical variance in the two equations; and sex accounted for <5.2% of the statistical variance in the equation for VWMI in the Adult Battery after a long retest interval. Moreover, the variables test-retest interval and education level did not meet the criteria for inclusion as predictor in any of the regression equations. In order to calculate the predicted retest score, clinicians may use the regression equations provided in Table 4.
Next, the following equation was used to calculate standardized z-scores: $z = (Y_o - Y_p) / SE_{est}$, where $Y_o$ is the observed retest score, $Y_p$ is the predicted retest score, and $SE_{est}$ is the standard error of the estimate from the regression analysis. These z-scores provide individual level determination of change, and z-scores exceeding ±1.64 are significant with a 90% confidence interval and those exceeding ±1.96 are significant with a 95% confidence interval. The base rates of these confidence intervals for the WMS-IV-NL index scores are presented in Table 5.
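Applying the z-score formula above to a single case could look like the following sketch. The intercept, slope, and standard error of the estimate used here are hypothetical placeholders (the actual values are those in Table 4), and the function names are illustrative.

```python
def srb_z_score(observed, predicted, se_est):
    """Standardized difference between observed and predicted retest score."""
    return (observed - predicted) / se_est


def classify_change(z, critical=1.64):
    """Label a z-score as decline, stable, or improvement.

    critical = 1.64 corresponds to the 90% confidence interval,
    critical = 1.96 to the 95% confidence interval.
    """
    if z < -critical:
        return "decline"
    if z > critical:
        return "improvement"
    return "stable"


# Hypothetical equation and scores; the real coefficients are in Table 4.
intercept, slope, se_est = 20.0, 0.85, 8.0
baseline, observed_retest = 100, 90

predicted_retest = intercept + slope * baseline
z = srb_z_score(observed_retest, predicted_retest, se_est)
print(f"z = {z:.3f}")
print("90% interval:", classify_change(z, 1.64))
print("95% interval:", classify_change(z, 1.96))
```

In this invented case the observed retest score lies 15 points below the predicted score, which counts as a decline under the 90% interval but remains within the bounds of the 95% interval. The example shows how the choice of interval affects the classification of borderline cases.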

DISCUSSION
The present study provides test-retest reliability estimates, examines practice effects, and presents SRB change norms for the WMS-IV-NL Adult and Older Adult Batteries using large independent samples of healthy participants who were re-assessed either after a short retest interval (3-16 weeks) or a long retest interval (12-24 months). To our knowledge, this study is the first to examine the magnitude of test-retest reliability and practice effects for the WMS-IV over time intervals longer than three months.
Consistent with the results reported in the Technical Manual of the US WMS-IV (Holdnack & Drozdick, 2009), short-term test-retest reliability estimates for most of the WMS-IV-NL index scores were adequate to good, with the exception of a low test-retest reliability for the VWMI in the Adult Battery. Long-term test-retest reliability coefficients were generally lower than the short-term estimates, which is in agreement with the notion that test-retest reliability decreases as the retest interval becomes longer (Domino & Domino, 2006). With respect to previous versions of the WMS, poor test-retest reliability estimates were reported after a longer retest interval for the WMS (Halstead-Reitan Neuropsychological Test Battery/WMS: Dikmen et al., 1999) and WMS-III (Lo et al., 2012). Furthermore, similar findings were demonstrated for the Auditory Verbal Learning Test (Geffen, Butterworth, & Geffen, 1994; Uchiyama et al., 1995) and Hopkins Verbal Learning Test (Rasmusson, Bylsma, & Brandt, 1995).
As expected, practice effects for the WMS-IV-NL index scores were most prominent after a short retest interval. Upon re-examination after such short time periods, healthy participants seem to benefit greatly from the first administration. In agreement with previous work (Basso et al., 2002; Holdnack & Drozdick, 2009; Lo et al., 2012; Wechsler, 1997), no significant practice effects were observed for the working memory measure. Furthermore, we found that practice effects were more pronounced in the Adult Battery than in the Older Adult Battery (McCaffrey et al., 1993; McCaffrey & Westervelt, 1995; Mitrushina & Satz, 1991; Tulsky & Zhu, 1997). Presumably, healthy participants with optimal memory function are able to remember specific stimuli of the episodic memory subtests or effective test-taking strategies which may have resulted in performance increments. Furthermore, they may have benefitted from familiarity with the test procedure (Anastasi & Urbina, 1997; Calamia et al., 2012).

Notes: SEest = standard error of the estimate; β = unstandardized beta (slope); test interval was measured in days; age was measured in years; education level was coded 1: low, 2: average, 3: high; sex was coded 1: male, 2: female. All index equations use age-adjusted standard scores.

S40
ZITA BOUMAN ET AL.
Our findings extend those of previous research (Holdnack & Drozdick, 2009;Holdnack et al., 2013) in that we examined repeated assessments of the WMS-IV-NL Adult and Older Adult Batteries after a long retest interval, whereas previous studies only used re-assessments up to three months. We found no evidence for practice effects for the long-term retest group. That is, performance on all WMS-IV-NL index scores remained stable across the baseline and re-evaluation after a long interval from 12 to 24 months (overall group mean differences between −1 and 1). This suggests that the performance on the WMS-IV-NL after a long retest-interval of 12-24 months is influenced to a lesser extent by factors such as recollection of specific test items, test-taking strategies, or familiarity with the test-occasion. In the future, it would be interesting to evaluate whether such a long time interval may also diminish effects of practice in other neuropsychological measures.
For clinicians who are interested in determining whether a patient's change in performance on the WMS-IV-NL index scores is clinically meaningful and reliable, we have provided SRB equations in Table 4. These SRB equations correct for practice effects using an individual's baseline performance as a predictor of the retest WMS-IV-NL index score (Temkin et al., 1999). Moreover, it is possible to correct for other variables that could potentially impact memory performance at re-evaluation, such as the test-retest interval, age, education level, and sex. One could argue that simple reliable change approaches that reveal cut-offs may be more easily used in clinical practice, but they do not correct for practice effects, regression to the mean, and other potential predictors. Furthermore, another advantage of SRB change norms is that they are converted into z-scores, i.e., a common metric that allows us to make direct comparisons of changes in performance across different neuropsychological tests.
Also, we took into account other variables, such as test-retest interval and the demographic variables age, education level, and sex. These variables may affect repeated assessments using memory tests. Our results show that these variables did not have significant effects on most of the regression equations in our study. These findings are in agreement with several previous studies reporting that baseline measures alone predict a large amount of the variance and that subsequent variables (e.g., retest interval and demographics) predict none or only small amounts of the variance.
One potential confound of the current study is that the short interval group was assessed by the same examiner twice, whereas the assessments in the long-term group were mainly performed by two different examiners. However, because there were no differences in gains when the WMS-IV-NL was administered twice by the same examiner or by two different examiners, it is unlikely that this has influenced the results. Furthermore, assignment of participants to either the short or long retest interval group was done pseudo-randomly; first, the short-interval substudy was performed by asking participants in the normative sample whether they were willing to take part in this test-retest study (based on stratification criteria and availability). Subsequently, the long-interval substudy was performed in a similar manner (only excluding participants who already took part in the short-delay study).
A potential limitation of our study is that we found significant differences in the baseline measures of the short-term and long-term retest groups for the WMS-IV-NL Older Adult Battery. Because of the pseudo-random group assignment, we cannot think of a potential bias that could have caused these baseline differences. Note, however, that these baseline differences are taken into account statistically in the newly performed regression analyses. Another limitation of this study is the broad time window for the long-interval group. It would be a recommendation for further research to examine a retest interval as a continuous variable, making it possible to determine whether the gain in test performance reaches an asymptote and is no longer clinically meaningful.
A further limitation of the WMS-IV is that no alternate forms are available for this memory battery. Alternate forms of episodic memory tests may reduce the confounding practice effects as they prevent the recollection of specific items (Benedict, 2005; Benedict & Zgaljardic, 1998). Examples of memory tests with alternate forms are the California Verbal Learning Test-II (Delis, Kramer, Kaplan, & Ober, 2000) and the Rivermead Behavioral Memory Test-Third Edition (Wilson et al., 2008). Although it is well known that alternate forms cannot eliminate practice effects, because other factors such as recollection of test demands, effective test-taking strategies, or familiarity with the test procedure influence the effects of practice (Anastasi & Urbina, 1997; Beglinger et al., 2005; Calamia et al., 2012), it is suggested to use memory tests with alternate forms when a short retest interval is required.
In conclusion, the findings of this study show that the WMS-IV-NL has adequate test-retest reliability. Since the authorized Dutch version of the WMS-IV and the original US version are highly equivalent, our results are likely to apply to the US version and other language versions of the WMS-IV as well. In line with this notion, the findings corroborate previously observed practice effects on the WMS-IV after a short time interval (Holdnack & Drozdick, 2009), but no evidence for practice effects was found after a long time interval of 12-24 months. Furthermore, practice effect-adjusted SRB change norms were provided for the WMS-IV-NL.

ACKNOWLEDGMENTS
We thank Pearson Assessment B. V., Amsterdam, The Netherlands, for authorizing and funding the development of the WMS-IV-NL. We thank Aileen Bremer, Bart Kral, Elaha Naimi, Jessyca Villier, Julius Ströhm, Karina Burger, Kimberly van der Wissel, Rens van Meegen, and Sophie Jellema for their assistance in the data collection.

DISCLOSURE STATEMENT
No potential conflict of interest was reported by the authors.