Regression-based normative data for the D-KEFS Color-Word Interference Test in Norwegian adults ages 20–85

Abstract Objective: The Delis-Kaplan Executive Function System (D-KEFS) Color-Word-Interference Test (CWIT; AKA Stroop test) is a widely used measure of processing speed and executive function. While test materials and instructions have been translated to Norwegian, only American age-adjusted norms from D-KEFS are available in Norway. We here develop norms in a sample of 1011 Norwegians between 20 and 85 years. We provide indexes for stability over time and assess demographic adjustments applying the D-KEFS norms. Method: Participants were healthy Norwegian adults from Center for Lifespan Changes in Brain and Cognition (LCBC) (n = 899), the Dementia Disease Initiation (n = 77), and Oslo MCI (n = 35). Using regression-based norming, we estimated linear and non-linear effects of age, education, and sex on the CWIT 1-4 subtests. Stability over time was assessed with intraclass correlation coefficients (ICC). The normative adjustment of the D-KEFS norms was assessed with linear regression models. Results: Increasing age was associated with slower completion on all CWIT subtests in a non-linear fashion (accelerated lowering of performance with older age). Women performed better on CWIT-1&3. Higher education predicted faster completion time on CWIT-3&4. The original age-adjusted norms from D-KEFS did not adjust for sex or education. Furthermore, we observed significant, albeit small effects of age on all CWIT subtests. ICC analyses indicated moderate to good stability over time. Conclusion: We present demographically adjusted regression-based norms and stability indexes for the D-KEFS CWIT subtests. US D-KEFS norms may be inaccurate for Norwegians with high or low educational attainment, especially women.


Introduction
The basic premise of Stroop tests is to measure an individual's ability to suppress a well-learned automatic response (i.e., word reading) in favor of an unfamiliar and incongruent task (i.e., naming the ink color in which incongruent color names are printed) (Rabin et al., 2005; Van der Elst et al., 2006). Inhibiting the automatic response is demanding, leading to slower speed and lower accuracy on the incongruent task. This discrepancy is referred to as the "Stroop interference effect." While the exact nature of the cognitive processes responsible for the Stroop effect is still discussed, the effect is often regarded as a measure of the ability to inhibit cognitive interference and maintain focused attention (Scarpina & Tagini, 2017). The prefrontal cortex is highly involved when the Stroop test is performed (Duchek et al., 2013; Keifer & Tranel, 2013; Milham et al., 2002; Miller & Cohen, 2001), and clinical studies have shown that the Stroop interference effect is more pronounced in clinical populations, including patients with frontal lobe dysfunctions (Stuss et al., 2001), anorexia (Ferro et al., 2005), traumatic brain injury (Ben-David et al., 2011), substance use disorders (Streeter et al., 2008), mild cognitive impairment due to Parkinson's disease (Bezdicek et al., 2015), and dementia of various etiologies (Bayard et al., 2011; Clark et al., 2012).
In cognitively healthy adults, previous research has indicated that a higher level of education is related to better test performance (Brugnolo et al., 2016; Ktaiche et al., 2022; Van der Elst et al., 2006). Young adults consistently perform better than older adults (Brugnolo et al., 2016; Zalonis et al., 2009). Findings regarding sex differences are inconsistent, with some studies reporting slight differences in favor of women (Magnusdottir et al., 2021; Van der Elst et al., 2006), while others find no significant difference (Brugnolo et al., 2016; Ktaiche et al., 2022). Some studies have found significant interaction effects on Stroop paradigms. Van der Elst et al. (2006) reported that age-related decline was stronger for individuals with less education. On the other hand, Magnusdottir et al. (2021) found that individuals with more education exhibited a stronger age-related decline.
The Stroop test exists in several versions, such as the Victoria version (Regard, 1981), the Golden version (Scarpina & Tagini, 2017), and the Color-Word Interference Test (CWIT) from the Delis-Kaplan Executive Function System (D-KEFS) (Delis et al., 2001). All tasks yield variations of the Stroop interference effect but differ in how the main outcomes are measured. The Victoria version and Golden version use the number of correct responses in a fixed amount of time as the outcome. In comparison, the CWIT uses time to completion on a fixed number of test items as the main outcome. Furthermore, the CWIT features a unique fourth condition called inhibition/switching, in which participants are asked to alternate between inhibition and reading color-words. This condition may be more challenging than the classic Stroop color-word inhibition task for some individuals (Lippa & Davis, 2010).
A recent review commissioned by the Norwegian Psychologist Association, the Norwegian Directorate of Health, and the Norwegian Institute of Public Health (Ryder, 2021) indicated that the D-KEFS test battery was among the most popular tests used by clinicians in Norway. Previous studies have also indicated that as many as 91% of Norwegian neuropsychologists use a version of the Stroop test (Egeland et al., 2016). Ryder (2021) reports that despite its popularity, Norwegian norms, in addition to validity and reliability measures, are lacking for the D-KEFS battery. The D-KEFS battery was consequently identified as a priority for validation and norming (Ryder, 2021). To our knowledge, there are no norms other than the original American age-adjusted norms presented in the D-KEFS manual by Delis et al. (2001) available for clinicians and researchers in Norway. Thus, the main objective of this study was to investigate the effect of demographic variables on CWIT performance and provide normative data for the D-KEFS CWIT in a Norwegian sample of cognitively healthy adults. Second, we assess the normative adjustment of the original age-adjusted norms from D-KEFS in the same sample of cognitively healthy Norwegian adults. Lastly, for a sub-set of the sample with data from one follow-up testing, we provide indexes for stability over time on the D-KEFS CWIT.

Normative samples
To develop norms on the Color-Word Interference Test (CWIT) we included healthy participants from three research projects in Norway: studies from the Center for Lifespan Changes in Brain and Cognition (LCBC) (n = 899), the Dementia Disease Initiation study (DDI) (n = 77), and the Oslo MCI study (n = 35). Descriptive statistics from the normative sample are presented in Table 1. Joint exclusion criteria for all studies were severe somatic or psychiatric illnesses that might influence cognitive functioning. All participants underwent an interview screening for current or previous signs of neurological disorders, epilepsy, stroke, and psychiatric disorders. Participants reporting a subjective experience of cognitive decline, such as memory complaints, were excluded. The Mini Mental State Examination (MMSE) was used to assess global cognitive functioning but was not used to exclude participants in the current study (Table 1). Total scores on the MMSE were distributed as follows: 5 participants (0.5%) scored 24; 3 participants (0.3%) scored 25; 23 participants (2.4%) scored 26; and the remaining 940 participants with available MMSE (~96.8%) scored between 27 and 30. The inclusion criterion was age 20-85 years. All participants reported Norwegian as their native language and almost all participants were of European ethnicity.
The LCBC (Fjell et al., 2019) is a multi-disciplinary research center based in Oslo, Norway, aimed at investigating normal trajectories of brain and cognition across the lifespan. Healthy participants from LCBC were drawn from three longitudinal sub-projects within the LCBC: Neurocognitive development (Tamnes et al., 2013), Neurocognitive plasticity (de Lange et al., 2018), and Biological Predictors of Memory (Storsve et al., 2014). Participants were recruited through newspaper advertisements and through local universities and workplaces. Most participants from LCBC were screened for brain abnormalities on MRI scans, and participants were excluded if scans showed signs of pathology. A subset of the LCBC sample (n = 335) had available follow-up examinations (average test-retest interval 3.4 years) on the CWIT, allowing for test-retest analysis to assess the stability of scores over time. All healthy participants in the test-retest sample fulfilled inclusion criteria and none of the exclusion criteria at baseline testing. Participants in the test-retest sample were not excluded based on these criteria at follow-up examinations.
DDI is a Norwegian multi-center longitudinal study on early phases of Alzheimer's disease and other neurodegenerative diseases (Fladby et al., 2017). The inclusion criterion in the DDI study was age 40-80 years. The Oslo MCI study is the predecessor of the ongoing DDI study and followed the same study protocol as DDI. Assessments in Oslo MCI were performed between 2004 and 2012, and in DDI from 2012 to 2022. Healthy controls from DDI and Oslo MCI were either spouses of symptom group participants, volunteers recruited from advertisements in news outlets, or patients recruited at an orthopedic ward.

Color-word interference test (CWIT) administration procedures
The CWIT consists of four subtasks. CWIT-1 requires color-naming. The participant is asked to verbally identify the color of solid-colored squares from a sheet of paper. The squares are colored red, green, or blue, and are shown in a random order for a total of 50 items. CWIT-2 requires color-reading. In this subtask, participants are shown the color names "red," "green," "blue" (in Norwegian "rød," "grønn," "blå") printed in black ink. The participants are asked to read the color names one-by-one. CWIT-3 (inhibition) corresponds to the classic Stroop task, in which color names are printed with incongruent ink (e.g., "red" printed in green ink). Participants are asked to verbally identify the color of the ink (thus inhibiting the automated response of reading the color name). CWIT-4 is the inhibition/switching condition. Again, color names are printed with incongruent ink, but approximately fifty percent of the items are enclosed within a black frame. The participant is asked to perform the same task as before (i.e., name the printed color of the ink), except for stimuli that are enclosed within the black frames. Here, the participants are instructed to read the color names. For all subtasks, the participant is asked to respond one-by-one, in succession from left to right, as quickly as possible without making errors. All subtasks are preceded by a brief untimed practice trial consisting of a 10-item sample of the pertinent subtest. The stimuli are organized on laminated sheets in A4 size. Items are arranged in 5 rows of 10 items, totaling 50 items for each subtask. Time to completion and errors are recorded. Errors are recorded as either "corrected" or "uncorrected" by the participant. Difficulty discerning colors and visual impairments affect task performance on the CWIT, and it is important for test administrators to be attentive to any color-blindness or visual impairment in participants. Administration of the CWIT was terminated by the test administrator if participants reported difficulty discerning colors associated with color-blindness. Administration procedures and standardized instructions for all tasks are described in the D-KEFS manual (Delis, 2005; Delis et al., 2001). Standardized commercially available materials for the D-KEFS CWIT in Norwegian were purchased from Pearson Clinical Norway.

Regression norming procedure
We first conducted exploratory analyses to evaluate CWIT outcomes and relations to demographic variables before fitting normative models. Pearson correlations indicated significant relationships between age, education, and sex with CWIT 1-4 time to completion (Table 2). We then assessed the distribution of each CWIT subtest for normality; this indicated significant positive skewness and kurtosis due to slow completion times for a small part of the normative sample. To normalize measures, we transformed CWIT 1-4 outcomes to a scaled score distribution (M = 10, SD = 3), similar to Espenes et al. (2020), Kirsebom et al. (2019), and Testa et al. (2009). Measures were normalized using the package "CTT" in R (Willse, 2022). Raw scores were transformed to scaled scores by first determining the percentile ranks of raw scores on CWIT 1-4. Then, percentile ranks were converted to scaled scores in reversed order so that higher scaled scores corresponded to faster completion times. For instance, the 50th percentile corresponds to scaled score 10, and the 99th percentile corresponds to scaled score 17. Raw score to scaled score conversions are shown in Table 3. Univariate analyses showing the relationships between the predictors age and years of education and CWIT 1-4 scaled scores are shown in appendix Figures A1 and A2.
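The percentile-to-scaled-score conversion described above can be sketched as follows. This is a minimal illustration and not the "CTT" implementation used in the study: the function name `scaled_scores`, the mid-rank treatment of ties, and the 1-19 clipping range are assumptions made for the example.

```python
from statistics import NormalDist

def scaled_scores(times, mean=10.0, sd=3.0, lo=1, hi=19):
    """Map completion times (seconds) to normalized scaled scores.

    Ranks are reversed so that faster times receive higher scaled scores.
    Tie handling and the clipping range [lo, hi] are assumptions.
    """
    n = len(times)
    std_normal = NormalDist()
    result = []
    for t in times:
        slower = sum(1 for x in times if x > t)  # scores this time beats
        tied = sum(1 for x in times if x == t)   # includes the score itself
        # Mid-rank percentile; always strictly between 0 and 1
        pct = (slower + 0.5 * tied) / n
        z = std_normal.inv_cdf(pct)
        score = round(mean + sd * z)
        result.append(max(lo, min(hi, score)))
    return result

# Hypothetical completion times in seconds
print(scaled_scores([20, 25, 30, 35, 40]))
```

Under this mapping the median time lands on scaled score 10, and a score at the 99th percentile lands near 17 (10 + 3 × 2.33), in line with the correspondence stated in the text.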
To produce the regression-based norms we performed multiple regression analyses on the CWIT 1-4 scaled scores with age, education, and sex as predictors. We also assessed squared and cubic effects, and interaction terms. Education and age were centered around the mean (i.e., years of education − 15.5 and age − 46.2) to avoid issues with multicollinearity. For the model selection process, we proceeded similarly to Van der Elst et al. (2006). We started with a full model including all terms and compared it with simplified models using ANOVAs for total explained variance (R²), p-value, and the Bayesian information criterion (BIC). The simplified model was preferred if p ≥ .01. The simplified model was subsequently used as reference for further simplification using the same alpha level criterion of α = .01. Regression models were reduced until the simplified model explained significantly less variance than the reference model (i.e., p < .01). Lastly, we attempted to exchange squared terms in the final models with smooth functions using generalized additive models (GAMs). The model fit of the GAMs was compared to the linear models following the same procedure as described. BIC and ANOVAs favored the linear models with squared terms, and the smooth functions did not improve model fit to a substantial degree. After reaching the model structures with the best fit for CWIT 1-4 subtests (Table 4), we assessed assumptions of normality and homoscedasticity using plots of standardized predicted scores and standardized residuals (James et al., 2021). Outliers and influential cases were visually assessed using plots of Cook's distance and standardized residuals. Visual inspection revealed no markedly diverging observations; thus, no observations were deleted based on statistical criteria. All analyses were conducted using R version 4.2.1 and the packages "dplyr" (Wickham et al., 2022), "CTT" (Willse, 2022), "Psych" (Revelle, 2023), and "mgcv" (Wood & Wood, 2015).
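As a rough illustration of comparing a model with and without a squared age term, the sketch below centers age, fits ordinary least squares via the normal equations, and compares models by BIC. It covers only the BIC part of the selection procedure (the ANOVA comparisons at α = .01 are not reproduced), the Gaussian BIC is computed up to an additive constant, and the helper names `design`, `ols_rss`, and `bic` as well as the data are hypothetical.

```python
import math

def design(ages, mean_age, degree):
    """Design matrix: intercept plus powers 1..degree of mean-centered age."""
    return [[(a - mean_age) ** p for p in range(degree + 1)] for a in ages]

def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit (normal equations)."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        b[c], b[p] = b[p], b[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * ac for a, ac in zip(A[r], A[c])]
                b[r] -= f * b[c]
    beta = [b[i] / A[i][i] for i in range(k)]
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def bic(X, y):
    """Gaussian BIC up to an additive constant: n*ln(RSS/n) + k*ln(n)."""
    n, k = len(y), len(X[0])
    return n * math.log(ols_rss(X, y) / n) + k * math.log(n)

# Hypothetical data with an accelerating age-related decline
ages = list(range(20, 90, 5))
mean_age = sum(ages) / len(ages)
y = [12 - 0.05 * (a - mean_age) - 0.002 * (a - mean_age) ** 2
     + 0.3 * math.sin(a) for a in ages]
print(bic(design(ages, mean_age, 1), y), bic(design(ages, mean_age, 2), y))
```

A lower BIC for the degree-2 design indicates that the squared term earns its extra parameter on these illustrative data.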

Testing the equality of age coefficients on CWIT subtests
In addition to the regression analyses described previously, we considered whether the effect of age differed significantly between CWIT subtests. For instance, if age significantly predicts scores on one subtest but not on another, this does not imply that the effect of age differs between the subtests (Gelman & Stern, 2006). To test the equality of coefficients we fitted multivariate models (seemingly unrelated regressions) reproducing the normative analyses in Table 4 for two subtests at a time. Then, we tested whether the unstandardized beta coefficients for age obtained through this analysis were equivalent in both models using Z-tests (Table A1). For these analyses we used an alpha level criterion of α = .01 to reject the null hypothesis that the difference between the coefficients is zero (i.e., the coefficients are equal). Multivariate models were fitted because this allows for the calculation of standard errors that are adjusted for the covariance between beta coefficients. Analyses were conducted using R version 4.2.1 and the package "Systemfit" (Henningsen & Hamann, 2007), and Z-tests were conducted using the package "Multcomp" (Hothorn et al., 2016).
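The Z-test for the difference between two coefficients, with a standard error adjusted for their covariance, can be sketched as below. This is a generic illustration rather than the "Systemfit"/"Multcomp" workflow; the function name `coef_difference_z` and all numeric inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def coef_difference_z(b1, se1, b2, se2, cov12=0.0):
    """Two-tailed Z-test of H0: b1 - b2 = 0.

    cov12 is the covariance between the two coefficient estimates,
    which a jointly fitted (multivariate) model makes available.
    """
    se_diff = sqrt(se1 ** 2 + se2 ** 2 - 2 * cov12)
    z = (b1 - b2) / se_diff
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical age coefficients for two subtests
z, p = coef_difference_z(-0.10, 0.01, -0.05, 0.01, cov12=0.00002)
print(round(z, 3), p < 0.01)
```

Ignoring a positive covariance (setting `cov12=0`) would inflate the standard error of the difference and make the test conservative, which is the motivation for fitting the subtests jointly.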

Errors on the Color-Word Interference Test
To provide normative estimates for errors on CWIT-3 and 4 we summed corrected and uncorrected errors to a total error score. A total of 936 participants had data on errors. Unfortunately, as we did not record errors on CWIT-1 and 2, we do not provide data regarding the distribution of errors on these subtests. Preliminary analyses indicated that errors on CWIT-3 and 4 were zero-inflated and over-dispersed, as most participants did not make any errors during these subtests. Thus, the variables did not follow a normal distribution suitable for linear regression analysis. To assess the need for demographic adjustment or stratification of the error measures, we investigated associations between errors on CWIT-3 and 4 and age, education, and sex using Spearman's rho and Mann-Whitney U tests (Table 2). Results from these analyses indicated a weak association between errors on CWIT-3 and 4 and demographic variables. We therefore provide percentiles based on the inverse cumulative distribution for errors based on the entire normative sample, unstratified according to demographic variables (i.e., unadjusted for age, education, or sex). We then dichotomized the sample into participants who made 0 errors and participants who made ≥ 4 errors on either CWIT-3 or 4 to see if these groups differed in years of education, age, or sex. In total, 14.1% of the sample made ≥ 4 errors on either CWIT-3 or 4.
Thus, ≥ 4 errors on either CWIT-3 or 4 corresponded to a "low average" score according to neuropsychological nomenclature (Guilmette et al., 2020). We then assessed whether errors on CWIT-3 and 4 were related to performance on the task. First, we compared completion time on CWIT-3 and 4 between individuals who made ≥ 4 errors and individuals who made 0 errors using two-tailed independent samples t-tests without assumptions of equal variance. Further, we correlated errors on CWIT-3 and 4 with time to completion on CWIT-3 and 4 to check for a linear relationship between errors and task performance.
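One reading of percentiles based on the inverse cumulative distribution is the percentage of the sample making at least a given number of errors, so that more errors map to a lower percentile. The sketch below follows that reading; it is an assumption about the tabulation, and the function name `error_percentiles` and the data are hypothetical.

```python
from collections import Counter

def error_percentiles(errors):
    """Percentage of the sample making at least each observed error count."""
    n = len(errors)
    counts = Counter(errors)
    table = {}
    at_least = n  # everyone makes at least the smallest observed count
    for e in sorted(counts):
        table[e] = 100.0 * at_least / n
        at_least -= counts[e]
    return table

# Hypothetical total error scores (zero-inflated, as described above)
print(error_percentiles([0, 0, 0, 0, 0, 0, 1, 1, 2, 4]))
```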

Calculating normative performance using regression-based norms
To determine the normative performance for a given individual ($i$) on a given test ($j$), we first calculate the predicted scaled score using the regression equations presented in Table 4. These equations take the following form. Let $D$ be a set of demographic predictors, where $d_n$ represents the $n$-th element of $D$ and $d_{ni}$ its value for individual $i$:

$$\text{predicted scaled score}_{ij} = \text{intercept}_j + \sum_{n} \beta_{nj}\, d_{ni}$$

Then, the individual's raw score on the CWIT is converted to a scaled score using the raw score to scaled score conversion in Table 3. This reflects the individual's obtained scaled score. Lastly, the Z-score of individual ($i$) on test ($j$) is computed as

$$Z_{ij} = \frac{\text{obtained scaled score}_{ij} - \text{predicted scaled score}_{ij}}{SD(\text{residual})_j}$$

which can be further converted to a T-score by

$$T_{ij} = (Z_{ij} \times 10) + 50$$
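The three steps above can be sketched in code. The centering constants (46.2 and 15.5) come from the text, whereas the coefficients, the residual standard deviation, the predictor names, and the sex coding in the example are hypothetical placeholders and not the values from Table 4.

```python
def predicted_scaled_score(intercept, betas, predictors):
    """Intercept plus the sum of coefficient * predictor value."""
    return intercept + sum(betas[name] * value
                           for name, value in predictors.items())

def normative_scores(obtained, predicted, residual_sd):
    """Return (Z, T) for an obtained scaled score."""
    z = (obtained - predicted) / residual_sd
    return z, z * 10 + 50

# Hypothetical coefficients; NOT the Table 4 estimates
betas = {"age_c": -0.08, "age_c_sq": -0.001, "edu_c": 0.05, "sex": 0.5}
age_c = 70 - 46.2    # age centered on the normative mean
edu_c = 17 - 15.5    # education centered on the normative mean
predictors = {"age_c": age_c, "age_c_sq": age_c ** 2,
              "edu_c": edu_c, "sex": 0}  # sex coding is an assumption

pred = predicted_scaled_score(10.0, betas, predictors)
z, t = normative_scores(obtained=9, predicted=pred, residual_sd=2.5)
print(round(pred, 2), round(z, 2), round(t, 1))
```

Because the predicted score already reflects the individual's demographics, the resulting T-score expresses performance relative to demographically similar peers rather than the whole sample.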

Assessing established American norms from D-KEFS in the Norwegian sample
We computed T-scores based on the original age-adjusted norms from the D-KEFS manual (Delis et al., 2001) on CWIT 1-4 for all participants (n = 1011). This resulted in four T-scores for each participant, one each for CWIT-1, CWIT-2, CWIT-3, and CWIT-4. To assess whether the original age-adjusted norms from D-KEFS sufficiently adjusted for demographic variables in the Norwegian sample, we performed multiple regression analyses with CWIT 1-4 T-scores as dependent variables. Age, years of education, and sex were used as predictors for all analyses. A significant beta coefficient for any predictor was interpreted as a maladjustment in the norms. For these analyses we used a conventional alpha level criterion of α = .05. For example, if years of education significantly explained variance in the T-scores, this was interpreted as indicating that the norms did not adequately correct for this demographic variable. Non-significant results were interpreted as an adequate adjustment. T-scores using the new Norwegian norms were calculated for all participants following the procedures detailed in the previous section. We then compared mean T-scores for all participants on the CWIT 1-4 using both the norms from D-KEFS and the new Norwegian norms. Mean T-scores on the CWIT 1-4 were compared using paired samples t-tests without the assumption of equal variances (Table 7). Plots of T-scores on CWIT-3 and CWIT-4 with fitted regression lines for the new Norwegian norms, the D-KEFS norms, and unadjusted T-scores are compared in Figure 1. Corresponding plots of T-scores on CWIT-1 and CWIT-2 are included in appendix Figure A3. Lastly, we compared the observed rate of participants scoring below a conventional cut-off (1.5 SD below the normative mean; T-score < 35) on CWIT 1-4 applying the original age-adjusted norms from D-KEFS and the Norwegian norms. Because the T-scores are expected to approximate a normal distribution, we used two-tailed one-proportion Z-tests to compare the observed rate in the samples with the expected base rate in a theoretical normal distribution (6.7%).
The Z-test evaluates whether the observed sample proportion differs from the theoretical proportion in the population. For these tests we computed the 99% confidence interval around the sample proportion, thereby using a significance level of α = .01 (Figure 2). To test if there were significant differences in proportions between the Norwegian norms and the original age-adjusted D-KEFS norms we used paired-samples proportion tests (asymptotic McNemar test without continuity correction) (Fagerland et al., 2014).
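The base-rate comparison and the paired proportion test can be sketched as follows. The expected base rate is the standard normal probability below Z = −1.5; the Wald form of the confidence interval is an assumption (the text does not state which interval was used), and the function names and counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Expected proportion below T = 35 (Z = -1.5) in a normal distribution
BASE_RATE = STD_NORMAL.cdf(-1.5)  # approximately 0.067

def one_proportion_z(k, n, p0):
    """Two-tailed one-proportion Z-test of k successes in n against p0."""
    z = (k / n - p0) / sqrt(p0 * (1 - p0) / n)
    return z, 2 * (1 - STD_NORMAL.cdf(abs(z)))

def wald_ci(k, n, level=0.99):
    """Wald confidence interval around the sample proportion (assumed form)."""
    p_hat = k / n
    half = STD_NORMAL.inv_cdf(0.5 + level / 2) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def mcnemar_asymptotic(b, c):
    """Asymptotic McNemar test without continuity correction.

    b and c are the discordant counts, e.g. below cut-off under one
    set of norms but not under the other, and vice versa.
    """
    z = (b - c) / sqrt(b + c)
    return z, 2 * (1 - STD_NORMAL.cdf(abs(z)))

# Hypothetical counts
print(round(BASE_RATE, 3))
print(one_proportion_z(40, 1000, BASE_RATE))
print(wald_ci(40, 1000))
print(mcnemar_asymptotic(30, 10))
```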

Norm calculator
To make the regression-based norms available and easy to use, we provide a free web-based tool that computes the regression equations and provides demographically adjusted T-scores for all CWIT subtests. The tool will be implemented as a self-contained HTML/JavaScript webpage but is temporarily available at (https://contattafiles.s3.us-west-1.amazonaws.com/tnt30503/ACkqU46CjUb0rss/cwit-calc.html) and is released as open source at (https://github.com/DDI-NO/cwit-calc) under Apache License, version 2.0.

Stability over time on the CWIT
A sub-set of the normative sample (n = 335) had available follow-up assessments, allowing for test-retest correlations assessing stability over time. The sample consisted of 207 women (62%) and 128 men (38%) with a mean age of 52.6 years (SD = 18.4) and 15.6 (SD = 2.9) years of education at baseline. To ensure that stability indexes remained unified and relevant for clinical practice, participants whose follow-up took place more than 5 years after baseline were excluded from the analysis (n = 22). Thus, the time between assessments varied between 1 and 5 years, with an average test-retest interval of 3.4 years (SD = 0.9). Intraclass correlation (ICC) estimates and 95% CIs were calculated based on a single rating, absolute-agreement, two-way mixed-effects model (Shrout & Fleiss, 1979). We specified a priori ranges for stability based on conventional reliability classifications from Koo and Li (2016). Values between 0.5 and 0.75 indicate moderate stability and values between 0.75 and 0.9 indicate good stability. To determine whether the difference in score on a CWIT subtest between baseline and follow-up obtained by an individual represents a significant difference, the score may be analyzed using the Reliable Change Index (RCI). In RCIs, the difference score is divided by the Standard Error of Measurement of the Difference, thus providing a standardized Z-score describing whether the change in score between baseline and follow-up is statistically significant (i.e., represents a reliable change) (Guhn et al., 2014). In Appendix Table A2, we provide readers with the necessary statistics to calculate RCIs themselves via the most common methods. For a review, please refer to Guhn et al. (2014).
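Both stability statistics can be sketched for two time points. The ICC below uses the standard mean-squares formula for a single-rating, absolute-agreement, two-way model; the RCI uses one common variant (standard error of the difference as √2 times the standard error of measurement), which is an assumption since the text covers several methods. Function names and data are hypothetical, and confidence intervals are not computed.

```python
from math import sqrt

def icc_a1(baseline, followup):
    """Single-rating, absolute-agreement ICC for two time points."""
    n, k = len(baseline), 2
    rows = list(zip(baseline, followup))
    grand = (sum(baseline) + sum(followup)) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(baseline) / n, sum(followup) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k / n * (ms_cols - ms_err))

def reliable_change_index(score_1, score_2, sd_baseline, reliability):
    """Difference score divided by the standard error of the difference."""
    sem = sd_baseline * sqrt(1 - reliability)  # standard error of measurement
    se_diff = sqrt(2) * sem                    # assumed variant
    return (score_2 - score_1) / se_diff

# Hypothetical scaled scores at baseline and follow-up
print(icc_a1([8, 10, 12, 14], [10, 12, 14, 16]))
print(reliable_change_index(9, 13, sd_baseline=3.0, reliability=0.75))
```

In the example, every follow-up score is exactly 2 points higher than baseline, yet the ICC is below 1: an absolute-agreement ICC penalizes systematic shifts such as practice effects, whereas a consistency ICC would not.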

Ethics
The Norwegian Regional Committees for Medical and Health Research Ethics (REK) approved the projects the current study draws upon. Guidelines in the Helsinki declaration of 1964 (revised 2013) and the Norwegian Health and Research Act were followed. All participants gave written informed consent and were informed of their right to withdraw, as well as potential risks and rewards involved with participation.
Tests of the equality of coefficients indicated that the effect of age was stronger on the complex trials CWIT-3 and CWIT-4 compared to CWIT-1 and CWIT-2 (Table A1). Furthermore, the effect of age was significantly weaker on CWIT-2 compared to all other subtests (p < .001). Figure A1 shows the linear and quadratic effect of age on all CWIT subtests in the normative sample between 20 and 85 years.
There was a weak but significant positive relationship between years of education and scores on CWIT-3 (b = 0.078, p = .006, partial R² = 0.8%) and CWIT-4 (b = 0.098, p < .001, partial R² = 1.2%). However, there were no significant associations between years of education and performance on the basic tasks CWIT-1 and CWIT-2 adjusted for sex and age. The relationship between CWIT scores and years of education is shown in Figure A2.

Calculating normative performance on CWIT-1 using regression-based norms
As an example, suppose that a 70-year-old man with 17 years of education completed the CWIT-1 in 35 s. The mean age in the normative group was 46.2 and the mean years of education was 15.5 (Table 1). First, we obtain the relevant coefficients from

Assessing established norms from D-KEFS in the Norwegian sample
As shown in Table 6, results from multiple regression analysis on T-scores calculated using the original age-adjusted D-KEFS norms indicated significant positive effects of age on all CWIT trials, meaning higher age predicted higher T-scores. As shown previously, women performed better than men on CWIT-1 and 3 in the Norwegian sample. However, the norms from D-KEFS did not account for this sex difference, and on average, women attained 2.3 and 1.4 higher T-scores compared to men on the CWIT-1 and 3, respectively (Table 6). Moreover, there was a significant positive association between years of education and CWIT-3 and CWIT-4 T-scores, where participants with higher levels of education received higher T-scores. The combined effect of demographic variables on the age-adjusted scores was low, ranging from 1.6% to 3.0% explained variance. Nevertheless, there were significant mean differences between the D-KEFS norms and the new Norwegian norms (Table 7). On all trials except CWIT-1, the D-KEFS norms produced T-scores that were too high relative to the expected mean value of T = 50. On CWIT-2 the average T-score using the D-KEFS norms was 52.1; T = 54.4 on CWIT-3; and T = 53.2 on CWIT-4. When utilizing the original age-adjusted norms from D-KEFS, the proportion of participants scoring 1.5 SD or more below the normative mean was significantly different from the expected base rate on all CWIT subtests (Figure 2). With the Norwegian norms, the proportions were not significantly different from the expected base rate, and the 99% CIs contained the expected base rate for all subtests (p > .01). Results from paired samples proportion tests showed significant differences between the estimated proportion of participants with scores 1.5 SD or more below the normative mean using the Norwegian norms and the original age-adjusted D-KEFS norms (p < .001) (Figure 2).

Stability over time on the CWIT
Intraclass correlation coefficients and 95% CIs are shown in Table 8. Based on the a priori specified ranges, all analyses indicated moderate to good stability in scores between baseline and follow-up using the Norwegian CWIT norms. Slightly lower estimates were obtained with the original D-KEFS norms.

Effects of demographics on the D-KEFS CWIT
We present normative data for the D-KEFS CWIT based on the performance of 1011 healthy Norwegians between 20 and 85 years of age. All four CWIT test scores were related to linear and quadratic effects of age, indicating a steepening trend towards lower scores for older participants. Quadratic effects of age have been reported on Stroop tests in similar samples spanning the entire adult range (Ktaiche et al., 2022; Van der Elst et al., 2006), but rarely in samples with more restrictive age spans (Bayard et al., 2011; Bezdicek et al., 2015; Magnusdottir et al., 2021; Seo et al., 2008; Tremblay et al., 2016). Consistent with most studies, we found that the basic subtests CWIT-1 (color naming) and CWIT-2 (word reading) were significantly less influenced by age compared to the complex inhibition trial (CWIT-3) and the inhibition/switching trial (CWIT-4) (Adólfsdóttir et al., 2014; Mitrushina et al., 2005).
Scores on the CWIT may decline with age due to a general age-related slowing of information processing (Salthouse, 1996) and specific deficits in executive functions like inhibitory control (Hasher & Zacks, 1988). Indeed, Adólfsdóttir et al. (2017) showed that higher age significantly predicted slower time to completion on CWIT-3 and 4 after adjusting for processing speed and performance on CWIT-1 and CWIT-2. In other words, when basic non-executive functions were regressed out, there was still an age effect on both CWIT-3 and CWIT-4, thereby implying that there was a specific factor associated with aging beyond generalized slowing. Delis et al. (2001) published contrast measures in the original D-KEFS norms to isolate executive components on the CWIT. However, these contrasts rely on simple subtraction between individual subtest scores, and it has been suggested that this approach might multiply the measurement errors on each test, leading to low reliability (Crawford et al., 2008). Unpublished data from the same Norwegian sample used in this study support this, and we hypothesize that a regression-based approach to isolate executive components could mitigate this problem. We therefore aim to develop norms on CWIT-3 and CWIT-4 adjusted for performance on basic tasks using a regression-based approach and compare test-retest reliability with the original contrast scores from D-KEFS in a separate paper.

Effects of education on CWIT scores
Education was significantly, albeit weakly, associated with scores on the CWIT-3 and CWIT-4 but was not significantly associated with scores on CWIT-1 and CWIT-2. This is in line with previous studies, where education has been reported to exert a strong influence on the complex Stroop inhibition trial (Bayard et al., 2011; Bezdicek et al., 2015; Brugnolo et al., 2016; Ktaiche et al., 2022; Magnusdottir et al., 2021; Van der Elst et al., 2006). Education is positively associated with full scale IQ (Ritchie & Tucker-Drob, 2018; Steinberg et al., 2005), which might explain why education was related to performance on the complex trials specifically. Moreover, cognitive reserve (Stern et al., 2023) has commonly been proposed as an explanation for how education is related to scores on Stroop tests (Hankee et al., 2016; Ktaiche et al., 2022; Seo et al., 2008; Zalonis et al., 2009). Relating to Stroop tests, Van der Elst et al. (2006) showed that individuals with low educational attainment had an accelerated lowering of performance with age compared to individuals with an average or high level of education. This indicates that the individuals with more education were resilient to age-related brain changes and pathology. However, our results indicated a positive effect of education on CWIT-3 and CWIT-4 scores that was independent of age. Therefore, our results are unlikely to be related to increased cognitive reserve. Compared with our results, some studies report stronger associations between performance on Stroop tests and education (Hankee et al., 2016; Magnusdottir et al., 2021), while others report comparable associations (Bayard et al., 2011; Troyer et al., 2006). The weak relationship between education and CWIT scores observed in our study might be influenced by sample characteristics in the normative sample. In particular, the Norwegian sample comprised individuals with relatively uniform and high educational attainment (M = 15.5, SD = 2.9). It follows that samples with uniform levels of education have reduced variance explained by educational attainment. Furthermore, some studies have indicated that the effect of education on scores could be less impactful for the highly educated (Van der Elst et al., 2006), and that the effect of education on Stroop performance could diminish after approximately 12 years (Hankee et al., 2016). Reports from Statistics Norway indicate that the educational level of the adult population is divided into three approximately equal parts (Statistics Norway, 2022): mandatory schooling (10 years of education), high school level including trade schools (≤13 years), and university degrees of various lengths (14+ years). Thus, the sample in this study had higher educational attainment than the population average, which may have contributed to the relatively weak effect of education on CWIT scores. However, the education range in our sample was 7 to 23 years, and pertinent educational effects on test performance are modelled in our norms at both lower and higher levels of education. The discrepancy between norms is difficult to pinpoint as it could be influenced by several other factors, including the normative estimation method and a variety of cultural influences like educational quality and availability. Regardless, differences between norms highlight the importance of the normative sample resembling the intended population in terms of sample characteristics and geography (Heaton et al., 1999; International Test Commission, 2001). Specifically, using estimates from foreign samples exhibiting strong effects of education (e.g., Peña-Casanova et al., 2009; Seo et al., 2008) would likely provide inaccurate estimates of performance in the Norwegian sample, where education evidently is not as relevant for predicting performance on the CWIT.

Sex differences
Women performed significantly better than men on CWIT-1 (color naming) and CWIT-3 (inhibition). Previous studies on various Stroop paradigms report inconsistent results regarding sex differences, with some studies reporting significant sex differences (Magnusdottir et al., 2021; Mitrushina et al., 2005; Seo et al., 2008; Tremblay et al., 2016; Van der Elst et al., 2006) while others do not (Adólfsdóttir et al., 2017; Bayard et al., 2011; Hankee et al., 2016; Zalonis et al., 2009). Despite this, any observed difference has consistently favored women. Therefore, it is likely that the effect is small and that a large sample size is needed to detect a sex difference on Stroop tests. A recent article by Sjoberg et al. (2023) proposed that the female advantage on Stroop paradigms is related to superior color-naming abilities, likely attributable to several specific verbal abilities relevant to performance on the task. These include increased speed on color labelling tasks and better performance on distractor suppression tasks. For a full review, please see Sjoberg et al. (2023). This could explain why we only found a female advantage on CWIT-1 (color naming) and CWIT-3 (inhibition), which have more color stimuli than CWIT-2 (word reading) and CWIT-4 (inhibition/switching).

Errors on CWIT-3 and CWIT-4
The number of errors on the CWIT was not related to age, education, or sex, which is surprising considering that the existing literature reports significant effects (Tremblay et al., 2016; Troyer et al., 2006; Van der Elst et al., 2006; Zalonis et al., 2009). Hankee et al. (2016) report that participants who made errors were significantly older and had less education compared to those with 0 errors. The present study did not find demographic differences between individuals with ≥4 errors and those with 0 errors. However, consistent with previous studies, our results indicate that errors were significantly related to worse performance on the task. On average, errors on the CWIT-3 and CWIT-4 were correlated with increased time to completion, and participants with ≥4 errors completed the CWIT-3 and CWIT-4 significantly slower. For clinical decision making, ≤3 errors on CWIT-3 and CWIT-4 should be considered the lower boundary for normal performance, corresponding to the ~11th–13th percentile (Table 5). Unfortunately, we do not provide normative estimates for errors on CWIT-1 and CWIT-2. Previous studies indicate that about one in 20 healthy participants make one error on the CWIT-1 or CWIT-2 (Bayard et al., 2011), and multiple errors on these subtests may therefore indicate issues concerning the validity of the test performance. For normative estimates on CWIT-1 and CWIT-2 we refer to the original D-KEFS norms by Delis et al. (2001).

Assessment of the original age-adjusted norms from D-KEFS and clinical implications
A key aim of this study was to assess the adequacy of the original age-adjusted norms from D-KEFS in our sample of healthy Norwegians (n = 1011). Higher age significantly predicted higher T-scores calculated using norms from D-KEFS. From Figure 1 we can see that the yellow line is above the reference line for T = 50, which means that the participants on average performed better than the normative mean from D-KEFS given their age. This indicates that the original norms from D-KEFS slightly exaggerated the detrimental effects of aging on CWIT performance in the Norwegian sample. As a result, the older participants in the Norwegian sample received slightly elevated T-scores on average. Previous studies have found that age-related decline on cognitive tests largely dissipates when adjusting for cerebrovascular pathology, degeneration of structural and functional brain connections, and other pathologies (Borghesani et al., 2013; Borland et al., 2020; Fjell et al., 2017; Harrington et al., 2018; Yu et al., 2015). Age-related decline on the CWIT could therefore be influenced by sub-clinical pathology. Notably, such sub-clinical pathology may be regarded as normal, since most studies with normal healthy participants screened for various pathological conditions indeed report a strong influence of age on scores from Stroop paradigms and other neuropsychological tests (Mitrushina et al., 2005). However, the extent may vary between cohorts. As a result, the comparatively weaker age effect observed in the Norwegian sample could be due to the Norwegian sample being healthier. These potential differences could be cultural, such as differences in lifestyle and access to health care, or simply cohort-specific, such as cerebrovascular disease prevalence in the study sample. For instance, the Norwegian sample consisted predominantly of highly educated individuals who were thoroughly screened, which may have caused an over-representation of protective factors in the sample.
The difference between norms may not only be due to cultural differences, as cohort differences are observed within cultures as well (Trahan et al., 2014). While data regarding Stroop tests specifically are scarce, the literature on other cognitive tests suggests that average cognitive functioning in today's elderly is improved compared to the elderly 20 years ago (Hessel et al., 2018; Skirbekk et al., 2013). For younger individuals it is less clear, with some studies showing that today's young may perform similarly or worse (Bratsberg & Rogeberg, 2018). The improvement of newer cohorts over older cohorts is called the Flynn effect, which stipulates that improvements in nutrition, educational attainment and quality, health care, health-promoting activities such as exercise, and reduction in cardiovascular disease cause newer cohorts to perform better on a variety of cognitive tasks (Skirbekk et al., 2013). Thus, the disparity between the Norwegian norms and the original age-adjusted norms from D-KEFS published in 2001 may also be due to time of measurement.
Unsurprisingly, the original age-adjusted norms from D-KEFS did not account for the difference between individuals with high or low educational attainment or the female advantage we observed in the Norwegian sample. As a result, the norms from D-KEFS on average produced higher than expected T-scores on the CWIT-2 ( 2  in 85 s. According to the Norwegian norms, her scores equate to T = 43 on CWIT-3 and T = 47 on CWIT-4, thus reflecting a below average performance. Using the D-KEFS norms her scores were T = 57 on both tasks, i.e. 1.4 SD and 1 SD higher compared to the Norwegian norms. For individuals with educational attainment or age closer to the sample average the choice of norms will on average yield less dissimilar results, but depending on the raw score of the individual the difference between the norms could still cause meaningful differences. For instance, using our Norwegian norms, a 55-year-old woman with 12 years of education with the same raw scores as in the previous example is estimated to score T = 32 on CWIT-3 and T = 36 on CWIT-4. In comparison, the D-KEFS norms estimate her scores at T = 40 and T = 43, respectively. As a result, the difference between the Norwegian norms and the D-KEFS norms (0.8 SD and 0.7 SD, respectively) is smaller compared to the previous example. However, the Norwegian norms indicate a score on CWIT-3 more than 1.5 SD below the normative mean, indicative of a potential deficit, while the D-KEFS norms indicate a score merely below average (1 SD below the normative mean). Thus, while the average difference between the norms was estimated at 4.4 T-score points on CWIT-3 and 3.2 T-score points on CWIT-4, the difference might vary more depending on the age and/or years of education of the individual. Furthermore, depending on the obtained raw score of the individual, these differences might lead to differences in diagnosis; however, the diagnostic accuracy of the norms needs to be validated in future studies using independent samples.
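The SD differences quoted in these examples follow directly from the T-score metric (mean 50, SD 10). The short Python sketch below reproduces the arithmetic for the second example using only the T-scores reported in the text; the function names are our own labels for illustration and are not part of the published norms or calculator.

```python
T_MEAN = 50.0  # T-score metric: normative mean
T_SD = 10.0    # T-score metric: normative standard deviation


def t_diff_in_sd(t_a: float, t_b: float) -> float:
    """Difference between two T-scores, expressed in SD units."""
    return (t_a - t_b) / T_SD


def sd_from_mean(t: float) -> float:
    """Distance of a T-score from the normative mean, in SD units."""
    return (t - T_MEAN) / T_SD


# Second example in the text: Norwegian norms give T = 32 (CWIT-3) and
# T = 36 (CWIT-4); the D-KEFS norms give T = 40 and T = 43.
print(t_diff_in_sd(40, 32))  # 0.8
print(t_diff_in_sd(43, 36))  # 0.7
print(sd_from_mean(32))      # -1.8, beyond a 1.5 SD cut-off
print(sd_from_mean(40))      # -1.0, within a 1.5 SD cut-off
```

The same raw performance therefore falls on different sides of a 1.5 SD cut-off depending on which set of norms is applied.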
We found that the proportion of participants scoring below a conventional cut-off set at 1.5 SD below the normative mean significantly differed from the expected proportion when using the original age-adjusted norms from D-KEFS. From Figure 2 it is apparent that the D-KEFS norms located fewer-than-expected participants with low scores on all CWIT subtests. Furthermore, the percentage of participants with low scores significantly differed between the norms, with the D-KEFS norms identifying significantly fewer participants (p < .001). This indicates that the norms from D-KEFS have a lower sensitivity for identifying individuals with low scores on the CWIT in the Norwegian sample, which might have important clinical implications. Although not statistically significant, the Norwegian norms located more participants with low scores on CWIT-2 and CWIT-4 than we expected (8.6% and 8.3%, respectively). The Norwegian norms were expected to match the theoretical base rate of 6.7% more closely since the norms were produced in the same sample and scores were transformed to follow a normal distribution. The difference is likely caused by some skewness in the CWIT-2 and CWIT-4 scaled scores despite the normalization procedures, which caused a slight over-representation of participants around this cut-off. Future studies should assess the Norwegian norms in an independent sample of Norwegians to address whether the new norms equal the theoretical base rate of impairment, and preferably investigate the diagnostic accuracy of the norms for diagnosing MCI in samples including patients with confirmed MCI.
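The theoretical base rate of 6.7% is the proportion of a normal distribution that falls 1.5 SD or more below the mean. The Python sketch below reproduces that figure and illustrates how an observed rate could be compared against it. It is an illustration only: the helper names, the simple normal-approximation (Wald) interval, and the assumption that all 1011 participants contributed to the CWIT-2 rate are ours, and the study's own test procedure may differ.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def wald_interval(p_hat: float, n: int, z_crit: float) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion."""
    half_width = z_crit * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Expected proportion scoring 1.5 SD or more below the mean.
expected = normal_cdf(-1.5)
print(round(100 * expected, 1))  # 6.7

# Observed rate reported for the Norwegian norms on CWIT-2 (8.6%).
# Assumption: n = 1011; z_crit = 2.576 gives an approximate 99% interval.
low, high = wald_interval(0.086, 1011, 2.576)
print(low <= expected <= high)  # True
```

Under these assumptions the interval around 8.6% still contains 6.7%, which agrees with the statement that the excess on CWIT-2 was not statistically significant.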

Correlations between baseline performance and follow-up on the CWIT
All psychological tests should have available evidence of reliability that is relevant to the intended population (International Test Commission, 2001). Ryder (2021) identified that tests from the D-KEFS battery were lacking reliability estimates based on a Norwegian sample. In this study we had test-retest scores based on a relatively long follow-up (M = 3.4 years), and test-retest correlations are therefore assumed to not just be a measure of reliability, but also to reflect true change rates with age. For instance, a low correlation would typically be interpreted as low reliability, but it could also mean that some participants have a different slope (i.e. change rate) from baseline to follow-up. We therefore characterize the test-retest correlations as stability of scores over time. A limitation concerning these analyses is that the cognitive status of participants was not assessed at follow-up examinations, and it is therefore possible that some participants worsened in their cognitive status between baseline and follow-up. This would likely cause lower ICCs, meaning that our estimates might underestimate the stability of scores over time. However, as seen in Table A2, the mean scores at baseline and follow-up were on average very similar, and it is doubtful whether this had a large influence on the estimates. The results from the reliability analysis indicated moderate to good correlations for all measures, with slightly better correlations for the complex trials CWIT-3 and CWIT-4 (Table 8). Using the Norwegian norms resulted in marginally better correlations compared to the original D-KEFS norms, likely due to the slight mal-adjustments in age, education, and sex previously reported. The difference between coefficients using Norwegian norms and D-KEFS norms was not tested, although the 95% CIs overlapped and the difference in coefficients would likely not fulfill conventional criteria for statistical significance.
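To make the stability index concrete, the Python sketch below computes an intraclass correlation from baseline and follow-up scores. The ICC variant is not restated in this section, so the sketch assumes ICC(2,1) in the Shrout and Fleiss notation (two-way random effects, absolute agreement, single measurement); the function name and the toy inputs are hypothetical and are not study data.

```python
def icc_2_1(scores: list[list[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    `scores` holds one row per participant and one column per occasion
    (e.g. [baseline, follow_up]).
    """
    n = len(scores)      # number of participants
    k = len(scores[0])   # number of occasions
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Toy inputs (hypothetical): identical scores at both occasions.
print(icc_2_1([[1, 1], [2, 2], [3, 3]]))            # 1.0
# Same rank order, but everyone shifts by one point at follow-up.
print(round(icc_2_1([[1, 2], [2, 3], [3, 4]]), 3))  # 0.667
```

The second toy case shows why an absolute-agreement ICC over a long follow-up reflects more than measurement error: a uniform change in scores lowers the coefficient even though the ordering of participants is perfectly preserved.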

Limitations
The current study is subject to some limitations. Firstly, neuropsychological norms are typically intended to give an estimate of an individual's score compared to a broad target population, e.g. healthy Norwegians between 20 and 85 years old. The representativeness of a normative sample is therefore crucial for the accuracy of the normative estimates. Most of the participants included in this study were recruited as healthy participants from advertisements, universities, and workplaces, and the sample could be susceptible to biases associated with convenience sampling methods. That is, the sample estimates may not generalize to the broad target population due to unknown biases arising from a non-probability sampling method (Jager et al., 2017). Relatedly, the sample was composed of native Norwegian speakers predominantly of European ethnicity, which does not reflect the multicultural landscape in Norway. Unfortunately, we did not record the ethnoracial background of participants; however, most were likely of European ethnicity. As a result, the norms are likely less accurate for people with Norwegian as a second language and immigrants to Norway. Despite this, as the first normative study outside the original age-adjusted norms presented in the D-KEFS manual (Delis et al., 2001), we believe our norms contribute to an improvement in the accuracy of CWIT assessments in Norway.
Another limitation of this study is the lack of middle-aged participants. However, the norms rely on the joint estimation of the average effect across the included age span to calculate predicted scores and deviation from the predicted scores. Thus, it is unlikely that the lack of middle-aged participants greatly affects the norms' ability to predict scores for individuals in this age range. Unfortunately, we were not able to source an independent sample to compare the new Norwegian norms with the original age-adjusted norms from D-KEFS. Instead, we assessed the adequacy of the original D-KEFS norms in our normative sample. Future studies should assess the validity of both the new norms and the original D-KEFS norms in an independent sample of Norwegians. Lastly, we did not formally screen for visual impairment or color blindness in participants but relied on self-report of visual deficits. Though participants were encouraged to use glasses when applicable, we cannot guarantee that undiagnosed visual impairment did not influence some participants' scores.

Conclusion
We propose regression-based norms for the D-KEFS Color-Word Interference Test (CWIT) based on a sample of healthy Norwegian adults between 20 and 85 years old (n = 1011). As far as we know, this is the first published study providing norms on the D-KEFS CWIT apart from the original age-adjusted norms from D-KEFS (Delis et al., 2001). Our results indicate that lower age, higher education, and female sex significantly predicted better performance on the CWIT. The original age-adjusted norms from D-KEFS on average overestimated the difference between young and old participants and did not adjust for the female advantage or effects of education in the Norwegian sample. Consequently, normative estimates from the original D-KEFS norms may be inaccurate for individuals with either low or high educational attainment, especially those in midlife. The norms from D-KEFS identified significantly fewer-than-expected participants with low scores on CWIT 1-4 in the Norwegian sample. Low scores were defined as scores 1.5 SD or more below the normative mean. Thus, the D-KEFS norms had a lower sensitivity for detecting individuals with potential executive deficits compared to the Norwegian norms. In the Norwegian sample, ≥4 errors on CWIT-3 and CWIT-4 corresponded to the ~5th percentile, indicative of a borderline impaired performance. Errors were unrelated to demographical variables, but an increased number of errors was significantly related to slower time to completion on the CWIT-3 and CWIT-4. The CWIT showed moderate to good test-retest stability in the Norwegian sample with a 3.4-year average follow-up time. For ease of use and quick computation of the norms we provide a normative calculator available at (https://contattafiles.s3.us-west-1.amazonaws.com/tnt30503/ACkqU46CjUb0rss/cwit-calc.html).
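As a rough illustration of how regression-based norms of this kind are typically applied, the Python sketch below predicts a scaled score from mean-centered age (linear and squared terms), mean-centered education, and sex, and then converts the deviation from the prediction into a T-score. Every coefficient, mean, and residual SD in the example is a hypothetical placeholder chosen for readability, not a value from Table 4, and the class and function names are our own; the published regression equations or the calculator linked above should be used for actual scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormModel:
    """Regression equation for one subtest (all values are placeholders)."""
    intercept: float
    b_age: float
    b_age_sq: float
    b_education: float
    b_sex: float            # sex coded 0 = men, 1 = women
    sd_residual: float
    mean_age: float         # used for mean centering
    mean_education: float   # used for mean centering


def predicted_scaled_score(m: NormModel, age: float, education: float, sex: int) -> float:
    """Expected scaled score given age, years of education, and sex."""
    age_c = age - m.mean_age
    edu_c = education - m.mean_education
    return (m.intercept + m.b_age * age_c + m.b_age_sq * age_c ** 2
            + m.b_education * edu_c + m.b_sex * sex)


def t_score(m: NormModel, observed: float, age: float, education: float, sex: int) -> float:
    """Demographically adjusted T-score (mean 50, SD 10)."""
    predicted = predicted_scaled_score(m, age, education, sex)
    z = (observed - predicted) / m.sd_residual
    return 50.0 + 10.0 * z


# HYPOTHETICAL coefficients for illustration only; NOT the published norms.
demo = NormModel(intercept=10.0, b_age=-0.05, b_age_sq=-0.001,
                 b_education=0.1, b_sex=0.3, sd_residual=2.5,
                 mean_age=50.0, mean_education=15.0)

# A man at the placeholder mean age and education who scores exactly as
# predicted receives T = 50; one residual SD above prediction gives T = 60.
print(t_score(demo, observed=10.0, age=50, education=15, sex=0))  # 50.0
print(t_score(demo, observed=12.5, age=50, education=15, sex=0))  # 60.0
```

Because the prediction depends on all demographic terms at once, two people with the same raw performance can receive different T-scores, which is the behaviour the worked examples in the discussion illustrate.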
Note. Linear regression lines are fitted for years of education and quadratic lines for age; for all figures a horizontal line at T = 50 represents the ideal normative correction, and deviation from this line may indicate maladjustment in the norms. CWIT = Color-Word Interference Test. D-KEFS = Delis-Kaplan Executive Function System.

Note. Dotted line indicates the expected base rate for 1.5 SD below the normative mean (6.7%). Error bars indicate the 99% confidence interval (CI) around the estimate. *CI does not contain the expected base rate (p < .01). Paired-samples proportion tests indicated significant differences between rates from US norms and Norwegian norms on all CWIT subtests (p < .001).

Figure A3. Plots of T-scores on CWIT-1 and CWIT-2 calculated applying norms from D-KEFS, Norwegian norms, and T-scores unadjusted for demographic variables.
Note. SD = standard deviation of the mean; n = count; CWIT = Color-Word Interference Test; Min = lowest score; Max = highest score; MMSE = Mini Mental State Examination; ¹ 76 participants had missing values on errors and 40 on MMSE.

Table 2. Pearson correlation between time to completion on CWIT 1-4 and demographical variables.
Note. Scaled scores are not adjusted for demographical variables and are only used for computing the regression equations in Table 4; CWIT = Color-Word Interference Test.
Note. b = unstandardized beta coefficient; S.E. = standard error of the unstandardized beta coefficient; SD residual = standard deviation of the residuals; sex was coded 0 = men, 1 = women; age and education were mean centered, thus age = (age -46.
Note. Cumulative percentages show the proportion of the normative sample that attained k number of errors (or more); CWIT = Color-Word Interference Test.

Table 6. Results from multiple regression analysis on T-scores calculated with the original D-KEFS norms in the normative group (n = 1011).
Note. b = unstandardized regression coefficient; p = p-value; partial R² = explained variance of predictor variable; adj. R² = explained variance of combined predictor variables; significant coefficients (p < .05) indicate mal-adjustment in the norms; age and education were mean centered; CWIT = Color-Word Interference Test.

Table 7. Paired sample t-tests between T-scores computed using the Norwegian norms and original age-adjusted norms from D-KEFS.

Table 8. Intra-class correlations between baseline and follow-up on D-KEFS CWIT subtests based on a sub-set of the normative sample (n = 335).