Statistics and psychometrics for the CAT-N: Documenting the Comprehensive Aphasia Test for Norwegian

ABSTRACT
Background: Ivanova and Hallowell (2013) emphasise the importance of reporting on test development and psychometric properties of tests in international journals. Such documentation may serve as references for other test developers and enable researchers and clinicians to assess reliability and validity issues in tests made for a language unknown to them. The CAT (Comprehensive Aphasia Test) is a general aphasia test which examines linguistic skills broadly, within the cognitive neuropsychological tradition; it has been and is being adapted to a number of languages.
Aims: The aim of this article is to document the statistical procedures used in the development and standardisation of the Norwegian adaptation of the CAT (CAT-N), to document its psychometric properties, and to discuss validity and reliability issues.
Methods & procedures: The adaptation of the CAT-N involved careful design of subtests and test items, taking into account features like word frequency, imageability and phonological and other language-specific linguistic variables. The prototype was tested on a normative sample of 85 persons with aphasia and a control group of 84 persons without aphasia. The items of some subtests were reordered based on the norming. A new scoring scheme was developed for two subtests of Picture description. The CAT-N includes the Aphasia Impact Questionnaire (AIQ), which is a new patient reported outcome measure developed for the CAT.
Outcomes & results: Statistical methods are documented and discussed. Descriptive statistics for subtests and linguistic domains are presented. Internal consistency and partial inter-rater and intra-rater reliability aspects are investigated and documented. Construct validity is investigated and documented by factor analysis. Sensitivity and specificity are investigated through pairwise comparisons for subtests and domains and the use of normal-language cutoff values. Concurrent validity is investigated through comparisons with results from an existing aphasia test for Norwegian (NBAA).
Conclusions: The CAT-N is shown to have good reliability and validity, and it distinguishes well between persons with and without aphasia. The article provides explicit documentation of design decisions which may be useful in future adaptations of the CAT.


Introduction
There is a widely acknowledged need for valid and comparable language assessment tools for people with aphasia (PWA), for research purposes as well as for clinical use. In aphasia research, there is huge variability across countries and disciplines (Gitterman et al., 2012), making comparison across languages and groups difficult. This restricts the possibilities for large-scale investigations, for both monolingual and multilingual individuals with aphasia. For clinical purposes, there is again variation between languages: for English, for instance, there are many language assessment tools to choose from (see e.g. Howard et al., 2010), while for other languages, e.g. Croatian, Turkish, Hungarian, there are few and of varying quality (Kuvač Kraljević et al., 2020; Maviş et al., 2021; Zakariás & Lukács, 2022). For Norwegian, there are several aphasia tests of different types, both general and more specific ones. However, Norsk Grunntest for Afasi (the Norwegian Basic Aphasia Assessment, NBAA; Reinvang & Engvik, 1980) is the only comprehensive and standardised aphasia test, and it is primarily aimed at classifying different aphasia syndromes. Furthermore, it is relatively old and slightly outdated.
The need for comparable assessment tools between different languages has been explicitly addressed in the COST network Collaboration of Aphasia Trialists (CATs; IS1208, 2013-2023, since 2017 funded by the Tavistock Trust for Aphasia). The network includes participants representing 40 countries and 43 languages from across the world (https://www.aphasiatrials.org/). Within this network, the working group on "Aphasia Assessment and Outcomes" decided in 2013 to adapt the Comprehensive Aphasia Test (CAT; Swinburn et al., 2004) to the different languages. This decision was based on the need for a comprehensive, yet relatively short test, including cognitive, linguistic and psychosocial aspects. The CAT is widely used in English-speaking countries, and validated and normed for English (Howard et al., 2010). Furthermore, the linguistic parts of the CAT are explicitly based on several psycholinguistic and linguistic variables (e.g. frequency, imageability, word and sentence length and complexity), facilitating an adaptation into different languages (Fyndanis et al., 2017). Presently, the test is under adaptation into 16 languages, in addition to the five that are already published (Croatian: Swinburn et al., 2020; Dutch: Visch-Brink et al., 2014; Hungarian: Zakariás & Lukács, 2022; Norwegian: Swinburn et al., 2021; Turkish: Maviş et al., 2021).
It is well established that a simple translation of an assessment instrument is never appropriate to obtain comparability (Bates et al., 1991; Paradis & Libben, 1987). However, neither is an adaptation from one language to another straightforward. Languages differ in many aspects, and the more a language in an adapted version differs from the original, the more difficult it is to assess important aspects of that language, yet maintain comparability. Such challenges and solutions in the adaptation of the CAT are discussed in detail in Fyndanis et al. (2017).
In addition to the linguistic challenges, a great number of design decisions pertaining to statistical procedures go into the development and standardisation of a test. Unfortunately, these often remain undocumented and unpublished (Ivanova & Hallowell, 2013). This lack of available documentation places a heavier burden than necessary on future test developers, and it increases the risk of introducing unnecessary discrepancies between test adaptations and hence reduces their intended comparability. The main aim of the present paper is to provide thorough documentation of the statistical procedures used in the development of the CAT-N, which may then function as guidelines for future adaptations of the CAT. In addition, we briefly outline aspects of the adaptation process, describe innovations of the CAT-N and present and discuss its psychometric properties, including reliability and validity aspects.

The adaptation of the CAT-N
The adaptation to Norwegian was conducted by a team of speech and language therapists (SLTs) and linguists (the Norwegian adaptation team). To ensure linguistic and cultural comparability across the different language versions, a set of criteria and guidelines was developed within the working group of "Aphasia Assessment and Outcomes" mentioned above. First, the fundamental decision was made to use the same number of subtests and items in all versions of the CAT. Furthermore, since the CAT is explicitly based on several psycholinguistic variables, ways to establish those within each language had to be agreed upon. With respect to frequency (how often a word is used), measures should ideally be taken from spoken corpora, but written corpora could also be used since spoken and written frequency measures generally correlate well (Pastizzo & Carbone, 2007). For Norwegian, we used the web-based written corpus NoWaC, based on 700 million words taken from the .no domain (Guevara, 2010). As for imageability (how easily a word evokes a mental image), values have to be based on ratings from native speakers (Paivio et al., 1968). For Norwegian, we used imageability values from the database Norwegian Words (Simonsen et al., 2013; Lind et al., 2015). This is a searchable lexical database containing approximately 1650 nouns, verbs, and adjectives, for which one can get information about (psycho)linguistic variables that are known to affect language acquisition, storage, and processing of words (e.g. imageability, frequency, age of acquisition, sound structure, and word length).
The words used in the test could most often not be directly translated from English. For example, in one subtest, the English monosyllabic word pear was used. In Norwegian, the corresponding word is the disyllabic pære /²pæ:ɾə/. If the point was to see whether persons with aphasia understood the word for pear in their language, a translation would have been appropriate. However, in the subtest in question only single-syllable words should be included. Furthermore, one distractor word was supposed to sound nearly the same, to see whether the person could distinguish phonologically similar words. In the English version of the CAT this word is bear, but in Norwegian the translation is bjørn /bjø:ɳ/, which would not work. Thus, in order to fulfil the linguistic criteria and maintain comparability across languages, actual test words in the various languages are often very different.
Many of the subtests in the CAT are picture-based. For the new versions, the Croatian artist Marko Belić was engaged to draw new black-and-white illustrations. Both for linguistic and cultural reasons, many new pictures were needed. To make sure that a picture actually evoked the right word, naming agreement tests were conducted in all languages, where only pictures named with the same word by at least 80 % of the (20+) raters were accepted. Many pictures had to be redrawn many times; for example, the picture for the Norwegian word for mouse (mus) needed several rounds with different sizes and the addition of cheese so as not to be confused with rat (rotte). Similarly, the picture for the Norwegian word for waffle (vaffel) needed to show its local, heart-shaped form to be recognised as such; although the cognate word in English is very similar, the shape of the item is different.

Innovations in the CAT-N
While we took care to follow all criteria and guidelines that were agreed upon across languages in the "Aphasia Assessment and Outcomes" working group, we also made certain innovations. One concerned the scoring of the picture descriptions (subtests 19 and 27). Another was to rearrange the order of some of the items in five of the subtests (subtests 7-10 and 17). The third innovation concerned the replacement of the Disability Questionnaire (DQ), which was part of the original CAT from 2004, with the Aphasia Impact Questionnaire (AIQ).

Picture descriptions
The CAT includes two narratives elicited from the same picture, one oral and one written. The picture shows a man sleeping in his chair while a cat on a shelf above tries to catch goldfish from a fishbowl, at the same time pushing down a row of books, which fall towards the man's head. A child playing on the floor tries to awaken and alert the man. The working group agreed that the original scoring system was too complex, in particular for clinical purposes, but did not decide on a common scoring system.
For the CAT-N, we wanted to score both grammatical skills (form) and how well the participant could describe what happened in the picture (content). We took as our point of departure the scoring system from the Dutch adaptation (CAT-NL; Visch-Brink et al., 2014) and decided on three parameters for form (tempo/fluency (relevant only for oral description); grammatical complexity; grammatical correctness), and four content units (man sleeps; girl points/awakens/alerts; cat chases fish; books fall). Each of the form parameters was scored on a scale from 3 (good) via 2 (medium) and 1 (weak) to 0 (missing), and the content units on a scale from 2 (complete and precise) via 1 (present, but not complete and/or precise) to 0 (missing). The scores were logged on a separate scoring sheet, an English translation of which is shown for the oral subtest in Figure 1. (See sections below for more details on scoring and on inter-rater reliability for the Picture descriptions.)

Reordering of items in five subtests
For five of the subtests there is an abortion rule (subtests 7-10 and 17); if the participant fails to obtain a positive score for a certain number of consecutive items, that subtest should be aborted. In the norming study, the items in these tests were given in an arbitrary order, and the test administrators were asked not to use the abortion rule if possible, in order for us to be able to investigate whether some items were more difficult than others. For the published test, the items were reordered and put in ascending order of difficulty (see below for details).

Aphasia Impact Questionnaire
The Disability Questionnaire (DQ) was not included in the adaptation of the CAT across languages. However, as reported by Swinburn et al. (2019), it was decided to replace the original DQ in the CAT with a new patient reported outcome measure: the Aphasia Impact Questionnaire (AIQ). (This is to be published in the new, second edition of the English CAT, expected in 2023. The CAT-N is thus the first CAT with AIQ included.) The AIQ is constructed to explore and evaluate the consequences of living with aphasia, and assesses communication, participation, and emotional well-being through a picture-based scale with five response alternatives. For the Norwegian version, we carried out a focus group interview with a group of five persons with aphasia, following an interview guide primarily focusing on reading and writing, and whether the Norwegian AIQ should include digital communication. This resulted in the addition of one question in the AIQ of CAT-N concerning how aphasia affects daily functioning in a digital world, increasing the number of questions in the AIQ from 21 to 22. The final version was then tested on 21 persons with aphasia, not to establish norms, but to see how the AIQ functions in actual use. The CAT-N is the first test for Norwegian which addresses both impairment-based and consequence-focused issues.

Methods of statistical analysis
All calculations and statistical analyses were carried out using R, a software environment for statistical computing (R Core Team, 2020). Following common convention, we chose α = 0.05 as level of significance. To control for familywise error rate (FWER), significance levels were adjusted using Šidák correction when relevant (Abdi, 2007; Šidák, 1967).
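The Šidák adjustment can be illustrated with a minimal sketch. It is written in Python rather than R purely for illustration; the function name `sidak_alpha` is hypothetical, and the formula is the standard one for m independent comparisons, not code taken from the CAT-N project.

```python
def sidak_alpha(alpha: float, m: int) -> float:
    """Per-comparison significance level that keeps the familywise
    error rate at `alpha` across `m` independent comparisons."""
    if m < 1:
        raise ValueError("m must be at least 1")
    # Solve 1 - (1 - a)^m = alpha for the per-comparison level a.
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


if __name__ == "__main__":
    for m in (1, 5, 10):
        print(m, sidak_alpha(0.05, m))
```

With a single comparison the level is unchanged; as the number of comparisons grows, the adjusted level stays slightly above the Bonferroni value α/m.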
Comparisons of two independent samples were made using the rank-based Wilcoxon test (Wilcoxon, 1945; Zimmerman & Zumbo, 1990). Since many of the scores from the test are normally distributed, we could sometimes alternatively have used the parametric Welch's t-test, but chose to use the same two-sample test for all analyses in order to simplify comparison of results. Correspondingly, all two-sample effect sizes are given as the point-biserial correlation coefficient r_pb (LeBlanc & Cox, 2017). Cohen (1992, p. 99) gives guidelines for interpreting values of r_pb:
r_pb ≈ 0.1: small effect
r_pb ≈ 0.3: medium effect
r_pb ≈ 0.5: large effect
We do not report values of Cohen's d, since many of the distributions are skewed; in particular, the limited dispersion of the control group values would artificially amplify values of d.
Correlations were tested using Spearman's rank-based correlation and the effects are given as r_s (see e.g. Levshina, 2015, pp. 130-134). 95 % confidence intervals (CI) for r_s were calculated using the function spearman.ci in the R package RVAideMemoire (Hervé, 2021).
Factor analysis with varimax rotation was carried out using the factanal function in the R package stats (R Core Team, 2020). We defined the minimal adequate model as the smallest model which was not significantly different from a perfect fit, using the built-in χ² statistic of the factanal function.
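As an illustration of the rank-based correlation, the sketch below computes Spearman's coefficient as the Pearson correlation of tie-averaged ranks. It is a plain Python illustration with hypothetical function names, not the R code used in the study, and it does not reproduce the confidence intervals obtained from spearman.ci.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks in ascending order; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```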

Overall design
The entire CAT-N consists of 6 cognitive and 21 linguistic subtests, in addition to the AIQ. Of the 27 subtests, 24 contribute to a total of 9 domains, of which 8 are linguistic domains. Two of the linguistic domains consist of just one subtest each, whereas three of the cognitive subtests do not contribute to any domain. The overview of subtests and domains is shown in Table 1 and corresponds to the original CAT.

Respondents and procedure
When the adaptation was completed, we recruited SLTs across the whole country to collect data for the norming study. Following Ivanova and Hallowell's (2013, p. 906) strong recommendation, our initial aim was a sample of 100 PWA and 100 persons without aphasia as a control group (CG). When the Covid-19 pandemic started, testing became difficult, so the ultimate number of participants was 85 PWA and a CG of 84. Inclusion criteria for PWA were 1) known diagnosis of aphasia as a result of stroke, 2) aphasia in all phases from acute to chronic, 3) having capacity to consent. The PWA had already been assessed according to the general procedure of their SLT; for about 50 %, test scores from the Norwegian Basic Aphasia Assessment (Reinvang & Engvik, 1980) were supplied. Post-onset times (shown in Table 2) varied from 20 days to almost 14 years. We have no further information on neurological background or aphasia severity.
In terms of age, the two groups are quite similar, but the proportion of women is greater in the control group than among the PWA, as is the proportion of people with more than 3 years of higher education. As shown in Table 5, N varies somewhat between the subtests. The reason for this is that individual subtests occasionally could not be scored, typically because the test person accidentally was given faulty instructions. Also, one PWA became unavailable after subtest 12.
SLTs collected test data for the PWA by scoring the various subtests following the instructions on the scoring sheets. The two Picture descriptions were not scored by the SLTs, however; recordings of the Oral descriptions were transcribed and both the transcriptions and the Written descriptions were scored by members of the project group, three SLTs and one linguist.

Basic scoring of items and subtests
Many subtest items are given raw scores 1 for correct answer and 0 for erroneous or lacking answer. In some subtests, there is a range of obtainable scores for each item, often 0-2. In a few subtests, the scoring scheme is a bit more complicated.
Raw scores of most subtests are obtained by simple addition of the scores of the individual test items. In subtest 1, Line bisection, a lacking answer for one or more items results in no score (NA) for that subtest.
Raw scores of domains are obtained by simple addition of the raw scores of the subtests in that domain. This procedure has the consequence that some subtests contribute considerably more to the domain score than others.
All subtests and domains thus have a theoretical lower and upper bound of their scores. We call these the pessimum value and the optimum value, respectively. These theoretical scores are not achieved in all instances, however, and the smallest and largest actual score of a subtest or a domain are denoted the minimum score and the maximum score, respectively. The minimum and maximum will typically vary between PWA and the CG.

T-scores
T-scores are scores which are standardised to mean (m) 50 and standard deviation (sd) 10 (Clark-Carter, 2005, p. 2067; Frick et al., 2010, pp. 28-29). Raw scores both from subtests and from domains were transformed to T-scores using the rank-based Rankit algorithm (Bliss et al., 1956), which is designed to transform any distribution into one which is approximately normal and with known mean and sd. Solomon and Sawilowsky (2009) compare the four most commonly used transformation algorithms and conclude that the Rankit algorithm has the best performance in general and particularly for heavily skewed raw scores like the ones present in most of the subtests. The outline of the algorithm is as follows:
(1) Sort all raw scores in ascending order of size.
(2) Assign a rank value to each raw score, so that the lowest raw score becomes 1, the second lowest 2, etc. Ties are to be given the mean of their rank values.
(3) Subtract 0.5 from each rank value.
(4) Divide each resulting value by the total number of values.
(5) Apply the inverse cumulative normal distribution function to these values, with the parameters m = 50, sd = 10.
(6) Round to nearest integer. Rounding is not strictly necessary, but presenting decimals gives a false impression of higher accuracy than is actually present.
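The six steps above can be sketched as follows. This is an illustrative Python version using only the standard library; the function name `rankit_t_scores` and its parameters are hypothetical, and the sketch is not the R code used for the CAT-N.

```python
from statistics import NormalDist


def rankit_t_scores(raw_scores, m=50.0, sd=10.0):
    """Transform raw scores to rounded T-scores with the Rankit steps (1)-(6)."""
    n = len(raw_scores)
    # Step 1: sort (indices of) the raw scores in ascending order.
    order = sorted(range(n), key=lambda i: raw_scores[i])
    # Step 2: assign 1-based ranks; ties receive the mean of their ranks.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and raw_scores[order[j + 1]] == raw_scores[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    # Steps 3-5: (rank - 0.5) / n, then the inverse normal CDF with m and sd.
    target = NormalDist(mu=m, sigma=sd)
    unrounded = [target.inv_cdf((r - 0.5) / n) for r in ranks]
    # Step 6: round to the nearest integer (Python rounds exact halves to even).
    return [round(t) for t in unrounded]


if __name__ == "__main__":
    print(rankit_t_scores([3, 1, 2]))
```

Because the transformation depends only on ranks, two score sets with the same ordering and the same tie pattern receive identical T-scores, however different their raw values are.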
The above algorithm was applied to the set of raw scores of the PWA for each subtest and each domain. Since all possible test scores were not present as raw scores in the norm set, conversion keys had to be calculated for the missing values; any values below the minimum value in the norm set were given the T-score given to the minimum raw score and correspondingly for values above the maximum value in the norm set. Hence, obtaining a T-score outside the range present in the normative sample is not possible. Any missing values within the range of actual scores were calculated as linear interpolations of the nearest non-missing (unrounded) T-scores, followed by rounding.
The advantage of T-scores over for example z-scores is that the resulting numbers are more manageable, ranging typically between 20 and 80. The actual boundaries depend on the number of observations in the normative set. Hence, the range of T-scores may vary between different versions of the CAT, depending on the size of the normative set. With N = 85, all T-scores lie between 25 and 75. Given a normal distribution of T-scores, the values may be used to estimate percentiles (see Table 3).
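A conversion key of this kind can be sketched as below. The sketch assumes integer raw scores and takes as input a mapping from each raw score observed in the norm set to its unrounded T-score; the function name `conversion_key` and the example values are hypothetical.

```python
def conversion_key(observed, pessimum, optimum):
    """Build a raw-score -> T-score key for every possible integer raw score.

    `observed` maps each raw score present in the norm set to its
    unrounded T-score (assumed to be supplied by the Rankit step).
    """
    known = sorted(observed)
    lowest, highest = known[0], known[-1]
    key = {}
    for raw in range(pessimum, optimum + 1):
        if raw <= lowest:
            t = observed[lowest]      # clamp below the observed minimum
        elif raw >= highest:
            t = observed[highest]     # clamp above the observed maximum
        elif raw in observed:
            t = observed[raw]
        else:
            # Linear interpolation between the nearest observed neighbours.
            below = max(k for k in known if k < raw)
            above = min(k for k in known if k > raw)
            fraction = (raw - below) / (above - below)
            t = observed[below] + fraction * (observed[above] - observed[below])
        key[raw] = round(t)           # rounding happens last
    return key


if __name__ == "__main__":
    # Invented example: only raw scores 2, 5 and 6 occur in the norm set.
    print(conversion_key({2: 30.0, 5: 48.0, 6: 60.0}, pessimum=0, optimum=8))
```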
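The dependence of the T-score range on the size of the normative set, and the link between T-scores and percentiles, can be illustrated with a short Python sketch. The function names are hypothetical; the bounds assume that the lowest and the highest raw score each occur only once, so that their ranks are 1 and N.

```python
from statistics import NormalDist

T_DIST = NormalDist(mu=50, sigma=10)


def t_score_bounds(n):
    """Lowest and highest rounded T-score obtainable from n untied observations."""
    lowest = round(T_DIST.inv_cdf(0.5 / n))          # rank 1
    highest = round(T_DIST.inv_cdf((n - 0.5) / n))   # rank n
    return lowest, highest


def percentile(t_score):
    """Estimated percentile of a T-score, assuming normally distributed T-scores."""
    return 100 * T_DIST.cdf(t_score)


if __name__ == "__main__":
    print(t_score_bounds(85))
    print(percentile(60))
```

For N = 85 the bounds come out as 25 and 75, in line with the range stated in the text.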
For raw score distributions which are symmetrical and without prominent mode values near the boundaries, the transformation algorithm works well in transforming the original distribution into a normal distribution. However, most of the subtests have heavily left- [...]
For the domain variables, on the other hand, the resulting distributions of T-scores are largely normal. Written picture description and Memory are the most obvious exceptions. In the latter, 30 out of 85 PWA score the optimum, whereas in the former, 27 out of 84 PWA score the pessimum, i.e. zero (and 5 score the optimum). For Memory, this is a natural consequence of the capacity of memory not necessarily being affected by aphasia (Papathanasiou et al., 2017, p. 4). For Written picture description, the reason is likely to be a near complete loss of writing ability in many of the PWA in the normative sample. We assume these to be natural properties of any test result from a normative sample of PWA.
Normality scores for the domain T-scores are given in Table 4; left and right skews are represented by negative and positive values of γ₁, respectively. As can be seen, Spoken picture description and Reading are also somewhat skewed, although the p-values from the Shapiro-Wilk normality test are above the conventional 0.05 level. The skew in both these domains is caused by a clustering of values near the optimum. In Spoken picture description, 36 of 82 PWA (44 %) score between 14 and the optimum raw score of 17. In Reading, 31 of 84 PWA (37 %) score between 66 and the optimum raw score of 70. In these two domains, it is reasonable to surmise that the subtests were slightly too easy for the normative sample and that a more desirable distribution could have been obtained with slightly increased difficulty. As explained above, the low difficulty of the subtest results in lower optimum T-scores when the test is administered, in the case of Reading causing the optimum T-score to be only 66 (see Figure 2), noticeably lower than the possible optimum of 75 for N = 84. For Memory and Written picture description, the deviation from normality is too substantial for parametric methods of analysis to be applied when these domains are involved.

Scoring of Picture descriptions: oral and written
The two subtests for picture description, oral and written, comprise items related to form and items related to content. In both subtests, there are 4 items related to content, each with an optimum score of 2 points, yielding a total optimum score of 8 for content. The optimum score for each formal item is 3, however, and the number of formal items differs between the subtests: 3 for the oral descriptions and 2 for the written, yielding total optimum scores for form of 9 and 6, respectively. Figure 1 displays an English translation of the scoring sheet for Oral picture description.
Fluency is scored subjectively based on speed, pausing, hesitation, self-correction and whether contributions by the SLT are needed. Further elaboration and example scoring sheets for both Written descriptions and transcribed Oral descriptions are provided in the manual (Swinburn et al., 2021).
Originally, we wanted to weight content and form scores to make each contribute 50 % of the subtest score. Two mathematical issues turned out to give rise to some unwanted properties in such weighted scores. First, raw scores are converted to T-scores using a rank-based algorithm (see above), which, like all ranks, will magnify smaller differences and diminish larger differences. Since the optimum values for content and form are of similar size, weighted sums scaled up to for example 100 formed clusters of almost identical values, separated by substantial value gaps. Converting these clustered values into ranks gave undue importance to the small within-cluster differences and thus altered the characteristic properties of the distribution. Second, weighted sums without the scaling up resulted in values which were almost but not quite identical to the unweighted sums; the divergences were small particularly for the written descriptions. Thus, the resulting weighted scores could be confusing to the person administering the CAT-N, and the effect of the weighting would be minute.
For these reasons, both picture descriptions are scored by simple addition of the items, like the majority of the subtests. Hence, the optimum scores for the oral and written picture descriptions are 17 and 14, respectively. This has the consequence that the relative contribution of the formal aspects is somewhat greater in the oral descriptions than in the written.
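The arithmetic behind these optimum scores, and the resulting share of form in each subtest, can be checked with a small Python sketch. The constant and function names are hypothetical; the item counts and maximum item scores are those given in the text.

```python
FORM_ITEMS = {"oral": 3, "written": 2}  # number of form items per subtest
FORM_ITEM_MAX = 3                       # each form item is scored 0-3
CONTENT_UNITS = 4                       # content units in both subtests
CONTENT_UNIT_MAX = 2                    # each content unit is scored 0-2


def optimum_score(subtest):
    """Optimum raw score under simple addition of form and content items."""
    form = FORM_ITEMS[subtest] * FORM_ITEM_MAX
    content = CONTENT_UNITS * CONTENT_UNIT_MAX
    return form + content


def form_share(subtest):
    """Proportion of the optimum score that comes from the form items."""
    return FORM_ITEMS[subtest] * FORM_ITEM_MAX / optimum_score(subtest)


if __name__ == "__main__":
    for name in ("oral", "written"):
        print(name, optimum_score(name), form_share(name))
```

Form accounts for 9 of 17 points in the oral description and 6 of 14 points in the written one, which is the imbalance noted in the text.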

Reordering of test items in subtests with abortion rule
The administering of subtests 7-10 and 17 involves an abortion rule. If a test person fails to obtain a positive score for 4 consecutive items (8 items for subtest 17), the subtest should be aborted in order to spare the person the unnecessary strain of being confronted with test items the person is not capable of answering. For this procedure to function in an optimal fashion, the test items need to be sorted in ascending order of difficulty. In the norming procedure, therefore, the PWA were given all test items for these subtests, and the results were then used to sort the items for each subtest according to the actual scores. Subsequently, the norming scores were calculated as if the abortion rule had been in effect, i.e. disregarding for each PWA any scores given following a 4-item stretch of scores of zero. In practice, the effect on the final scores of reordering the subtests was minimal for most of the subtests, but for subtest 10, the procedure resulted in reduced scores for 9 % of the PWA, possibly indicating that the resulting scores of these 8 individuals did not fully reflect their competence, which hence could also be the case for some individuals when the test is used clinically. In the published CAT-N, the test items are ordered according to difficulty.
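The reordering and the retrospective application of the abortion rule can be sketched as follows. This is an illustrative Python version with hypothetical function names; it assumes that item difficulty is ranked by the summed score across participants and that a "failure" is an item score of zero.

```python
def order_by_difficulty(score_matrix):
    """Item indices from easiest to hardest.

    `score_matrix` holds one list of item scores per participant; items
    with a higher summed score are treated as easier (an assumption).
    """
    n_items = len(score_matrix[0])
    totals = [sum(row[i] for row in score_matrix) for i in range(n_items)]
    return sorted(range(n_items), key=lambda i: -totals[i])


def score_with_abortion_rule(item_scores, max_failures=4):
    """Subtest score when scoring stops after `max_failures` consecutive zeros."""
    total = 0
    run_of_zeros = 0
    for score in item_scores:
        total += score
        run_of_zeros = run_of_zeros + 1 if score == 0 else 0
        if run_of_zeros == max_failures:
            break  # any later item scores are disregarded
    return total


if __name__ == "__main__":
    scores = [1, 0, 0, 0, 1, 0, 0, 0, 0, 2]
    print(sum(scores), score_with_abortion_rule(scores))
```

For subtest 17, the same function would be called with `max_failures=8`.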

Comparing Word fluency and Object naming
Persons with executive difficulties may experience difficulties in word fluency, even if they do not have aphasia (Amunts et al., 2020, 2021; Swinburn et al., 2021, p. 10).

Cutoff scores
We used the control group to establish a cutoff score for normal language performance (Ivanova & Hallowell, 2013, p. 906), using the same procedure for both subtests and domains. A cutoff score must be a compromise between sensitivity and specificity, i.e. a balance between false negatives and false positives. In line with the choice of Swinburn et al. (2004, p. 101), we defined the cutoff score as the highest score which includes at least 95 % of the control group of people without aphasia. Since 80 of 84 is 95.2 %, a cutoff score thus defined will yield 4 (or fewer) false positives from the control group sample for the CAT-N. The number of false negatives will vary between subtests and between domains (see Table 13 and Figure 2).
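The cutoff definition can be sketched in Python as below. The sketch assumes that a control person counts as "included" when their score is at or above the cutoff; the function name `cutoff_score` is hypothetical and the example scores are invented.

```python
def cutoff_score(control_scores, coverage=0.95):
    """Highest observed score c such that at least `coverage` of the
    control group scores at or above c (assumed reading of the definition)."""
    n = len(control_scores)
    if n == 0:
        raise ValueError("control group is empty")
    for candidate in sorted(set(control_scores), reverse=True):
        included = sum(1 for s in control_scores if s >= candidate)
        if included / n >= coverage:
            return candidate
    return min(control_scores)  # not reached: the minimum includes everyone


if __name__ == "__main__":
    # Invented example: 84 controls, four of whom score below the rest.
    controls = [10] * 80 + [3, 4, 5, 6]
    cutoff = cutoff_score(controls)
    false_positives = sum(1 for s in controls if s < cutoff)
    print(cutoff, false_positives)
```

With 84 untied control scores, the function returns the score of the person ranked fifth from the bottom, leaving 4 controls below the cutoff.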

Domain means
We calculated an overall mean T-score for the PWA as the mean of the T-scores for the 8 language domains. Swinburn et al. (2004, p. 46) caution that such mean values may not be reliable if more than 2 missing values are involved; of the 84 PWA in the normative sample who completed the test set in the CAT-N, 4 had missing values for 1 domain, and 1 had missing values for 2 domains. These were included in the calculation of the overall mean T-scores. None had more than 2 missing values.
The domain means are decimal numbers and determining a specific cutoff value for these is hence less straightforward than for the integer values of the individual domains. Any value between the 4th and the 5th individual in the control group would include "at least 95 %" of the control group. We decided to prioritise sensitivity over specificity and chose as the cutoff the value of the person of rank 5 in the CG, which also follows literally the definition of the highest value including at least 95 % of the CG.
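A minimal Python sketch of the overall mean with missing domains is given below. Returning no value when more than 2 domains are missing is an illustrative design choice based on the caution cited above, not a rule stated for the CAT-N; the function name is hypothetical.

```python
def overall_mean_t(domain_t_scores, max_missing=2):
    """Mean T-score over the language domains, ignoring missing values.

    Missing domains are given as None. If more than `max_missing`
    domains are missing, None is returned (an assumption made here).
    """
    present = [t for t in domain_t_scores if t is not None]
    missing = len(domain_t_scores) - len(present)
    if missing > max_missing or not present:
        return None
    return sum(present) / len(present)


if __name__ == "__main__":
    print(overall_mean_t([50, 60, None, 40, 50, 50, 50, 50]))
```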

Descriptive statistics for subtests and domains
Statistics for raw scores and T-scores for subtests and domains are shown in Tables 5-8.
There is a slight negative correlation between age of the PWA and the domain means, r_s ≈ -0.24, 95 % CI [-0.44, -0.04]. There is no difference in domain means between genders, although a possible gender effect might be masked by the interaction between gender and education level. There is, however, no effect of education level on domain means, and hence there is no reason to believe that the differences between the PWA and the control [...]

Relationships between domains
Table 9 shows r_s between all the domains. Correlations are medium to strong between all the linguistic domains, the lowest value of r_s being 0.42 (between Reading comprehension and Repetition). The majority of the pairwise differences between these coefficient values are not relevant, however, as most of the dispersion between them lies within 95 % confidence intervals (not shown here). Note the high correlations between Written picture description and Writing (0.81), between Spoken picture description and Naming (0.75), between the two Comprehension domains (0.80) and between the two Reading domains (0.76). These high correlation coefficients correspond to the results from the factor analysis shown below.
Also interesting are the fairly strong positive correlations between all of the 8 linguistic domains. Cronbach's alpha for the 8 linguistic domains is 0.92, indicating a strong relationship between the different domains in terms of loss of abilities.
Finally, we calculated the mean correlation coefficient r_s between each linguistic subtest and the other linguistic subtests and found that the mean correlation with the rest of the subtests was weakest for Auditory comprehension of stories (0.28), Repetition of complex words (0.35) and Repetition of nonwords (0.35). However, all these subtests consist of a small number of items (4, 3 and 5, respectively) and their scores will thus vary arbitrarily to a greater degree than other subtests. Of the subtests with a more reliable number of items, Copying has the lowest mean correlation with the other subtests (0.40); as shown below, Copying also is the subtest with the lowest internal consistency.

Reliability
Reliability concerns the consistency, stability and accuracy of a test (Ivanova & Hallowell, 2013, p. 907). This section discusses the internal consistency, test-retest reliability, inter-rater reliability and intra-rater reliability of the CAT-N.

Internal consistency
Table 10 indicates the internal consistency of the subtests in the language battery, showing values of Cronbach's alpha for each subtest, based on the raw scores of each test item. The only subtest showing "unacceptably" low internal consistency is Copying (α ≈ 0.59). All the other subtests have "sufficiently" high values, α ≈ 0.70 or above, most of them with values interpreted by Cohen (1992) as "highly" or "very highly" reliable. In general, one expects lower consistency values in subtests comprising few items (Ivanova & Hallowell, 2013, p. 907), and Table 10 demonstrates that several of the subtests with α < 0.80 have 5 or fewer items. Table 10 further indicates the internal consistency of the domains, showing values of Cronbach's alpha for each domain, based on the T-scores of the subtests.
As shown, the Memory domain displays an "unacceptably low" degree of internal consistency, indicating that the Memory domain does not measure one consistent entity, but rather two quite different capacities. The correlation between the two subtests is only r_s ≈ 0.24, 95 % CI [0.024, 0.43], indicating at most a weak correlation between them. By contrast, Swinburn et al. (2004, p. 115) found that these two subtests in the CAT-EN clustered "closely" in a hierarchical cluster analysis.
All the other domains display coefficients in the vicinity of the range interpreted as "highly reliable" by Cohen (1992). This is an indication that the correlation between the subtests within each domain is substantial enough for the accumulation of the scores into domain scores to be meaningful. At the same time, the fact that the internal correlation is not perfect indicates that the various subtests do tap into slightly different capacities, as they should, although differences may also be caused by random variation, especially for subtests with few test items. Among the subtest pairs with the lowest correlation are Naming of objects and actions, r_s ≈ 0.51, Repetition of nonwords and sentences, r_s ≈ 0.47, and Auditory comprehension of words and stories, r_s ≈ 0.46. On the other hand, some individual subtests correlate very strongly with their domain, indicating that these subtests alone could function as simplified domains; examples are Auditory comprehension of sentences, r_s ≈ 0.97, Reading comprehension of sentences, r_s ≈ 0.96, Reading of words, r_s ≈ 0.96, and Word fluency, r_s ≈ 0.93. A large number of items and high dispersion are two (partially related) causes of such strong correlations between a single subtest and its domain; also, a small number of subtests in the domain will tend to cause strong correlation with (at least one of) the subtests.
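For readers unfamiliar with the coefficient, the following minimal sketch shows the standard formula for Cronbach's alpha, α = k/(k−1) · (1 − Σ item variances / variance of the sum score). For a subtest, the columns would be raw item scores; for a domain, they would be subtest T-scores. The function name and the example data are hypothetical and do not reproduce the CAT-N analysis.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a persons-by-items score table.

    scores: list of rows, one row per person, one column per item
    (or per subtest, when computing alpha for a domain).
    Sample variances (n - 1 denominator) are used throughout; alpha is
    unaffected by this choice as long as it is applied consistently.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_columns = list(zip(*scores))
    item_variance_sum = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical item scores for five persons on a three-item subtest.
example = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 5, 4],
    [3, 3, 4],
]
alpha = cronbach_alpha(example)
```

With few items, a single inconsistent item changes the ratio of item variance to total variance considerably, which is one way to see why short subtests tend to yield lower alpha values.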

Test-retest reliability
Collecting data for test-retest reliability analysis proved too demanding within the project, and we have not carried out any such analysis for the CAT-N. Most of the battery has very similar characteristics to the CAT for English, and we refer the reader to Swinburn et al. (2004, pp. 108-109) and Howard et al. (2010, pp. 66-67).

Inter-rater reliability
Swinburn et al. (2004) tested inter-rater reliability (IRR) for the CAT-EN and found "excellent agreement for almost all of the tests" (p. 111). We decided to carry out IRR analyses for the two Picture description subtests, since these have a design for scoring not previously employed (see above) and perhaps rely on a more subjective scoring than the other subtests. All entries were scored by an SLT, and three other persons (two SLTs and one linguist) scored 20 entries each from each group. Values of Krippendorff's alpha (Table 11) show that inter-rater reliability is acceptable (Krippendorff, 2004, p. 241) for both groups and both subtests, although weaker for the spoken descriptions by PWA than for the other three group/subtest combinations.
Investigating the individual test items, alpha values exceed 0.86 for all written test items for both PWA and the control group, whereas values for the oral test items display more dispersion, between 0.49 (The child alerts) and 0.96 (The books fall) for PWA, and between 0.60 (The child alerts) and 0.94 (The books fall) for the control group. Additionally, alpha is less than 0.667 for Fluency (PWA) and for Grammatical complexity (CG), indicating less than acceptable reliability for these items (see above). It is worth noting that even though these individual items seem intrinsically difficult to score, the scores of the entire subtests appear to be reliable.
For the rest of the test battery, we refer to Swinburn et al. (2004, pp. 108-109).

Intra-rater reliability
We evaluated intra-rater reliability for the two picture descriptions, both for PWA and for the CG. Three raters each rated a different sample of n = 20 twice, with an interval of four months between the two ratings. The reliability was calculated as the mean ratio of correspondence for the three raters. According to Mackenzie et al. (2007) and Nicholas and Brookshire (1995), a ratio of correspondence above 0.8 is considered acceptable. Table 12 shows the mean ratios of correspondence.
As with inter-rater reliability, the correspondence is weaker for PWA than for the control group, especially for the spoken subtest, again indicating that these items are more difficult to score consistently. Looking at the individual test items (not shown here), all items have mean ratios of 0.8 or higher, indicating acceptable correspondence.
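As an illustration of the measure used above, the sketch below computes a ratio of correspondence per rater and averages it over raters. It assumes that the ratio is the proportion of entries given an identical score on both rating occasions; this definition, the function names and the example ratings are assumptions for illustration and are not taken from the CAT-N scoring data.

```python
def correspondence_ratio(first_rating, second_rating):
    """Proportion of entries scored identically on two occasions.

    Assumption: 'correspondence' means exact agreement between the
    first and the second score given by the same rater.
    """
    if not first_rating or len(first_rating) != len(second_rating):
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(first_rating, second_rating))
    return matches / len(first_rating)


def mean_correspondence(ratings_per_rater):
    """Mean ratio over raters; each element is (first, second) ratings."""
    ratios = [correspondence_ratio(first, second)
              for first, second in ratings_per_rater]
    return sum(ratios) / len(ratios)


# Hypothetical ratings for two raters and four entries each.
rater_1 = ([1, 2, 3, 4], [1, 2, 0, 4])   # 3 of 4 entries agree
rater_2 = ([2, 2, 1, 0], [2, 2, 1, 0])   # all entries agree
mean_ratio = mean_correspondence([rater_1, rater_2])
is_acceptable = mean_ratio > 0.8          # threshold cited in the text
```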

Validity
The validity of a test concerns the extent to which it measures what it is intended to measure and hence whether the test should be used as a foundation on which to base conclusions (Ivanova & Hallowell, 2013, p. 908). This section focuses on the sensitivity and specificity of the CAT-N, its concurrent validity and its construct validity.

Sensitivity and specificity
We have investigated how well the CAT-N discriminates between PWA and persons without known aphasia through comparison of PWA and CG, domain means and domain cutoffs.

Comparison of PWA and control group
We compared subtest T-scores and domain T-scores between the PWA and the control group. Table 13 shows the results of the comparisons and the effect sizes. Since multiple significance tests have been carried out, the p-values should be compared to α adjusted for the family-wise error rate (FWER): α_27 = 0.0019 for the subtests and α_9 = 0.0068 for the domains, corresponding to α_1 = 0.05 without correction.
Table 13 shows that the p-value for Line bisection is just over the FWER-adjusted α_27, and only 17 % of PWA fall below the 95 % cutoff for this subtest. All other subtests and all domains distinguish well between PWA and the CG. Moreover, all linguistic domains demonstrate very strong effects: r_pb between 0.73 and 0.87. In terms of sensitivity, the cutoff for each linguistic domain correctly identifies at least 74 % of PWA; 6 of them identify 85 % or more. The Word fluency subtest alone has a sensitivity of 95 % (with specificity set at 95 %). This suggests that the Word fluency subtest could function as a quick, rough indicator of aphasia on its own.

Domain means
The cutoff value for the domain means (65.75) yields the maximum sensitivity (100 % of the PWA in the sample) and an acceptable specificity (95.2 % of the control group sample, by definition). Swinburn et al. (2004, p. 120) suggest using the cutoff values of the individual linguistic domains as the discrimination criterion: more than 1 of 8 domain T-scores below the cutoff value of the domain indicates aphasia. In the CAT-N, this procedure correctly identifies 95.2 % (80 of 84) of persons without aphasia and 98.8 % (83 of 84) of PWA. In addition, this simpler procedure obviates the need for mean values or other calculations which may lead to mistakes. Interestingly, the two procedures do not yield the same false positives from the CG, indicating that accuracy could potentially be improved by combining procedures. However, with no available method for validation, such combinations of methods may result in overfitting and prove without merit when applied clinically or more generally to a new sample.
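The domain-count criterion and the resulting percentages can be made concrete with a short sketch. The counts (83 of 84 PWA, 80 of 84 controls) are those reported above; the function names and the example T-scores and cutoffs are hypothetical and do not reproduce the actual CAT-N cutoff values.

```python
def indicates_aphasia(domain_t_scores, domain_cutoffs, max_below=1):
    """Apply the criterion: more than `max_below` domain T-scores
    below their respective cutoffs indicates aphasia."""
    below = sum(t < c for t, c in zip(domain_t_scores, domain_cutoffs))
    return below > max_below


def sensitivity(true_positives, n_with_condition):
    """Proportion of persons with aphasia correctly identified."""
    return true_positives / n_with_condition


def specificity(true_negatives, n_without_condition):
    """Proportion of persons without aphasia correctly identified."""
    return true_negatives / n_without_condition


# Counts reported in the text for the domain-count procedure.
sens = sensitivity(83, 84)
spec = specificity(80, 84)
print(f"sensitivity: {100 * sens:.1f} %")   # 98.8 %
print(f"specificity: {100 * spec:.1f} %")   # 95.2 %

# Hypothetical person: two of eight domain T-scores fall below
# (hypothetical) cutoffs of 45, so the criterion indicates aphasia.
hypothetical_cutoffs = [45] * 8
hypothetical_scores = [50, 52, 44, 48, 43, 55, 60, 47]
flagged = indicates_aphasia(hypothetical_scores, hypothetical_cutoffs)
```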

Concurrent validity
A subsample of n = 34 of the PWA in the normative sample were also tested with the Norwegian Basic Aphasia Assessment (NBAA; Reinvang & Engvik, 1980; see also Sundet & Engvik, 1985), and we calculated the correlation between the relevant domains of the CAT-N and subtests of the NBAA, and between the domain mean of the CAT-N and the overall aphasia coefficient of the NBAA (Table 14). Not all PWA completed all subtests of the NBAA, resulting in variation in sample size (n). We compared the CAT scores for the subsample of 34 PWA with the remaining 50 in the full sample and found no differences. The subsample can therefore be considered representative of the full sample. All correlation coefficients are positive, as expected, the weakest correlation being for auditory comprehension (0.41). The two tests differ in number and types of test items and methods, especially for auditory comprehension, which may explain the fairly weak correlation for this domain. All the other domains have strong correlations, but the wide confidence intervals demonstrate the level of uncertainty; for Writing, the lower end of the confidence interval is as low as 0.11.

Construct validity
The construct validity of a test concerns the extent to which it is consistent with the underlying theoretical understanding of its object of study. An important aspect of the construct validity of the CAT-N is the construction of its domains, which we attempted to validate by performing an exploratory factor analysis on the T-scores from the 22 linguistic subtests. The minimal adequate model explains 74 % of the variance and has 6 factors as opposed to the 8 pre-defined domains of the test. Table 15 shows factor loadings for the 22 variables on the 6 factors. Stevens (2009, p. 333) recommends considering only variables with factor loadings about 0.4 or greater for interpretation purposes. We chose a somewhat higher threshold in order to minimise the number of subtests contributing to more than one factor; only 3 subtest/factor combinations were affected by this. The salient subtests (> 0.475) are highlighted in the table, showing that 5 of the subtests belong to 2 domains, but that the great majority of subtests uniquely contribute to one domain only, given the chosen threshold.
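The effect of the salience threshold can be illustrated with a minimal sketch that selects, for each subtest, the factors on which its loading exceeds 0.475, and lists the subtests that load saliently on more than one factor. The loadings below are hypothetical and are not the values in Table 15; judging salience on the absolute value of the loading is also an assumption.

```python
SALIENCE_THRESHOLD = 0.475  # threshold reported in the text


def salient_factors(loadings, threshold=SALIENCE_THRESHOLD):
    """Map each subtest to the 1-based factor numbers with salient loadings.

    loadings: dict mapping subtest name -> list of loadings, one per factor.
    Assumption: salience is judged on the absolute value of the loading.
    """
    return {
        name: [i + 1 for i, value in enumerate(row) if abs(value) > threshold]
        for name, row in loadings.items()
    }


def cross_loading_subtests(salient):
    """Subtests that are salient on more than one factor."""
    return sorted(name for name, factors in salient.items() if len(factors) > 1)


# Hypothetical loadings for three subtests on three factors.
example_loadings = {
    "Subtest A": [0.81, 0.12, 0.05],
    "Subtest B": [0.52, 0.49, 0.10],
    "Subtest C": [0.44, 0.30, 0.20],
}
salient = salient_factors(example_loadings)
shared = cross_loading_subtests(salient)
```

In this toy example, Subtest C would count as salient on the first factor under a threshold of 0.4 but not under 0.475, which shows how raising the threshold reduces the number of subtest/factor combinations.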
With the exception of Action naming clustering with Repetition, all factors in the analysis are conceptually coherent, even though they do not correspond fully to the pre-defined domains. Last, and least important in terms of the contribution of the factor to the explanation of the total variation, Factor 6 consists of the two Repetition tasks involving sequences. We see no conceptually substantial reason to distinguish between different types of Repetition, and as these two tasks are also included in Factor 2, we disregard Factor 6 in the following, even though it is part of the minimal adequate model.
Table 16 shows values of Cronbach's alpha for these 5 alternative domains, none of which corresponds fully to the pre-defined domains. The table demonstrates that the values of Cronbach's α are all 0.87 or above and generally a bit higher than for the pre-defined domains (shown in Table 10); this is hardly surprising, given that the aim of a factor analysis is to find the "best" clustering of variables into factors. Merging the two picture descriptions into Spoken and Written production, respectively, seems conceptually sensible and reduces the number of domains by 2. The fact that Reading comprehension clusters with both (general) Comprehension and Reading distinguishes it from Auditory comprehension and indirectly also from the subtests of Reading related to decoding.
It is important to realise the limitations of a factor analysis. The factor analysis is set to discover a certain number of factors and rotates the multidimensional space with the aim of letting the factors emerge. This is a very powerful tool which tends to exaggerate the salience of the factors and understate the relationships between the variables which are not found to be part of the same factor. All the linguistic subtests correlate positively in the normative sample, although a few of the correlation coefficients are close to zero and their confidence intervals contain zero. Also, some of the subtests which are not found to be part of the same factor correlate strongly, in some cases more strongly than some of the correlations within a factor.

Concluding remarks
We have presented the Norwegian version of the CAT, with emphasis on its psychometric properties. The CAT-N is a general aphasia test which examines linguistic skills broadly, within the cognitive neuropsychological tradition. We have shown that it distinguishes well between PWA and persons without aphasia and we have documented issues of reliability and validity.
As mentioned in the introduction, the development of the CAT-N is part of an effort to develop standardised and comparable aphasia tests for different languages. Numerous factors which are difficult to control for may affect test attributes, test results and test statistics, such as demographic differences in populations, morphological and orthographic differences between languages, and arbitrary differences in sampling of test persons or in the difficulty of subtests or individual test items. As the CAT is now adapted for several languages, cross-linguistic analysis of test properties becomes feasible and will be a natural next step within this co-operative initiative. So far, a comparison of the CAT-N and the Croatian version of the CAT has been carried out (Matić Škorić et al., submitted), finding good general correspondence, but also some differences in individual subtests and one domain. We welcome more contrastive studies and believe transparency and thorough documentation to be key issues in that respect.
We have two observations concerning the current CAT construct. First, the Semantic association subtest (called "Semantic memory" in the CAT-EN) "assesses access to semantic memory" (Swinburn et al., 2004, p. 15), whereas Recognition memory assesses nonverbal, visual recognition memory. In the original CAT-EN, the two correlate (Swinburn et al., 2004, p. 46) and are combined into a common Memory domain. In the CAT-N, this correlation is weak and the internal consistency of the Memory domain is "unacceptably" low. Hence, the analyses from the CAT-N do not confirm a common construct of memory and indicate that Semantic association and Recognition memory are two fairly unrelated capacities.
Second, the factor analysis we carried out to investigate the construct validity of the CAT-N indicates a simpler construct than the 8-domain model of the CAT. It is worth noting that the simpler model comprising only 5 domains would seem to match the traditional view of fundamental linguistic modalities (listening, speaking, reading, writing) better than the original 8-domain model. On the other hand, the inclusion of some of the subtests in more than one domain complicates the model and thereby the construct. Our analyses demonstrate somewhat improved internal consistency for the simpler model, and it would be interesting to assess this model in terms of sensitivity and specificity. The factorial model for the CAT-N has more factors than the factorial models for both the English and the Hungarian versions of the CAT (Swinburn et al., 2004, p. 117; Zakariás & Lukács, 2022, p. 1139), both of which have three. Comparing factorial models of different data is inherently difficult, however, especially as the method used by Zakariás and Lukács deviates from ours in several aspects and delimiting criteria are not documented by Swinburn et al.
Finally, an important aim in writing this article has been to make explicit and unambiguous some of the design decisions we have made in constructing the CAT-N, such as the scoring scheme for Picture descriptions, the reordering of items in the subtests with abortion rules, the transformation of raw scores into T-scores, the calculation of divergence in T-scores between Word fluency and Object naming, and the determination of the cutoff value for the domain means. It is our hope that this documentation will be helpful in future adaptations of the CAT.

Note. Pess = pessimum; Opt = optimum; N = number of persons; Med = median; M = mean; SD = standard deviation; Min = minimum; Max = maximum; C = 95 % cutoff; n<C = number of PWA below cutoff. * There is no theoretical optimum for Word fluency; the value is set to the maximum obtained among PWA.
Hence, considerably lower scores for Word fluency than for Object naming may indicate a need for further evaluation of executive functions. The T-scores of both these subtests extend over almost the entire theoretical range from 25 to 75 (Word fluency: 29-75, Object naming: 27-73), and they are both close to normally distributed (W ≈ 0.992, p ≈ 0.88, γ_1 ≈ 0.036; W ≈ 0.994, p ≈ 0.96, γ_1 ≈ -0.021). We therefore calculated the divergence between the subtests by simple subtraction of the T-scores of Object naming from the T-scores of Word fluency, and subsequently transformed the resulting difference values into T-scores. The difference values were close to normally distributed (W ≈ 0.982, p ≈ 0.29, γ_1 ≈ -0.233) and yielded T-scores very close to normality (W ≈ 0.995, p ≈ 0.99, γ_1 ≈ -0.017).

Figure 2. Profile Diagram for the 9 Domains. Note. The profile diagram indicates optimum and pessimum T-scores for the domains (solid lines), the 95 % cutoff values (dotted line) and the theoretical mean and median of 50 (dashed line). The idea and design of the diagram are taken from Swinburn et al. (2004, p. 102).

Figure 1. The Scoring Sheet for Oral Picture Description, Translated From Norwegian. [Scoring sheet; final row: Sum (Content parameters + form parameters) /17.] Note. The scoring sheet for the Written picture description is identical, apart from the fluency item being excluded.

Table 1. Overview of the 27 Subtests and 9 Domains of the CAT-N.

Table 2. Some Characteristics of the Test Group of People With Aphasia and the Control Group. Note. One PWA became unavailable for testing after subtest 12. N = Sample size.

Table 3. T-scores and Corresponding Percentiles in a Normal Distribution.
many with the optimum as the mode value. Transforming such distributions into truly normal distributions is mathematically impossible. Consequently, T-scores for most subtests are not truly normal, but remain left-skewed and with prominent extreme mode values. Also, the maximum T-scores will not be 75 in such left-skewed distributions, but below 75. Hence, the T-score-to-percentile conversion table above will not be accurate for subtests with heavily skewed distributions. Also, since the skew varies between the subtests, T-scores are not always directly comparable between subtests. Similarly, they are not directly comparable between different language versions of the CAT.

Table 4. Statistics Indicating the Amount of Deviation From the Normal Distribution for the Domain T-scores.

Table 5. Raw Scores for Subtests, Control Group and PWA.

Table 6. Raw Scores for Domains, Control Group and PWA. Note. Pess = pessimum; Opt = optimum; N = number of persons; Med = median; M = mean; SD = standard deviation; Min = minimum; Max = maximum; C = 95 % cutoff; n<C = number of PWA below cutoff. * There is no theoretical optimum for Naming, since the domain contains the subtest for Word fluency; the value is set to the maximum obtained among PWA.

Table 7. T-scores for Subtests, Control Group and PWA.

Table 8. T-scores for Domains, Control Group and PWA.
in distribution of gender and education levels shown in Table 2 have affected the results, e.g. in terms of cutoff values. Five PWA have an L1 other than Norwegian; they do not deviate systematically from the native speakers. The participants were grouped into 4 dialect areas; there were no effects of dialect. There is no effect of time from onset on the domain means.

Table 10. Internal Consistency of the 9 Domains and the 22 Subtests in the Language Battery.

Table 11. Inter-rater Reliability for Picture Descriptions.

Table 12. Intra-rater Reliability for Picture Descriptions.

Table 13. Comparison of T-scores Between the Control Group and PWA. Note. N_c = number of controls; N_a = number of PWA; W = statistic from Wilcoxon test; p = p-value from Wilcoxon test; z = z-value from Wilcoxon test; r_pb = point-biserial correlation coefficient; ppn<C = proportion of PWA below the 95 % cutoff of the control group. Domains are in bold.

Table 14. Correlations Between the CAT-N Domains and Subtests of NBAA.
Note. n = number of persons; r_s = Spearman's correlation coefficient; CI = confidence interval. Comparisons were made on a subsample of n = 34 of the CAT-N normative sample.

Table 15. Factor Loadings Resulting From Exploratory Factor Analysis.
Factor 2 corresponds to the existing Repetition domain, with the exception of Action naming, the inclusion of which we are not able to explain. Factor 3 collects Writing and Written picture description into one common domain of Written production. Factor 4 collects Auditory and Reading comprehension into one common domain of Comprehension. Factor 5 collects (spoken) Naming and Spoken picture description into one common domain of Spoken production. All these 5 factors constitute conceptually coherent entities.

Table 16. Internal Consistency in the 5 Linguistic Domains Resulting From the Factor Analysis.