Measurement properties of the Arm Function in Multiple Sclerosis Questionnaire (AMSQ): a study based on Classical Test Theory.

PURPOSE
The construct validity, test-retest reliability, and measurement error of the Arm Function in Multiple Sclerosis Questionnaire (AMSQ) were examined. Additionally, the influence of administration-method on reliability and measurement error was investigated.


METHOD
A total of 112 Dutch adult patients with MS from an academic and a residential care facility participated. Questionnaires were administered on paper, online or as an interview, and patients performed several performance tests. Construct validity was assessed by testing pre-defined hypotheses. Reliability was assessed using Intraclass Correlation Coefficients (ICCs), Standard Errors of Measurement (SEMs) and Smallest Detectable Changes (SDCs).


RESULTS
For construct validity (N = 105), 9 of 13 hypotheses were confirmed (69%). As expected, the AMSQ showed moderate to strong relationships with the instruments measuring similar constructs. The test-retest reliability coefficient was 0.96 (95% Confidence Interval 0.94-0.97); SEM was 6.3 (6.3% of scale range); SDC was 17.5 (on a scale from 0 to 100). Different administration-methods showed good reliability (ICC 0.88-0.94) and small standard errors (SEM 5.6-7.2).


CONCLUSION
The AMSQ shows satisfying results for validity and excellent reliability, allowing for proper use in research. Due to a large SDC value, caution is needed when using the AMSQ in individual patient care. Further research should determine whether the SDC is smaller than the minimal important change.

Implications for Rehabilitation
The Arm Function in Multiple Sclerosis Questionnaire (AMSQ) measures activity limitations due to hand and arm functioning in patients with Multiple Sclerosis (MS). Results of this study confirm adequate validity and reliability of the AMSQ in patients with MS. The equivalence of scores from online, paper or interview administration is supported. A change score of ≥18 points on the AMSQ (on a scale of 0-100) needs to occur to be certain that a change beyond measurement error has occurred in an individual patient.


Introduction
Multiple Sclerosis (MS) is a progressive disease and is highly associated with increased physical disability. Limitations in hand and arm functioning are present in up to 76% of patients, [1][2][3][4] including patients with low disease severity. [3] Impairments may include tremor, coordination deficit and muscle weakness. [5] Limitations in hand and arm function have significant negative implications for activities of daily living (ADL) (e.g., eating, dressing, grooming), [6] living independently, and quality of life (QoL), [7][8][9] and are associated with high societal costs due to loss of work and high direct costs due to utilization of care. [10] Given their impact, valid assessment of hand and arm functioning is important in both clinical practice and clinical trials, i.e., for the comprehensive evaluation of therapeutic effectiveness and in the development of treatment strategies. [11,12] Patient Reported Outcome Measures (PROMs) capture the patient's perspective on their health condition and are advised when measuring functioning. [13,14] Moreover, along with the recognition of the importance of patient-centered care, PROMs have gained increased importance in the assessment of MS. There are several PROMs for measuring upper extremity function. [15][16][17][18][19] However, no unidimensional disease-specific PROM is available for measuring arm and hand functioning in persons with MS. Therefore, the Arm Function in Multiple Sclerosis Questionnaire (AMSQ) was recently developed [20] to measure activity limitations due to hand and arm functioning in patients with MS. The term "activity limitations" was defined according to the International Classification of Functioning, Disability and Health (ICF) as any difficulties an individual may have in executing activities. [21] The AMSQ was developed on the basis of a literature review, with involvement of experts and patients, and using item response theory (IRT) methods. [20] More details are published elsewhere. [20]
In total, 31 items were included in the questionnaire, constituting a unidimensional scale. All items are formulated as "during the past two weeks, to what extent has MS limited your ability to …". Response categories are "not at all", "a little", "moderately", "quite a lot", "extremely", and "no longer able to". The aim is to develop a computer adaptive test in the future, which requires an item bank containing a sufficient number of items that cover the range of activity limitations due to hand and arm functioning for patients with MS. In the present study, we further investigated the quality of the AMSQ using classical test theory (CTT) methods. The aim of this study was to investigate construct validity, test-retest reliability, and measurement error. In research, different modes of administration are often used, such as self-report or interview. Moreover, a self-report version can be administered either at the clinic or at home, by paper and pencil or online. In general, different modes of administration can influence the scores. Therefore, we subsequently investigated whether the mode of administration influenced the reliability and measurement error of the AMSQ.

Study design and patients
In this test-retest design, a prospective cohort of patients with MS was recruited at the VU University Medical Center (VUmc) in Amsterdam and the residential and facility center for the physically handicapped, Nieuw Unicum (NU) in Zandvoort, both in the Netherlands. A sample of patients was enrolled at the VUmc at regular patient visits at the outpatient clinic or in the context of ongoing clinical research projects. In addition, patients were recruited from advertisements on Dutch websites, i.e., www.msweb.nl and www.msvamsterdam.nl. Patients at NU were recruited by the staff of the center. Inclusion criteria were: self-indicated diagnosis of any type of MS, age above 18 years, and adequate understanding of the Dutch language. Patients with clinically observable severe cognitive impairments were excluded. The study was approved by the Medical Ethical Committee of the VU University Medical Center, Amsterdam, the Netherlands (reference number 2012/296). Data collection was carried out between November 2012 and June 2013.

Procedures
Patients were examined at VUmc or at NU. All patients were asked to sign an informed consent form and to complete a questionnaire containing demographic variables and disease-specific questions (i.e., age, gender, disease duration, and MS type), a self-administered version of the Expanded Disability Status Scale (EDSS), [22] and several other PROMs including the AMSQ (see below), provided online or on paper. Furthermore, all patients were interviewed to assess level of disability due to MS (i.e., the Guy's Neurological Disability Scale, see below) and were asked to perform several performance tests (see below). At NU, the questionnaires were administered as an interview, as most patients were unable to fill out the questionnaires themselves due to hand and arm impairments. At VUmc, patients self-administered the questionnaires.
After two to four weeks we asked patients again to complete the AMSQ, assuming that this time interval was sufficient to minimize recall bias, yet short enough for their hand and arm function to remain unchanged. To check whether patients were stable in the meantime, a Global Perceived Effect (GPE) scale about perceived changes in hand and arm functioning was administered (see below).

Patient reported outcome measures
The AMSQ measures activity limitations due to hand and arm function in patients with MS. All items fitted the graded response model, which is an IRT model, and no differential item functioning (DIF) was found for the variables type of MS, gender, administration version, and test length. IRT-based reliability was 0.95. [20] The items of the AMSQ are provided in Appendix 1. As this study is based on CTT, we calculated sum scores (range 0-100) instead of IRT-based trait level (θ) scores. Higher scores indicate more limitations in hand and arm function. No scores were imputed; only complete cases were used.
The Guy's Neurological Disability Scale (GNDS) is a structured interview assessing level of disability (i.e., activity limitations). [23,24] The GNDS contains 12 functional domains, but we used only the total GNDS score (range 0-60) and the upper limb disability sub score (range 0-5). Each domain contains four to ten dichotomous items. Based on the given answers, domain scores are ascribed on a six-point severity scale, e.g., the upper limb disability scoring ranges from no upper limb problem (score = 0) to unable to use either arm for any purposeful movements (score = 5). No missing values were expected, as the scale was administered as an interview.
The Multiple Sclerosis Impact Scale-29 (MSIS-29) measures the impact of MS on daily living, comprising a physical impact scale (MSIS-29 physical) and a psychological impact scale (MSIS-29 psychological). [27,28] All items have a five-point Likert scale ranging from "not at all" to "extremely", with higher scores indicating higher impact (range physical subscale 20-100; range psychological subscale 9-45).
The Multiple Sclerosis Impact Profile (MSIP) measures disability in patients with MS. Only the subscale measuring activities of daily living was utilized. [29] Answering is on a four-point scale, ascending in degree of need for supporting tools and/or help from others, i.e., higher scores indicate more dependence (range 7-28).
GPE scale. [30] At follow-up, patients were asked "How would you rate your hand/arm functioning, compared to two weeks ago?". Response options were: (1) much better than two weeks ago, (2) somewhat better than two weeks ago, (3) about the same as two weeks ago, (4) somewhat worse than two weeks ago, and (5) much worse than two weeks ago. Patients who reported "much better" or "much worse" on the GPE scale were regarded as unstable and were excluded from the reliability analyses.
Missing items on the RAND-36, MSIS-29 and MSIP were imputed as recommended with patient-specific mean values of completed items.
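The two scoring rules described in this section (complete-case sum scores for the AMSQ, and patient-specific mean imputation for the other questionnaires) can be illustrated with a minimal Python sketch. The item coding 0-5 and the linear rescaling to 0-100 are assumptions made for illustration; the text only states that sum scores range from 0 to 100. The function names are hypothetical.

```python
from typing import List, Optional, Sequence

N_ITEMS = 31        # number of AMSQ items, as stated in the text
MAX_ITEM_SCORE = 5  # ASSUMPTION: the six response categories are coded 0..5


def amsq_sum_score(items: Sequence[Optional[int]]) -> Optional[float]:
    """Return a 0-100 sum score, or None when any item is missing."""
    if len(items) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} items, got {len(items)}")
    if any(value is None for value in items):
        return None  # complete cases only: AMSQ scores are never imputed
    raw = sum(items)
    # ASSUMPTION: the raw sum is rescaled linearly to the 0-100 range.
    return 100.0 * raw / (N_ITEMS * MAX_ITEM_SCORE)


def impute_person_mean(items: Sequence[Optional[float]]) -> List[float]:
    """Replace missing items by the mean of the same patient's completed items."""
    completed = [value for value in items if value is not None]
    if not completed:
        raise ValueError("cannot impute: no completed items")
    mean = sum(completed) / len(completed)
    return [mean if value is None else value for value in items]
```

Under these assumptions a patient with one unanswered AMSQ item receives no sum score, whereas one unanswered item on a comparator questionnaire is filled in with that patient's own item mean.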

Performance tests
All administered performance-based tests are designed to measure each hand or arm independently. Scores were obtained for both hands/arms. Because the AMSQ was designed without regard to hand dominance, we averaged the scores of the dominant and non-dominant hand for each performance-based test.
The Action Research Arm Test (ARAT) was used to measure fine and gross motor dexterity. [31] The test consists of five subtests: grasp (6 items), grip (4 items), pinch gross (4 items), pinch fine (4 items), and gross movement (3 items), comprising a total of 19 movements to be performed by the patient. Each movement is scored on a scale from no movement possible (score = 0) to normal movement.
The Nine Hole Peg Test (9-HPT) was used to measure upper extremity function, [32] and involves placing and removing nine pegs in a pegboard. The time to perform the test was measured.
The Coin Rotation Task (CRT) was used to measure fine motor dexterity of the hands, [33] and involves rotating a US five-cent coin as fast as possible using the thumb, index and middle fingers. The time to perform 20 half-turns was measured.
The hand-held JAMAR dynamometer was used to measure isometric grip strength of the hand. [34] The test was performed following the standardized instructions and positioning recommended by the American Society of Hand Therapists. [35]
The Modified Ashworth Scale (MAS) was used for measuring muscle spasticity, [36] by measuring resistance to passive movement about the elbow joint on a 6-point scale from no increase in tone to limb rigid in flexion or extension (range 0-5). MAS scores were dichotomized into no spasticity (i.e., MAS = 0) and spasticity (i.e., MAS ≥ 1).
Except for the ARAT and MAS, all performance tests were administered twice on both hands and the best value was taken as the score for each hand. If a patient could not perform a timed test due to hand and arm impairments, a maximum value of 300 seconds was used (in accordance with the manual of the 9-HPT, and also applied to the other timed tests [37]).
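The scoring rules for the performance tests can be summarized in a short Python sketch. It assumes that "best value" means the shortest time for timed tests and the highest force for grip strength, and that an unperformable trial is recorded as None; both are illustrative assumptions, as are the function names.

```python
from typing import Optional, Sequence

MAX_TIME_S = 300.0  # value assigned when a timed test could not be performed


def best_timed_trial(trials: Sequence[Optional[float]]) -> float:
    """Best score of one hand on a timed test (9-HPT or CRT).

    ASSUMPTION: "best" is the shortest time; None marks a trial the
    patient could not perform, which is scored as 300 seconds.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    return min(MAX_TIME_S if t is None else t for t in trials)


def best_grip_trial(trials: Sequence[float]) -> float:
    """Best score of one hand on the dynamometer (ASSUMPTION: highest force)."""
    return max(trials)


def average_hands(dominant: float, non_dominant: float) -> float:
    """Average the per-hand scores, as done for every performance test."""
    return (dominant + non_dominant) / 2.0


def has_spasticity(mas_score: int) -> bool:
    """Dichotomize the MAS: 0 means no spasticity, 1 or higher means spasticity."""
    if not 0 <= mas_score <= 5:
        raise ValueError("MAS score must lie between 0 and 5")
    return mas_score >= 1
```

For example, a patient who completes the test in 20.0 and 24.0 seconds with one hand and cannot perform it with the other would receive 20.0 and 300.0 seconds per hand, and an averaged score of 160.0 seconds.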

Statistical analyses
We produced descriptive statistics (means, medians, and SDs) for the scores of the measurements, and investigated the frequencies of missing data.

Construct validity
Construct validity was assessed by the degree to which the sum scores of the AMSQ were consistent with predefined hypotheses regarding relationships between the AMSQ and the other measures. We formulated 13 hypotheses, presented in Table 1. Moderate to high correlations were expected between the AMSQ and other PROMs measuring physical functioning (hypotheses 1-6). Low correlations were expected between the AMSQ and other PROMs measuring different constructs (hypotheses 7, 8). Moderate to high correlations were expected between the AMSQ and all performance tests, as they all assess aspects of upper limb functioning. However, a hierarchy in the strength of the linear relationship between the AMSQ and the different performance measures was expected (hypotheses 9-12): the ARAT reflects the same construct as the AMSQ, i.e., "hand/arm functioning", and therefore the strongest correlation coefficient was expected for the ARAT. The 9-HPT, CRT and JAMAR hand strength dynamometer measure narrower constructs than the AMSQ, and therefore lower correlation coefficients were expected. In addition, one hypothesis regarding expected differences in AMSQ mean sum scores between patients with spasm and patients without spasm was defined (hypothesis 13).
Spearman's rho correlations were used for assessing all hypothesized relations between the AMSQ and the PROMs and performance-based measures because scores were non-normally distributed. Correlations were considered low (<0.30), moderate (0.30-0.59) or high (≥0.60). [38] The group comparison (patients with spasm versus patients without spasm) was made with a Mann-Whitney U test with a p cutoff value of 0.05.
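As an illustration of the correlation analysis described above, the following Python sketch computes Spearman's rho as the Pearson correlation of tie-averaged ranks and labels its strength with the cutoffs given in the text. It is not the SPSS procedure used in the study. Classifying on the absolute value of rho is an assumption, made because higher AMSQ scores indicate more limitations while higher scores on some comparators indicate better function.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


def classify_strength(rho: float) -> str:
    """Cutoffs from the text; ASSUMPTION: applied to the absolute value."""
    magnitude = abs(rho)
    if magnitude < 0.30:
        return "low"
    if magnitude < 0.60:
        return "moderate"
    return "high"
```

A hypothesis of the form "moderate to high correlation" would then be counted as confirmed when classify_strength returns "moderate" or "high".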

Reliability
To investigate test-retest reliability of the AMSQ, we calculated a one-way ICC for the whole sample because of the incomplete design [39] (Reliability question 1). The questionnaires were administered online, on paper, or as an interview. Mode of self-administration could vary between baseline and retest measurement (e.g., a patient filled out the baseline questionnaire on paper at the VUmc and completed the retest online at home). For patients at NU, the baseline AMSQ was administered by one of two researchers (L.v.L. or L.M.), and the follow-up administration was performed by one of eight physiotherapists two to four weeks after the initial assessment. In addition, we investigated whether there was a systematic difference between the two measurements due to differences in mode of administration (Reliability question 2), and whether there was a systematic difference between the two measurements due to different observers (Reliability question 3). These two questions were investigated by calculating ICCs from two-way random-effects ANOVA models for agreement, for patients who completed the baseline and retest questionnaires with different modes of administration (i.e., paper at baseline and online at follow-up) and for patients who were interviewed, respectively. An ICC value of 0.70 in a sample of 50 patients was recommended as a minimum standard for reliability. [40]

Measurement error
The measurement error was determined by calculating the standard error of measurement (SEM), [41] i.e., the square root of the error variance from the ICC formula. In addition, measurement error was expressed as the smallest detectable change (SDC). The SDC represents the minimal change that a patient must show on the scale to ensure (with 95% confidence) that the observed change is real and not just measurement error. The SDC was calculated at the 95% confidence level by multiplying the SEM by 1.96 and by the square root of 2. [30]

All statistical analyses were performed using the Statistical Package for the Social Sciences (SPSS) version 20.0 (Chicago, IL).
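The one-way ICC, SEM and SDC described above can be sketched in Python for paired test-retest scores. This is an illustrative reimplementation using the standard one-way ANOVA mean squares, not the SPSS output reported in the study; the function name and the example data in the usage note are hypothetical.

```python
from math import sqrt
from typing import Sequence, Tuple


def one_way_icc_sem_sdc(
    pairs: Sequence[Tuple[float, float]]
) -> Tuple[float, float, float]:
    """Return (ICC, SEM, SDC) for baseline/retest score pairs.

    One-way random-effects model with k = 2 measurements per patient:
    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW),
    SEM = sqrt(error variance) = sqrt(MSW),
    SDC = 1.96 * sqrt(2) * SEM.
    """
    n, k = len(pairs), 2
    if n < 2:
        raise ValueError("at least two patients are required")
    grand_mean = sum(a + b for a, b in pairs) / (n * k)
    patient_means = [(a + b) / k for a, b in pairs]
    # Between-patient mean square.
    ms_between = k * sum((m - grand_mean) ** 2 for m in patient_means) / (n - 1)
    # Within-patient mean square, i.e., the error variance of the one-way model.
    ms_within = sum(
        (a - m) ** 2 + (b - m) ** 2 for (a, b), m in zip(pairs, patient_means)
    ) / (n * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    icc = (ms_between - ms_within) / denominator
    sem = sqrt(ms_within)
    sdc = 1.96 * sqrt(2) * sem
    return icc, sem, sdc
```

Calling one_way_icc_sem_sdc([(10, 12), (20, 18), (30, 33), (40, 39)]) on these made-up pairs returns the three statistics in the order ICC, SEM, SDC.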

Patient characteristics and response rate
Patient characteristics are presented in Table 2. In total, 112 patients with MS participated, of whom 77 were recruited at the VUmc and 35 at NU. All subjects from NU were residential. Sum scores on the AMSQ were only calculated for patients who completed all 31 items (94%). Five patients had one missing item and two patients had two or three missing items. The total of missing items of the AMSQ was less than 1%. The mean sum score of the AMSQ was 27.8 (SD = 31.8) and the median was 12.9 (range 0-100). For the validity analyses we used the scores of the 105 patients with complete cases on the AMSQ. With regard to the follow-up measurement, 14 of the 77 patients who used self-administration did not complete the follow-up questionnaire (response rate of 82%), and four patients did not fully complete the AMSQ. All 35 patients who were interviewed completed the follow-up measurement, but three patients had missing items on the AMSQ. Taken together, 91 patients were eligible for reliability analyses.

Table 1 hypotheses (fragment):
Somewhat lower correlations were expected between the AMSQ and 9-HPT, because in contrast to the ARAT, the 9-HPT focuses more specifically on hand function and finger dexterity while the AMSQ assesses the whole upper extremity.
11. It was expected that the AMSQ would show somewhat lower correlations with the CRT than with the 9-HPT, because the CRT focuses even more specifically on hand and finger dexterity than the 9-HPT.

Table footnotes (fragment):
… had no scores, because they could not perform the test due to practical problems.
f 13 and 16 patients had a value of 300 seconds on the dominant and non-dominant hand, respectively, because they could not perform the test due to hand and arm impairments.
g 26 and 28 patients had a value of 300 seconds on the dominant and non-dominant hand, respectively, because they could not perform the test due to hand and arm impairments.

Construct validity
The correlation coefficients between the AMSQ mean sum score and the mean values of other measures are presented in Table 1.
In summary, 9 out of the 13 predefined hypotheses were confirmed (69.2%). As expected, moderate to high correlation coefficients were found between the AMSQ and (sub)scales measuring physical functioning (hypotheses 1-6) and low correlation coefficients were found between the AMSQ and (sub)scales measuring dissimilar constructs (hypotheses 7 and 8). The correlations between the AMSQ and all performance-based hand and arm function tests (hypotheses 9-12) were moderate to high, as expected. However, the hierarchical order of the three comparator tests was not as expected: the ARAT showed a lower correlation coefficient with the AMSQ than the other performance tests. In line with expectations, we found a significant difference between patients with spasticity (N = 30) and patients without spasticity (N = 71) on the AMSQ sum score (U = 557.5; p < 0.05).

Reliability
Five patients reported "much better" or "much worse" on the GPE and were excluded from the analyses. Of the remaining sample (N = 86), 55 patients self-administered the questionnaires and 31 patients were interviewed at baseline and follow-up. Of the patients who self-administered the questionnaires, 43 used different modes of administration for the baseline and follow-up measurements (i.e., paper at baseline and online at follow-up), and 12 used the same mode (i.e., both administrations were online). None of the patients self-administered the questionnaire twice on paper. The results addressing the three research questions concerning reliability are presented in Table 3.

Discussion
The AMSQ is a newly developed tool to measure activity limitations due to hand and arm functioning in patients with MS. The first quality assessment by IRT methods showed good results, and in this second study we evaluated the psychometric properties of the AMSQ using traditional CTT methods. We assessed construct validity, reliability and measurement error. The validity analysis showed satisfying results for construct validity of the AMSQ (69.2% of hypotheses confirmed). As expected, the AMSQ showed moderate to strong linear relationships with PROMs measuring similar constructs and with the performance-based tests. Hypotheses 3, 4, 10 and 11 were not confirmed. Surprisingly, a stronger relationship was found between the AMSQ and the RAND-36 physical subscale (which was developed for the general population) than between the AMSQ and the MSIS-29 physical subscale, which, like the AMSQ, was specifically developed for patients with MS (i.e., hypotheses 3 and 4). The items of the AMSQ and RAND-36 all ask only about limitations in performing activities, while 7 out of 20 items of the MSIS-29 physical subscale ask about limitations due to specific causes, such as balance, clumsiness, stiffness, tremor, and spasms. It is therefore likely that the construct measured by the MSIS-29 differs from the constructs measured by the AMSQ and the RAND-36. Furthermore, the correlations between the AMSQ and the ARAT in our study were not as expected (i.e., hypotheses 10 and 11). Originally, the ARAT was developed to assess upper extremity function following cortical injury. Although the ARAT has been shown to be valid for measuring upper extremity functioning in patients with MS, [42] the test did not seem suitable for measuring hand and arm functioning in the present study sample.
One explanation might be that most patients obtained close to the best possible score (i.e., 66% had an average score >55) despite reporting dexterity difficulties, and that in more severely disabled patients (EDSS ≥8) the test could not be administered due to practical problems (22%), e.g., wheelchair-dependent patients could not reach all parts of the ARAT box. Similar results and remarks regarding validity and ceiling effects on the ARAT in MS patients have been reported in other studies.
The high reliability coefficients and low measurement error support the value of the AMSQ in clinical trials with relatively small sample sizes, regardless of different modes of administration both between and within patients. The reliability coefficients determined for different modes of administration provided good support for the equivalence of scores from online, paper or interview administration. This corresponds to the findings of our previous study, [20] which showed no DIF on any of the 31 items of the AMSQ for mode of administration. In clinical practice, a SEM of (approximately) 6 points means that when a patient obtains a score of, for example, 28 points, the true score is likely to lie somewhere between 22 and 34 points. The SDC was 17.5 points (on a scale of 0 to 100). This means that when an individual patient is measured over time, a change score of at least 18 points needs to occur in order to conclude (with 95% certainty) that a change beyond measurement error has occurred in that patient. The SDC for self-report was smaller than the SDC for interview (15.6 vs. 20.0). This could be taken into account when using the instrument, for example in a randomized controlled trial. It is also possible that the measurement error is larger for patients with a lower level of functioning, who were the patients that were interviewed. For the interpretation of change scores of PROMs, results on both the SDC and the minimal important change (MIC; i.e., the smallest change in score which patients perceive as important) are needed. An instrument is useful in clinical practice if the SDC is smaller than the MIC. A next step therefore is to determine whether the SDC is sufficiently small, i.e., smaller than the MIC.
Note that when using the instrument in a study to measure change in a group of patients, the measurement error of the mean change score is much lower (SDC/√n). [41]
Since the AMSQ was developed using IRT methods, trait level (θ) scores can be obtained. A considerable advantage of theta scores is their ability to handle missing data. The analyses were repeated using trait level (θ) scores. The correlation between the sum scores and theta scores was 0.92. Similar results were obtained for construct validity as well as for reliability analyses (data not shown).

Table abbreviations: ICC: intraclass correlation coefficient; CI: confidence interval; SEM: standard error of measurement; SDC: smallest detectable change; NA: not applicable; σ²p: the variance of the patients (i.e., the systematic differences between the "true" scores of the patients); σ²o: the variance due to systematic differences between the measurements/observers; σ²residual: the random error variance; d̄: the systematic difference.
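The arithmetic behind the interpretation of the SEM and SDC in this discussion can be checked with a few lines of Python. The sketch uses the reported SEM of 6.3 and the rounded SEM of 6 from the worked example; the group size used in the usage note is hypothetical. The band of one SEM around an observed score corresponds to roughly 68% confidence if measurement errors are normally distributed, which is an assumption not stated in the text.

```python
from math import ceil, sqrt
from typing import Tuple


def sdc_individual(sem: float) -> float:
    """Smallest detectable change for one patient: 1.96 * sqrt(2) * SEM."""
    return 1.96 * sqrt(2) * sem


def sdc_group(sem: float, n: int) -> float:
    """Smallest detectable change of a mean change score: SDC / sqrt(n)."""
    if n < 1:
        raise ValueError("group size must be at least 1")
    return sdc_individual(sem) / sqrt(n)


def sem_band(score: float, sem: float) -> Tuple[float, float]:
    """Observed score plus and minus one SEM."""
    return (score - sem, score + sem)


reported_sdc = round(sdc_individual(6.3), 1)   # rounds to the reported 17.5
whole_points = ceil(sdc_individual(6.3))       # 18 whole points on the 0-100 scale
example_band = sem_band(28, 6)                 # (22, 34), as in the worked example
```

With a hypothetical group of 50 patients, sdc_group(6.3, 50) gives the detectable change of the group mean, which is about seven times smaller than the individual-level value.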

Study strengths and limitations
According to international guidelines, a minimum of 50 patients is considered adequate for assessing measurement properties. [43] We included 102 patients for validity and 86 patients for reliability analyses and thus amply met this criterion. Furthermore, we had little missing data on the AMSQ and all other questionnaires. Therefore, we do not expect that the missing data have led to bias. Some limitations of the present study need to be addressed. Patients self-indicated their diagnosis of MS. Because of ethical and regulatory restrictions, the diagnosis of MS was not confirmed by accessing medical records. This might limit the generalizability of the results. However, given that the study sample largely consisted of patients who either visited the academic hospital (VUmc) for treatment or were admitted to a residential care facility specialized in care for MS patients (NU), the vast majority of patients were known to us to have the correct diagnosis (MS). We are therefore confident that the self-indicated diagnosis was valid and that this has not led to bias. Regarding the performance-based tests, missing data mostly arose from practical problems with the ARAT, which could have led to bias. Furthermore, we averaged the scores of the dominant and non-dominant hands on the performance-based tests. This could have affected the representativeness of the correlations found. However, the outcome did not change when the scores of the dominant and the non-dominant hand were analyzed separately (data not shown). Another limitation was the loss to follow-up for the second questionnaire (13%). However, the patient characteristics of non-responders did not differ from those of responders, which makes it unlikely that this affected the results. Furthermore, we used a GPE to define stable patients for reliability assessment.
Such a measure has several limitations, [44][45][46] including questionable validity, recall bias and influence of current status. Although we cannot rule out recall bias, we believe clinically important changes were unlikely to occur in two to four weeks. Moreover, the negligible systematic differences that were found might be an indication that biological variance or recall bias had no influence. Unfortunately one item of the MSIP was inadvertently not included in this study. This might have led to bias regarding the correlation coefficient for the relation between the scores on the AMSQ and MSIP.

Conclusion and practical implication
The results of this study show satisfying results for validity and excellent results for reliability in a sample of Dutch patients with MS. This second evaluation of measurement properties supports that the AMSQ is an adequate scale for measuring arm and hand functioning in patients with MS in clinical research. Further research will determine whether the same results apply to translations into other languages. An English version is currently under investigation in an Irish population, [47] and another study is ongoing for translating and validating the German version of the AMSQ. [48]