Improving the measurement properties of the Amyotrophic Lateral Sclerosis Functional Rating Scale-Revised (ALSFRS-R): deriving a valid measurement total for the calculation of change

Abstract
Background: The Amyotrophic Lateral Sclerosis Functional Rating Scale-Revised (ALSFRS-R) total score is a widely used measure of functional status in Amyotrophic Lateral Sclerosis/Motor Neuron Disease (ALS), but recent evidence has raised doubts about its validity. The objective was to examine the measurement properties of the ALSFRS-R, aiming to produce valid measurement from all 12 scale items.
Method: Longitudinal ALSFRS-R data were collected between 2013 and 2020 from 1120 people with ALS recruited from 35 centers, together with other scales in the Trajectories of Outcomes in Neurological Conditions-ALS (TONiC-ALS) study. The ALSFRS-R was analyzed by confirmatory factor analysis (CFA), Rasch Analysis (RA) and Mokken scaling.
Results: No definite factor structure of the ALSFRS-R was confirmed by CFA. RA revealed the raw score total to be invalid even at the ordinal level because of multidimensionality; valid interval level subscale measures could be found for the Bulbar, Fine-Motor and Gross-Motor domains, but the Respiratory domain was only valid at an ordinal level. All four domains resolved into a single valid, interval level measure by using a bifactor RA. The smallest detectable difference was 10.4% of the range of the interval scale.
Conclusion: A total ALSFRS-R ordinal raw score can lead to inferential bias in clinical trial results due to its non-linear nature. On the interval level transformation, a difference of more than 5 points is required before a statistically significant detectable difference can be observed. Transformation to interval level data should be mandatory in clinical trials.


Differences between TONiC patient reported ALSFRS-R and clinician administered Cedarbaum, ENCALS and NEALS versions of ALSFRS-R
The original scale was published by Cedarbaum (1). The European Network to Cure ALS (ENCALS) has published a version (2), as has the Northeast Amyotrophic Lateral Sclerosis Consortium (NEALS) (3).
The TONiC patient reported ALSFRS-R was based on that validated by Montes (4).
All items match the clinician administered versions, usually without change in wording.
Most modifications were minor edits to use lay language.
There were several changes arising from the cognitive debrief. Some addressed the need from patients for clarity of wording. For example, item 11, Orthopnoea (2), 'needs more than 2 pillows': patients criticised this if they were using specialist profiling beds to achieve the effect on posture of using more than 2 pillows, so the wording was altered to two pillows or an equivalent. Item for the Rasch fit statistics (3,4).

2b. Methods of Rasch Analysis
Data from each (sub)scale were tested against the requirements of the Rasch Measurement model (5). Briefly, these requirements include: i) unidimensionality; ii) monotonicity; iii) homogeneity; iv) local independence; and v) group invariance (6,7).
Whichever set of items is to be added together to provide a score, the items should satisfy all these requirements: i) they should measure one thing (domain/construct/trait); ii) the probability of a positive response to an item (or, in the case of polytomous items, of the transition from one response category to the next) should increase with underlying ability, as should the total score (8); iii) the same hierarchical ordering of items should hold for each level (or grouping) of the score (9); iv) items should be conditionally (on the score) independent of one another (10); and v) the response to items across different groups such as age or gender should, conditioned on the total score, be the same, referred to as (the absence of) Differential Item Functioning (DIF) (7).
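The monotonicity requirement can be illustrated with the simplest (dichotomous) form of the Rasch model, in which the probability of a positive response depends only on the difference between person ability and item difficulty. The following is a minimal sketch for illustration only: the function names are hypothetical, and the ALSFRS-R items are polytomous, so the analysis described here would rely on a polytomous extension rather than this simplified form.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a positive response under the dichotomous Rasch model.

    Both arguments are on the same logit scale (illustrative assumption).
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def is_monotone(difficulty: float, abilities: list[float]) -> bool:
    """Check that the response probability rises strictly with ability."""
    ordered = sorted(set(abilities))
    probabilities = [rasch_probability(a, difficulty) for a in ordered]
    return all(lo < hi for lo, hi in zip(probabilities, probabilities[1:]))


if __name__ == "__main__":
    # A person whose ability equals the item difficulty has a 50% chance.
    print(rasch_probability(0.0, 0.0))
    print(is_monotone(0.5, [-2.0, -1.0, 0.0, 1.0, 2.0]))
```

Under the model the curve is monotone by construction; in practice the requirement is checked against the observed data, for example through the ordering of the item categories.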
Each requirement is tested. For unidimensionality, a t-test is used to determine if two separate groups of items deliver significantly different estimates, following the procedure given by Smith (11). The hierarchical ordering of items across the scale is determined through a Chi-Square test of fit based on grouped scores. Monotonicity is evaluated through inspection of the item-category ordering. Conditional item dependence is determined through the correlation of residuals, where pair-wise correlations should not exceed the average residual correlation by more than 0.2 (12). Where pairs of items were identified as locally dependent, they were merged into super items (i.e. identified post hoc from the residual correlations). For item sets with subscale structures, the items can be grouped as subscale testlets, simply adding them together to make one larger item to absorb the local dependency (i.e. an a priori definition) (13). In the RUMM2030 software, this gives a bi-factor equivalent solution retaining a specified proportion of the variance.
This proportion, the "Explained Common Variance (ECV)", is reported, whereby a value less than 0.7 indicates that a multidimensional model is required, a value above 0.9 indicates a unidimensional model, and the grey area in between is undetermined, requiring further evidence (14).
Consequently, a value of the ECV of 0.9 and above is considered acceptable in the current analysis. If two parallel forms are created from either a subscale structure, if present, or from the pattern of local dependency in the item set, this requires a latent correlation ≥ 0.9. This is consistent with the reliability required for individual use (15).
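The residual-correlation rule for local dependency can be sketched as follows. This is an illustrative example only, not the RUMM2030 procedure: the function names and the toy residuals are hypothetical, and it assumes that per-person item residuals have already been exported from the fitted model.

```python
from itertools import combinations
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Plain Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def flag_local_dependency(residuals: dict[str, list[float]],
                          margin: float = 0.2) -> list[tuple[str, str]]:
    """Return item pairs whose residual correlation exceeds the
    average pairwise residual correlation by more than `margin`."""
    correlations = {
        (i, j): pearson(residuals[i], residuals[j])
        for i, j in combinations(sorted(residuals), 2)
    }
    average = sum(correlations.values()) / len(correlations)
    return [pair for pair, r in correlations.items() if r > average + margin]


if __name__ == "__main__":
    # Toy residuals for four persons (hypothetical values).
    toy = {
        "walking": [1.0, -1.0, 1.0, -1.0],
        "stairs": [1.0, -1.0, 1.0, -1.0],
        "speech": [1.0, 1.0, -1.0, -1.0],
    }
    print(flag_local_dependency(toy))
```

Pairs flagged in this way would be candidates for merging into a super item, as described above.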
Consequently, valid parallel forms would require both their latent correlation to be ≥0.9 and the ECV to be ≥0.9.
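The decision rules for the ECV and for parallel forms can be written down compactly. The sketch below is illustrative only; the function names are hypothetical, and a value of exactly 0.9 is treated as acceptable, following the working criterion stated for the current analysis.

```python
def classify_ecv(ecv: float) -> str:
    """Classify an Explained Common Variance value.

    Below 0.7: a multidimensional model is indicated.
    0.9 and above: treated as unidimensional (working criterion here).
    In between: undetermined, requiring further evidence.
    """
    if not 0.0 <= ecv <= 1.0:
        raise ValueError("ECV is a proportion and must lie in [0, 1]")
    if ecv < 0.7:
        return "multidimensional"
    if ecv >= 0.9:
        return "unidimensional"
    return "undetermined"


def parallel_forms_valid(ecv: float, latent_correlation: float) -> bool:
    """Both the ECV and the latent correlation must reach 0.9."""
    return ecv >= 0.9 and latent_correlation >= 0.9


if __name__ == "__main__":
    # Hypothetical values, not results from the study.
    print(classify_ecv(0.80))
    print(parallel_forms_valid(0.92, 0.95))
```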
The scales were also tested for invariance (differential item functioning, DIF); DIF occurs when subjects on the same level of the latent trait, such as disability, answer the same item differently depending on their group memberships (e.g. age, gender) (16). DIF was examined for a series of contextual factors including age, gender, time-point (repeated measurement), duration of ALS (grouped into quartiles) and onset type (limb, bulbar or respiratory). Should DIF be identified, it is tested by a comparison of person estimates from split and unsplit solutions to see if it is 'substantive' (17). Where the difference is significant (a paired t-test), the result is reported as an effect size, where a value higher than 0.1 is considered to represent substantive DIF (18). If this is present, then the scale works in different ways for the contextual factor under consideration, and results are reported separately. Finally, reliability is reported as both a Person Separation Index (PSI) and Cronbach's alpha. If data are normally distributed they are equivalent, but otherwise PSI tends to be lower the more the data are skewed. Both are interpreted in the same way, so values below 0.7 are described as low, as they do not support group use.
A hierarchical approach to seeking fit of the data to the model for existing scales is adopted, with level 1 as the priority (Supplementary File: Table 1). All aspects listed above must be met. Should a level 5 solution be unavailable, item deletion will be considered (level 6). If this fails, then level 7 will be utilised to test if the scale satisfies ordinal scaling; if not, level 8 remains, indicating failure.
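Of the two reliability indices, Cronbach's alpha has a simple closed form and can be sketched directly; the PSI depends on person estimates and standard errors from the Rasch software and is not reproduced here. The example below is illustrative only, with hypothetical function names and toy data, and uses the standard alpha formula based on item and total-score variances.

```python
from statistics import variance


def cronbach_alpha(item_scores: list[list[float]]) -> float:
    """Cronbach's alpha from per-item score lists (one entry per person)."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("at least two items are required")
    # Total score per person, summing across items.
    totals = [sum(person) for person in zip(*item_scores)]
    sum_item_variances = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))


def supports_group_use(reliability: float, threshold: float = 0.7) -> bool:
    """Values below the threshold are described as low."""
    return reliability >= threshold


if __name__ == "__main__":
    # Three identical toy items across five persons (hypothetical data).
    toy_items = [[0, 1, 2, 3, 4]] * 3
    alpha = cronbach_alpha(toy_items)
    print(alpha, supports_group_use(alpha))
```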

2c. Detailed Results of Fit to the Rasch Model
The Bulbar, Fine-Motor and Gross-Motor subscales each achieved acceptable fit to the model, with occasional DIF which was not replicated across samples. The Gross-Motor scale displayed local item dependency between two variables ('walking' and 'climbing stairs'), which were made into a single super item. Note that 17% of the variance had to be discarded to make a unidimensional latent estimate of the Gross-Motor subscale. DIF was also present for onset type, but the effect size of the difference between estimates was just 0.019 and so no further action was taken. When the full calibration sample was applied to the total score, 18% of the variance had to be discarded to achieve a unidimensional latent estimate, given model fit. Parameters from any valid solution within the calibration sample were then imported into the main data set to obtain the most accurate estimates for the ALSFRS-R.

Table 1. Strategies seeking fit of the data to the Rasch model.
In the validation sample, 15% of the variance had to be discarded to achieve this solution.
A total score achieved satisfactory fit to the model under a bi-factor solution, based upon two testlets, one containing all the items from the Bulbar and Respiratory subscales, and the other containing the remaining Limb (Fine-Motor and Gross-Motor) domain items.