Validated repeatability of patient-reported outcome measures following primary total hip replacement: a mode of delivery comparison study with randomized sequencing

Background and purpose — Patient-reported outcome measures (PROMs) are used to understand better the outcomes after total hip replacement (THR). These are administered in different settings using a variety of methods. We investigated whether the mode of delivery of commonly used PROMs affects the reported scores, 1 year after THR. Patients and methods — A prospective test–retest mode comparison study with randomized sequence was done in 66 patients who had undergone primary THR. PROMs were administered by 4 modes: self-administration, face-to-face interview, telephone interview, and postal questionnaire. PROMs included: Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), Oxford Hip Score (OHS), EQ5D-3L (EQ5D), and Self-Administered Patient Satisfaction Scale (SAPS). Linear regression was used to estimate relationships between the mean scores for PROMs by mode. Individual paired differences by mode were calculated, relationships between modes were identified, and results adjusted by time delay and participant age. Results — There was no statistically significant difference between the mean PROM scores recorded for each mode of delivery for each score. Statistically significant differences in the individual paired differences were detected between modes for the WOMAC stiffness subscale, OHS, EQ5D, and SAPS. OHS difference in individual paired means between face-to-face and telephone interview exceeded the minimal clinically important difference. Interpretation — PROMs mode of administration can affect the recorded results. Modes should not be mixed and may not be comparable between studies. It should not be assumed that different modes will obtain the same results and where not already established this should be checked by researchers before use.

There are various metrics used to judge the success or failure of total hip replacement (THR). Hard endpoints such as revision of the THR and mortality are popular as they are easy to define, but such outcomes fail to take account of the degree of relief of symptoms experienced by the patient, i.e., soft endpoints (Wylde and Blom 2011). To better understand the outcomes after THR, patient-reported outcome measures (PROMs) have been widely adopted, and typically these PROMs are focused around domains such as pain, function, and stiffness. Their use has become routine and widespread; for example, the UK Department of Health's National PROMs program (http://www.hscic.gov.uk/proms) administers PROMs prior to and 6 months after intervention for procedures such as THR, total knee replacement, hernia repair, and varicose vein surgery.
PROMs questionnaires are administered and completed in a number of different settings using a variety of methods (Tourangeau et al. 2000). Common modes of delivery include paper based, face-to-face, telephone, and computer delivered, with responses being self-recorded or assisted by a third party. With the evolution of technology, the boundaries between modes of delivery are now becoming blurred. In addition, preoperative and postoperative assessments are frequently performed using different modes of administration, and in many research studies a mixture of modes is used to ensure data completeness (Dillman et al. 2009).
When modes of delivery are mixed, it is important to understand whether the mode of delivery of the questionnaire affects the psychometric properties of the score (Honaker 1988). If different modes of delivery result in scores that are not equivalent, then these modes should not be mixed in a single study design. Factors that may be associated with the magnitude of difference include context, content, and the population studied (Hood et al. 2012). Systematic review of mode comparisons has shown that modes are vulnerable to bias when comparison is made between an interviewer being involved and self-completion (Hood et al. 2012). In a large population study, telephone administration yielded more positive health-related quality of life estimates than self-administration (Hanmer et al. 2007). However, multiple item scales are less prone to bias and differences between modes have ameliorated as technologies such as telephone- and computer-based completion have become commonplace (Gwaltney et al. 2008).
We investigated whether the mode of questionnaire delivery influences test scores in commonly used PROMs in primary THR.
Study design - Patients were invited to participate in a test-retest study of 4 PROMs using 4 modes of delivery, 1 year following THR, using a randomized crossover design.

Patients and methods
A prospective mode comparison cohort study was conducted in a single NHS tertiary orthopedic center. In order to ensure that patients had reached a steady state in terms of their outcome following surgery, patients who were 1 year following THR were invited to participate (Lenguerrand et al. 2016).
Patients were eligible for inclusion in the study if they had undergone primary THR for any indication 1 year previously. Exclusion criteria were patients who had undergone revision THR, patients who were unwilling or unable to provide informed consent, and patients who were unable to understand or complete questionnaires in English (Study flow chart, Figure).
Patients were recruited to the study by written invitation sent 1 week before their outpatient clinic appointment. Recruitment occurred between June 2014 and October 2015. 66 patients, who indicated they wished to participate, were consented and randomized to the order in which they would receive the questionnaires. Participants were asked to complete a set of 4 questionnaires: the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC); the EQ5D-3L health questionnaire (EQ5D); the Self-Administered Patient Satisfaction Scale (SAPS); and the Oxford Hip Score (OHS). 4 modes of questionnaire completion were used: self-administered in clinic; face-to-face interview in clinic; telephone interview; and postal questionnaire. For full details of PROMs questionnaires, questions, and structure of questionnaires please see Appendix.
The sets of PROMs were delivered by 4 modes: self-administered in clinic and face-to-face interviewer-led, both completed during the outpatient clinic appointment; and later via telephone interview and self-administered by post. Participants completing PROMs self-administered in clinic were asked to complete the set of questionnaires using pen and paper, without assistance. During the face-to-face interview, each question in the set was asked by a member of the research team and the questionnaires were completed by the researcher based on the verbal responses of the participant.

Randomization
The participants were randomized to the sequence in which mode of completion was done (Table 1). Participants were randomized in permuted blocks to 1 of 4 groups with 3 times block size.
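Permuted-block allocation of this kind can be sketched as follows. This is a minimal illustration, not the study's actual randomization procedure: the group labels are hypothetical stand-ins for the sequences in Table 1, and the block size of 12 (each of the 4 groups repeated 3 times) is only one possible reading of "3 times block size".

```python
import random

# Hypothetical labels for the 4 sequence groups; the real sequences are in Table 1.
GROUPS = ["A", "B", "C", "D"]


def permuted_block_allocation(n_participants, groups=GROUPS,
                              repeats_per_block=3, seed=None):
    """Allocate participants to groups using permuted blocks.

    Each block contains every group `repeats_per_block` times in shuffled
    order, so group sizes are balanced at the end of every complete block.
    The block size (4 x 3 = 12) is an assumption, not taken from the source.
    """
    rng = random.Random(seed)
    allocation = []
    while len(allocation) < n_participants:
        block = list(groups) * repeats_per_block
        rng.shuffle(block)
        allocation.extend(block)
    # The final block may be only partly used.
    return allocation[:n_participants]


# 66 participants, as in the study; the seed is arbitrary.
allocation = permuted_block_allocation(66, seed=2014)
```

With a block size of 12, the first 60 allocations form 5 complete blocks and are exactly balanced across groups; the remaining 6 come from a partly used block and may be slightly unbalanced.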

Sample size
A sample size of 60 was calculated, based on the OHS. The minimal clinically important difference (MCID) for the OHS is 5 (Beard et al. 2015). Therefore, for 80% power and a 2-sided 5% significance, a sample size of 52 is required to allow the study to detect if there is a significant difference between the scores by different modes. This was rounded up to 60 participants to account for loss to follow-up.
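The standard deviation and the design assumed in this calculation are not stated, so the figure of 52 cannot be reproduced here. As a rough sketch of how such a calculation is usually set up, the normal-approximation formula for comparing two independent groups of equal size is shown below. The function name and the standard deviation in the example are illustrative assumptions, not values from the study.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-group comparison.

    delta: smallest difference worth detecting (e.g. the MCID)
    sd:    assumed standard deviation of the score (hypothetical input)
    Uses the normal approximation n = 2 * ((z_alpha + z_beta) * sd / delta)^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Illustrative only: MCID of 5 points with an assumed SD of 10 points.
example_n = n_per_group(delta=5, sd=10)
```

The required sample size scales with the square of sd / delta, so the result is very sensitive to the assumed standard deviation.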

Missing values
Missing values were identified before statistical analysis. For calculation of the PROM scores, missing values were dealt with according to the user guides for each score. For WOMAC, if 2 or more pain, both stiffness, or 4 or more physical function items were omitted, the participants' responses were deemed invalid and the deficient subscale was not used for analysis. According to OHS guidance, if more than 2 questions were unanswered, a score was not calculated. If 1 or 2 questions were unanswered, the mean value of other responses was substituted.

Figure. Study flowchart.
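The missing-value rules for the OHS and WOMAC described in this section can be sketched as below. The function names are hypothetical, and the item structure is an assumption based on the standard instruments rather than on the text: 12 OHS items scored 0 to 4 (total 0 to 48), and WOMAC subscales of 5 pain, 2 stiffness, and 17 function items. Unanswered items are represented as `None`.

```python
# Maximum number of missing items tolerated per WOMAC subscale, read from the
# rule above: invalid if >= 2 pain, both stiffness, or >= 4 function items missing.
WOMAC_MAX_MISSING = {"pain": 1, "stiffness": 1, "function": 3}


def womac_subscale_valid(subscale, responses):
    """Return True if a WOMAC subscale has few enough missing items to be used."""
    missing = sum(r is None for r in responses)
    return missing <= WOMAC_MAX_MISSING[subscale]


def ohs_total(responses):
    """Total OHS (assumed 12 items scored 0-4), or None if not calculable.

    More than 2 unanswered items: no score is calculated.
    1 or 2 unanswered items: each is replaced by the mean of the answered items.
    """
    if len(responses) != 12:
        raise ValueError("OHS is assumed to have 12 items")
    answered = [r for r in responses if r is not None]
    missing = len(responses) - len(answered)
    if missing > 2:
        return None
    mean_answered = sum(answered) / len(answered)
    return sum(answered) + missing * mean_answered


complete = ohs_total([4] * 12)                   # all items answered
two_missing = ohs_total([4] * 10 + [None] * 2)   # imputed from answered items
three_missing = ohs_total([4] * 9 + [None] * 3)  # too many missing: None
```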

Analysis
There are 2 possible methods to analyze mode comparison studies. The first is to assume independence on each occasion a questionnaire is completed. This analytical standpoint is typically described as investigation of between-population differences. In this case, linear regression was used to estimate the between-population mean difference of each questionnaire by mode of delivery. The second method of analysis investigates within-individual (paired) differences of each questionnaire by mode of delivery. In this case, the paired difference between each mode was calculated and linear regression was used to estimate any within-individual differences between modes of delivery. Within-patient analyses were further adjusted for the time between completion of questionnaires (time delay) and age of the participant, i.e., < 70 or ≥ 70 years old. P-values are reported without adjustment for multiple testing (Perneger 1998).

Results
66 participants (median age 69 (IQR 62-77), 33 male) consented to participate (Figure).

Missing data
Missing data were observed for each of the WOMAC, OHS, and EQ5D measures. There were no incomplete SAPS questionnaires. More missing data were observed for the OHS than for any of the other scores (Table 2, see Supplementary data). This resulted in the exclusion of 5 OHS questionnaires, 3 of which were when the questionnaire was self-administered in clinic. In comparison, the WOMAC score was insufficiently completed for the inclusion of all 3 subscales in 3 cases and 1 EQ5D questionnaire was excluded. Scores were left unanswered most frequently when self-administered in clinic (29 of 2,970 data points).
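The denominator of 2,970 is consistent with 66 participants each answering 45 items per mode. The short check below makes that arithmetic explicit. The item counts are assumptions taken from the standard instruments (24 WOMAC, 12 OHS, 5 EQ5D-3L dimensions excluding the visual analogue scale, 4 SAPS) and are not stated in the text.

```python
# Assumed item counts per questionnaire (standard instruments, not from the text).
ITEMS_PER_QUESTIONNAIRE = {"WOMAC": 24, "OHS": 12, "EQ5D": 5, "SAPS": 4}

participants = 66
items_per_set = sum(ITEMS_PER_QUESTIONNAIRE.values())  # 45 items per set
data_points_per_mode = participants * items_per_set    # 2970 data points
missing_self_admin = 29                                # reported in the text

missing_percent = 100 * missing_self_admin / data_points_per_mode
print(f"{missing_percent:.2f}% of items missing")      # about 0.98%
```

On this reading, just under 1% of items were left unanswered in the mode with the most missing data.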

Between-population mode differences
There were no statistically significant differences among mean scores by mode of delivery for any PROM investigated (Table 3, see Supplementary data).

Timing between administration
There was no time delay between administrations 1 and 2, as these were undertaken on the same day, before and after an outpatient appointment. The median time between administrations 2 and 3 was 7 (IQR 3-8) days, and between administrations 3 and 4 it was 6 (IQR 3-7) days. The median time between administrations 1 and 4 was 14 (IQR 8-17) days.

Within-individual (paired) mode differences

WOMAC subscales
When the WOMAC subscales were considered, there was a difference in the individual paired differences observed for the stiffness subscale between the modes of delivery (Table 4). This persisted when adjustment was made for age and the time delay between modes of delivery. Young patients were more likely to give a higher (worse) score when the score was completed in clinic by face-to-face interview or self-administered than when the score was completed by postal or telephone modes. The WOMAC function subscale revealed a similar pattern with higher (worse) scores given when the score was completed in clinic by face-to-face interview or self-administered and when the form was self-administered and delivered by post compared with telephone interview, but this difference disappeared when adjusting for the time delay between modes and age.
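The within-individual comparison rests on paired differences: for each participant, the score from one mode is subtracted from the score from another. An unadjusted regression of these differences with only an intercept reduces to their mean, which can be sketched as below. The function name and the scores are invented for illustration, and the interval uses a normal approximation rather than the regression model (with time delay and age adjustment) used in the study.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_mean_difference(scores_a, scores_b, level=0.95):
    """Mean within-individual difference (a - b) with an approximate CI.

    Participants with a missing score (None) in either mode are dropped.
    Returns (estimate, ci_low, ci_high, n_pairs).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)
             if a is not None and b is not None]
    n = len(diffs)
    estimate = mean(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate, estimate - z * standard_error, estimate + z * standard_error, n


# Invented scores for 6 participants (not study data); None marks a missing score.
face_to_face = [40, 35, None, 44, 28, 39]
telephone = [42, 36, 41, 47, 30, 39]

estimate, ci_low, ci_high, n_pairs = paired_mean_difference(face_to_face, telephone)
```

Because each participant acts as their own control, this estimate is not affected by differences between participants, which is why it can detect mode effects that a comparison of mode means does not.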

OHS
The individual paired differences between the OHS scores for different modes showed a statistically significant difference between postal and telephone scores when unadjusted and adjusted for time delay. However, this may not be clinically relevant as the difference is below the MCID of 5. When the OHS was adjusted for age and the time delay between modes of delivery, lower (worse) scores were given when the form was completed in clinic by face-to-face interview (-7.35, 95% CI -11 to -4) or self-administered by post (-1.24, 95% CI -2.4 to -0.07) compared with telephone interview completion.
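The distinction drawn here between statistical significance and clinical relevance can be made explicit with the adjusted OHS estimates reported in this section. In the sketch below, a difference counts as statistically significant when its 95% CI excludes 0, and as potentially clinically relevant when its magnitude exceeds the MCID of 5. The function name is hypothetical and the decision rule is a simplification for illustration.

```python
OHS_MCID = 5  # minimal clinically important difference used in this study


def interpret_difference(estimate, ci_low, ci_high, mcid=OHS_MCID):
    """Return (statistically_significant, exceeds_mcid) for a paired difference."""
    statistically_significant = not (ci_low <= 0 <= ci_high)
    exceeds_mcid = abs(estimate) > mcid
    return statistically_significant, exceeds_mcid


# Adjusted estimates relative to telephone interview, as reported in the text.
face_to_face_vs_telephone = interpret_difference(-7.35, -11, -4)
postal_vs_telephone = interpret_difference(-1.24, -2.4, -0.07)
```

Both differences are statistically significant, but only the face-to-face versus telephone difference exceeds the MCID.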

EQ5D
When the EQ5D was considered, no differences in the individual paired differences were seen. However, when adjusted for time delay between questionnaires, higher (better) scores were seen in the telephone and postal groups compared with the other modes of delivery. When adjusting for the age of the respondents, this difference did not persist. When age was categorized into young (< 70) or old (≥ 70), the difference persisted in the older group, suggesting older patients are more likely to report a higher (better) score for the EQ5D when completed by telephone interview or self-administered and delivered by post.

Notes to Table 4: Model 1 = individual paired differences (IPD); Model 2 = IPD + time delay; Model 3 = IPD + time delay + age. F2F: face-to-face in clinic; SI: self-administered in clinic; P: postal; T: telephone interview. Dif: individual paired difference; 95% CI: 95% confidence interval.

SAPS
There was a statistically significant reduction in the individual paired differences for the SAPS between self-administration in clinic and face-to-face interview, and between self-administration in clinic and delivery by post or telephone interview. This difference persisted when adjusting for time delay between modes and age. Each point on the SAPS scale has a value of 6.25%. The difference between groups was roughly 4%, thus the effect was less than 1 point in 1 response, and of unlikely clinical significance.
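The 6.25% step follows from how the SAPS is scored. The sketch below assumes the standard scoring, which is not spelled out in the text: 4 items, each scored 25, 50, 75, or 100, with the scale score being the unweighted mean of the items.

```python
# Assumed SAPS scoring: 4 items, each scored 25 (very dissatisfied) to 100
# (very satisfied); the scale score is the unweighted mean of the 4 items.
SAPS_ITEM_SCORES = (25, 50, 75, 100)


def saps_score(items):
    """Unweighted mean of the item scores."""
    return sum(items) / len(items)


all_very_satisfied = saps_score([100, 100, 100, 100])
one_item_one_level_lower = saps_score([75, 100, 100, 100])

# Effect on the scale score of moving 1 item by 1 response level.
one_point_step = all_very_satisfied - one_item_one_level_lower
```

A between-mode difference of roughly 4% is therefore smaller than the change produced by moving a single item by one response level.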

Discussion
We investigated whether the mode of delivery of commonly used PROMs affected the results reported by patients who had undergone primary total hip replacement. There were no statistically significant between-population differences among mean scores in any PROMs investigated.
We observed more missing data for the OHS than other PROMs. This included 3 OHS questionnaires that had no responses completed. The OHS was the last questionnaire in the series. The higher missing data count in this score may be due to survey fatigue, or perhaps, most simply, that patients omitted the last page of the questionnaire (Porter et al. 2004). Most missing data were seen in the self-administered in clinic group across all PROMs.
In a systematic review of mode of administration of surveys, Bowling (2005) described 13 sources of bias that may affect the results obtained by different modes. That study concluded that non-response is likely to be influenced by mode of administration, with a higher non-response reported in postal than face-to-face and postal than telephone, suggesting that premature termination is less likely in the presence of a motivating interviewer.
This finding is echoed by Wood and McLauchlan (2006) and by Fitzpatrick et al. (2000) in the response rates to the OHS, with highest responses achieved by face-to-face and self-administered questionnaires and lowest with postal responses. Both studies reported that question 6 of the OHS ("In the past 4 weeks, for how long have you been able to walk before pain from your hip becomes severe (with or without a stick)?") was the one most frequently left unanswered.
WOMAC subscales did not reveal any statistically significant difference in mean scores across mode of delivery. When adjusted for time delay and the age of the participant, young patients showed a small propensity to worse scores on the WOMAC stiffness subscale when the score was completed in clinic by face-to-face interview or self-administered. This difference was between 5% and 13%, which is below the MCID of 25% and is therefore not likely to be clinically significant (Quintana et al. 2005). No statistically significant differences were seen in WOMAC pain scales by any mode of delivery. These findings are similar to those of Bellamy et al. (2002), who showed no difference between telephone and onsite administration for the WOMAC knee score and electronic versus paper surveys for patients with hip and knee OA.
The OHS gave a higher (better) score for telephone interviews than for postal questionnaires or face-to-face interviews, with a difference between face-to-face and telephone of 7 points, which is in excess of the MCID and therefore could be clinically significant (Murray et al. 2007). Older participants may give better scores for EQ5D if recorded by post or telephone interview. However, quantifying the magnitude is difficult with the EQ5D index. A small reduction in satisfaction was seen when the SAPS was self-administered in clinic compared with face-to-face interviews, telephone interviews, and postal responses. These findings suggest a small propensity to better score responses in telephone questionnaires than other modes. Telephone interviews may be subject to biases including social desirability bias, yes-saying, and interviewer bias (Bowling 2005). Indeed, it has been shown that satisfaction may be more positive if surveys are presented aurally than visually (Dillman et al. 2009). Health-related quality of life scores have been shown to be consistent when mode of administration was the same but telephone administration of EQ5D yields more positive results than self-administered in an older group with both US and UK weighting (Hanmer et al. 2007). Hays et al. (2009) found that the maximum effect size between postal versus telephone administration of the EQ5D was 0.5. Wood and McLauchlan (2006) found no difference in mode of administration between postal delivery and interview of the OHS at 10-year follow-up after THR. The OHS showed only a small increase in mode effect average, 1.2, in telephone versus postal administration, which was not deemed clinically relevant (Messih et al. 2014). However, in a meta-analysis of mode of administration of PROMs, self-completion and assisted completion produced equivalent scores overall but results were influenced by the setting in which questionnaires were completed (Rutherford et al. 2016).
We found that mode of administration of PROMs 1 year after THR may have small effects on the results obtained. Participants in our study were randomized according to the order of modes; therefore, we do not believe practice effects are likely to explain these differences. Telephone and postal responses were collected following self-administration in clinic and face-to-face interview in clinic. As such, a time delay between these sets was introduced. This had an effect in some cases but was adjusted for within the analysis. Although participants were encouraged to complete the self-administered PROMs themselves, whether they received help or support from friends, family members, or carers was not documented. Interviewer-led PROMs were completed by several different researchers, which may introduce bias to the answers obtained. Comorbidity and the indication for THR were not investigated as part of this study. Our findings may be generalizable to patients who have undergone THR but not those awaiting THR and only apply to English versions of these PROMs.
We did not investigate electronic modes of delivery of these PROMs, such as via mobile phone, hand-held devices in clinic, and automated telephone responses. In an increasingly digital age, these modes of delivery are becoming more common, and their effects are as yet unknown in this population.
The small variations in responses to PROMs delivered by different modes of administration are not likely to have a significant impact on the results in smaller studies. However, in large longitudinal studies where the timing of questionnaires may vary, these small biases may induce statistically significant chance findings. To ensure bias is minimized in studies using PROMs assessment after THR, we recommend that modes of administration are, wherever possible, not mixed in a single study. If multiple modes are used it is important to distinguish between modes of administration, and avoiding mixing self-administered and interviewer-led PROMs may minimize the effects. When outcomes are collected by different modes of administration this should be acknowledged, and care should be used when interpreting results. Whilst using different modes in the same study may be useful in minimizing missing data in clinical studies, it is important to recognize it is not a panacea, and the primary response is still missing.

Supplementary data
Appendix and Tables 2 and 3 are available as supplementary data in the online version of this article, http://dx.doi.org/10.1080/17453674.2018.1521183

MRW: Conception and design of the study. JB and AS: Design and acquisition of the data. JB and JJ: Acquisition of data. CC and AS: Analysis of data. All authors interpreted data and wrote the report. CC, JB, AS, MW contributed equally to this work.

The following people contributed to the paper through the acquisition of data: Christopher Woodrow, Harriet Mitchell, Sophie Stanger, Samantha Dixon, Nicolas Toosi, and Charlotte Howie.