Use of anchoring vignettes to evaluate health reporting behavior amongst adults aged 50 years and above in Africa and Asia – testing assumptions

Background: Comparing self-rating health responses across individuals and cultures is misleading due to different reporting behaviors. Anchoring vignettes is a technique that allows identifying and adjusting self-rating responses for reporting heterogeneity (RH).
Objective: This article aims to test two crucial assumptions of vignette equivalence (VE) and response consistency (RC) that are required to be met before vignettes can be used to adjust self-rating responses for RH.
Design: We used self-ratings, vignettes, and objective measures covering domains of mobility and cognition from the WHO study on global AGEing and adult health, administered to older adults aged 50 years and above from eight low- and middle-income countries in Africa and Asia. For VE, we specified a hierarchical ordered probit (HOPIT) model to test for equality of perceived vignette locations. For RC, we tested for equality of thresholds that are used to rate vignettes with thresholds derived from objective measures and used to rate their own health function.
Results: There was evidence of RH in self-rating responses for difficulty in mobility and cognition. Assumptions of VE and RC between countries were violated driven by age, sex, and education. However, within a country context, assumption of VE was met in some countries (mainly in Africa, except Tanzania) and violated in others (mainly in Asia, except India).
Conclusion: We conclude that violation of assumptions of RC and VE precluded the use of anchoring vignettes to adjust self-rated responses for RH across countries in Asia and Africa.

on an increasing Likert scale from 'poor' to 'excellent' health. Such ordinal responses are analyzed under the assumption of an underlying latent interval scale. These analyses tend to treat one person's rating as equivalent to another's and assume that both understand the response categories in the same way. In other words, we assume that individuals rate themselves using the same cut-off points or thresholds on the latent interval scale, which differentiate the categories 'poor', 'fair', 'good', or 'excellent' on the manifest scale. However, there is a large body of evidence to suggest that individuals or groups of individuals interpret and choose categories in vastly different ways. Two individuals or groups of individuals with identical health levels may rate their own health differently based on their understanding, experience, and expectation of their own health (6). This difference in reporting style or reporting behavior is referred to as response-category differential item functioning (7) or reporting heterogeneity (RH) (8). RH has been seen across sexes (9), socio-economic strata (10), race and ethnicities (11,12), and countries (13–16). Unless recognized, such RH can result in misleading and incorrect interpretations (7,17).
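The effect of differing thresholds can be illustrated with a minimal sketch. The cut-points and the latent health value below are hypothetical numbers on an arbitrary scale, chosen only for illustration and not estimated from any data; the function name `rate` is likewise an assumption.

```python
from bisect import bisect_right

CATEGORIES = ["poor", "fair", "good", "excellent"]


def rate(latent_health, thresholds):
    """Map a latent health value to a category using ascending cut-points."""
    # bisect_right counts how many cut-points lie at or below the latent value,
    # which is the index of the response category.
    return CATEGORIES[bisect_right(thresholds, latent_health)]


# Hypothetical cut-points for two respondents (illustrative only).
lenient_cuts = [-1.0, 0.0, 1.0]
strict_cuts = [-0.5, 0.8, 1.8]

same_health = 0.5  # identical latent health for both respondents
print(rate(same_health, lenient_cuts))  # good
print(rate(same_health, strict_cuts))   # fair
```

Both respondents have the same latent health, yet their manifest ratings differ because their thresholds differ; this is the pattern that RH describes.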
In recent years, 'anchoring vignettes' have been shown to be a promising strategy to overcome the problem of RH in survey questions (14,18). Anchoring vignettes are brief texts describing a hypothetical character who exemplifies a certain fixed level of the trait of interest. The respondent is asked to rate the level of the trait for the vignette character as she/he would for her/his own. The vignette ratings are used to identify RH and then adjust the self-rating response by removing its systematic variation using either a parametric or a nonparametric approach (8, 18–20). The 'anchoring vignettes' method has increasingly been used to improve interpersonal and cross-cultural comparability of survey questions in areas of political efficacy, work disability, job satisfaction, life satisfaction, health, and health system responsiveness (7, 8, 21–26).
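One nonparametric adjustment discussed in the anchoring vignettes literature recodes each self-rating relative to the respondent's own vignette ratings. The sketch below is an illustration under a simplifying assumption, namely that the respondent rates the vignettes in strictly increasing order with no ties or reversals; it is not necessarily the procedure used in the studies cited above, and the function name `relative_rank` is hypothetical.

```python
def relative_rank(self_rating, vignette_ratings):
    """Recode a self-rating relative to a respondent's own vignette ratings.

    vignette_ratings must be strictly increasing (simplifying assumption).
    With J vignettes the result runs from 1 (below the mildest vignette)
    to 2J + 1 (above the most severe vignette).
    """
    pairs = zip(vignette_ratings, vignette_ratings[1:])
    if any(a >= b for a, b in pairs):
        raise ValueError("vignette ratings must be strictly increasing")
    rank = 1
    for z in vignette_ratings:
        if self_rating < z:
            return rank       # strictly below this vignette
        if self_rating == z:
            return rank + 1   # tied with this vignette
        rank += 2             # above this vignette, move to the next one
    return rank


# Two hypothetical respondents who both report 'moderate' (3) difficulty
# but rate the same two vignettes differently.
print(relative_rank(3, [2, 4]))  # 3: between the two vignettes
print(relative_rank(3, [3, 5]))  # 2: tied with the milder vignette
```

The identical raw self-ratings map to different positions once each respondent's own use of the scale is taken into account.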
The anchoring vignettes approach requires two fundamental assumptions to be met: vignette equivalence (VE), that is, all respondents understand the health state described by a vignette in the same way; and response consistency (RC), that is, a respondent uses the same thresholds to rate vignettes as she/he does to rate his/her own self. The VE assumption allows for the identification of RH, if any, while the assumption of RC is necessary for adjusting self-rating responses for RH. Violation of either assumption precludes the use of anchoring vignettes to correct self-rating responses for RH. Initial studies have used informal checks to assess inconsistencies in rank ordering of vignette severity or less stringent non-parametric methods such as testing for systematic difference in vignette rankings to evaluate these assumptions (19,26). Analytic methods are now developed to allow a more rigorous evaluation of measurement assumptions using parametric methods (20, 27–29).
The World Health Organization (WHO) study on global AGEing and adult health (SAGE) conducted at eight surveillance sites of the International Network for the Demographic Evaluations of Populations and their Health (INDEPTH) Network aims to compile comprehensive longitudinal data on the health and well-being of older adult and elderly populations across different low- and middle-income countries (30). In this article, we use the SAGE data on self-ratings and vignettes in mobility and cognition to test the assumptions of VE and RC that are essential for the use of the anchoring vignettes approach.

Ethics statement
The Ethics Review Committee of the WHO, Geneva and respective Ethics Committees of the participating Health and Demographic Surveillance System (HDSS) sites of the INDEPTH Network approved the WHO SAGE. All respondents participated in the study after providing written informed consent.

SAGE data
SAGE adapted and built on the methods and instruments developed by the WHO for the World Health Survey that was conducted in 2002–03 in 70 countries. The SAGE questionnaire was pre-tested in 2005 amongst 1,500 respondents in India, Ghana, and Tanzania. The WHO's collaboration with the INDEPTH Network supported eight HDSS sites in Africa (Navrongo, Ghana; Nairobi, Kenya; Agincourt, South Africa; Ifakara, Tanzania) and Asia (Matlab, Bangladesh; Vadu, India; Purworejo, Indonesia; Filabavi, Vietnam) to implement an adapted summary version of SAGE (31). Three of these sites (Navrongo, Agincourt, and Vadu) also implemented the full version of SAGE in a smaller subset of their populations. All sites represent predominantly rural populations except the urban slum site of Nairobi, Kenya. The cognitive ability of respondents to understand terms and concepts such as self-rating and vignette rating was ascertained at the start of the interview. Show cards were provided to aid respondents in their rating responses on the five-point Likert scale. Proxy respondents who knew the respondent well enough were identified and interviewed on behalf of the respondents with impaired ability to respond. A subset of respondents was re-tested for data quality assurance.
The summary version of SAGE included two self-rating questions on difficulty in functional ability in each of the eight domains (mobility, cognition, affect, self-care, vision, pain, sleep, and interpersonal relationships). These data were enhanced by linking with socio-demographic characteristics (age, sex, marital status, socioeconomic status (SES), family size, etc.) from each of the HDSS.

Vignettes data
The vignettes were administered as part of the summary version of SAGE by all sites, except for Navrongo and Agincourt, which administered vignettes only as part of the fuller version of SAGE. Each domain included two self-rating questions (one for a lower and another for a higher level of functional ability) followed by five vignettes adapted from the WHO World Health Survey describing varying levels of severity of limitation of function (Appendix 1). The names of the hypothetical persons in the vignettes were chosen to match the sex of the respondent and to be culturally appropriate. Respondents were advised to think of the hypothetical person's experience in the vignette as if it were their own. The vignette rating questions were identical to the two self-rating questions, replacing 'self' with the name of the hypothetical person in the vignette. Vignettes were paired into four domain sets (mobility and affect; pain and relationships; sleep and vision; and self-care and cognition). The selected respondents were randomly allocated to four groups and one set of paired domain vignettes was administered to each group. The vignettes in a set were administered in no particular order of domain or severity. Respondents assessed their own functional ability and that of the hypothetical persons in the vignettes on a five-point ordinal scale of increasing difficulty (no difficulty, mild, moderate, severe, and extreme difficulty).

Objective health measures
The fuller version of SAGE, in addition to the summary version, included some objective measures. Mobility was assessed by the time taken to walk four meters at normal and rapid speed. Handgrip strength (kg) was measured separately for both hands using Smedley's hand dynamometer. Cognition measures included immediate and delayed word recall, forward and backward digit span test, and verbal fluency. The average of the number of correct words recalled (where sequence did not matter) from a list of 10 words from 3 trials was taken as the score for the word recall test (maximum possible score 10). The length of the longest series of digits recalled by a respondent in the correct sequence was taken as the score for the forward and backward digit span test (maximum possible score 9). The number of animals listed by the respondent in 1 minute was taken as the score for the verbal fluency test. Each cognition test measure was rescaled from 0 to 1, with the higher score indicating higher cognition.
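The cognition scoring rules above can be sketched as follows. The text does not state how the 0 to 1 rescaling was done, so the linear rescaling between a minimum and a maximum used here is an assumption, as are the function names; for verbal fluency, which has no stated maximum, an analyst would have to supply one (for example the largest value observed in the sample).

```python
from statistics import mean


def word_recall_score(correct_per_trial):
    """Average number of correctly recalled words over the 3 trials (max 10)."""
    if len(correct_per_trial) != 3:
        raise ValueError("expected counts from exactly 3 trials")
    return mean(correct_per_trial)


def rescale_unit(score, low, high):
    """Linearly rescale a score to [0, 1]; higher means better cognition.

    Assumption: the source only says scores were 'rescaled from 0 to 1';
    the exact formula is not given.
    """
    if high <= low:
        raise ValueError("high must exceed low")
    return (score - low) / (high - low)


# Hypothetical respondent
recall = word_recall_score([4, 6, 8])   # mean of three trials
print(rescale_unit(recall, 0, 10))      # word recall, maximum possible 10
print(rescale_unit(5, 0, 9))            # digit span, maximum possible 9
```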
Sites implemented the summary version of SAGE either amongst all eligible older adults aged 50 years and above or on a random sample. Furthermore, the fuller version of SAGE was implemented in a smaller random subset of 500 adults aged 50 years and above at the Navrongo, Agincourt, and Vadu sites. For this article, we focus our analysis on the two self-ratings of mobility (difficulty in moving around, difficulty in performing vigorous activity) and cognition (difficulty in memory, difficulty in learning) because the objective measures needed to test the vignette assumptions were available for these domains.

Statistical methods – testing assumptions
Consistency of orderings of the five vignettes was checked using the 'ANCHORS' package in R statistical programming language (32). Hierarchical ordered probit (HOPIT) models for testing VE and RC assumptions were developed in STATA. The VE test tested that there was no systematic variation in the perceived difference in the states described by any two vignettes. This was based on the observation that the perceived location (on the latent scale) of vignettes would be constant if VE held. We specified a HOPIT model for $V^{*}_{ij}$, the perceived location of vignette j by respondent i. To achieve model identification, we constrained the location of vignette severity level 5 to zero and estimated the locations of the other vignettes relative to the reference vignette. We included interaction terms between each vignette and covariate (e.g. between first vignette and age groups) and tested for all parameters of the vignette–covariate interactions (Wald's test) to be equal to zero (global test for VE) (27). We also tested for individual covariate and vignettes interaction parameters to be equal to zero to determine which covariates influenced VE. We also assessed VE by a visual comparison across sites of the predicted locations of the vignettes stratified by site.
Testing for RC required information on objective measures in addition to vignettes data. Such objective measures were presumed to capture all the co-variation between the latent construct of interest and the observable characteristics that may influence RH. If so, then any systematic variation that was seen in self-assessment that remained after conditioning on these objective measures could be attributed to RH. We were only able to test for RC in Navrongo, Agincourt, and Vadu as objective measures needed to test the assumption were only available for these three sites. To test the assumption of RC, we compared the locations of response category thresholds estimated from vignette ratings with the threshold locations estimated from objective measures. To do this, we specified three HOPIT models: model 1 specified the perceived location of the vignette; model 2 specified the perceived location of the latent self-rating from all objective measures; model 3 was a special case of model 1 (vignettes) and 2 (objective measures) combined where the response category thresholds were identical.
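The vignette component described above can be written out as a sketch. The symbols $\alpha_j$, $\lambda_j$, $\gamma^{k}$, $\tau_i^{k}$ and the standard normal error are notation assumed here for illustration; the exact parameterization used in the analysis (for instance, how the ordering of thresholds was enforced) is not stated in the text.

```latex
% Perceived location of vignette j by respondent i, with covariates X_i
V^{*}_{ij} = \alpha_j + X_i'\lambda_j + \varepsilon_{ij},
\qquad \varepsilon_{ij} \sim N(0, 1), \qquad j = 1, \dots, 5

% Observed five-category rating, with respondent-specific thresholds
V_{ij} = k \iff \tau_i^{k-1} \le V^{*}_{ij} < \tau_i^{k},
\qquad \tau_i^{0} = -\infty, \quad \tau_i^{5} = +\infty, \quad
\tau_i^{k} = X_i'\gamma^{k} \ (k = 1, \dots, 4)

% Identification: the severity level 5 vignette is the reference
\alpha_5 = 0, \qquad \lambda_5 = 0

% Global test of vignette equivalence (Wald test)
H_0 : \lambda_1 = \lambda_2 = \lambda_3 = \lambda_4 = 0
```

Under $H_0$ the perceived vignette locations do not depend on respondent characteristics, so any covariate effects on the ratings operate only through the thresholds $\tau_i^{k}$.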
We then used likelihood ratio (LR) test to determine if model 3 was significantly different from models 1 and 2 together, for all covariates (global test for RC) and for each individual covariate to determine which covariate influenced RC violation. We also assessed RC across sites by a visual comparison of the thresholds predicted by the vignettes model and those predicted by the objective measure model.
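The likelihood ratio comparison can be sketched as below. The log-likelihood values and the degrees of freedom are hypothetical placeholders, and treating the unrestricted log-likelihood as the sum of the log-likelihoods of models 1 and 2 is an assumption made for this illustration. The chi-square tail probability uses a closed form that holds for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        total = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * total
    # Odd df: start from df = 1 and add the recurrence terms.
    tail = math.erfc(math.sqrt(half))
    for j in range(1, (df - 1) // 2 + 1):
        tail += math.exp(-half) * half ** (j - 0.5) / math.gamma(j + 0.5)
    return tail


def lr_test(loglik_restricted, loglik_unrestricted, df):
    """Return the LR statistic and its chi-square p-value."""
    stat = 2.0 * (loglik_unrestricted - loglik_restricted)
    return stat, chi2_sf(stat, df)


# Hypothetical log-likelihoods (not results from the study)
ll_model1 = -1200.0   # vignettes, own thresholds
ll_model2 = -350.0    # objective measures, own thresholds
ll_model3 = -1565.0   # combined, thresholds constrained to be identical
stat, p_value = lr_test(ll_model3, ll_model1 + ll_model2, df=12)
print(stat, p_value)
```

A small p-value would indicate that constraining the thresholds to be identical worsens the fit, that is, evidence against RC.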
For all HOPIT models, we normalized the location parameters by excluding the intercept and also allowed response category thresholds to vary by sex, age, and education (27). All model parameters were estimated by maximum likelihood.
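How covariate-dependent thresholds translate into response probabilities can be sketched as follows. The linear form of the thresholds, the coefficient values, and the covariate coding are all hypothetical assumptions for illustration; the location is passed in directly, without an intercept, in line with the normalization described above.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def thresholds(x, gammas):
    """Cut-points as linear functions of covariates x (assumed form)."""
    return [sum(xi * gi for xi, gi in zip(x, g)) for g in gammas]


def category_probs(location, cuts):
    """Ordered probit probabilities of each response category."""
    bounds = [-math.inf] + list(cuts) + [math.inf]
    return [norm_cdf(hi - location) - norm_cdf(lo - location)
            for lo, hi in zip(bounds, bounds[1:])]


# Hypothetical coefficients; columns: constant, female, aged 70+, no schooling
gammas = [
    [-1.0, 0.2, 0.1, 0.0],
    [-0.3, 0.2, 0.1, 0.0],
    [0.4, 0.2, 0.1, 0.0],
    [1.2, 0.2, 0.1, 0.0],
]
x = [1, 1, 0, 0]              # a woman under 70 with some schooling
cuts = thresholds(x, gammas)  # four cut-points for five categories
probs = category_probs(0.5, cuts)
print(cuts)
print(probs)
```

Because the cut-points shift with the covariates, two respondents with the same latent location can have different probabilities of choosing each category.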

Results
The eight sites together had an estimated population of 107,900 individuals aged 50 years and above under demographic surveillance. Of the 38,793 individuals who participated in SAGE, 36,170 (93%) were administered vignettes in the different domains: 9,375 for mobility and affect; 8,788 for self-care and cognition; 9,205 for pain and relationships; 8,802 for vision and sleep. The Kenya site administered vignettes to a random sample of 781 out of 1,991 respondents, whereas vignettes could not be administered to 29 respondents in the Indonesia site. Self-rating responses were missing for less than 1%. About 4 and 7% of respondents were not administered the timed walk and the grip strength test, respectively, while the cognitive tests could not be administered in less than 1% of the respondents. The VE assumption was tested on 9,375 and 8,788 individuals who responded to the mobility and cognition vignettes, respectively. The RC assumption was tested on the subset of 293 and 373 individuals who were administered the objective measures of mobility and cognition, respectively. Table 1 describes the socio-demographic profile of the participants across the sites. The overall mean age of men was 63.5 years and that of women was 64.1 years. Participants from Kenya, Tanzania, Bangladesh, and Vietnam were significantly younger when compared to those from Ghana, while there was no significant difference in age between participants from South Africa, India, Indonesia, and Ghana. Overall, 47% of participants were men (range: 32% in South Africa to 65% in Kenya). Overall, 39% of participants had no education or less than primary education; more than 90% in Ghana, South Africa, and India and only 10% in Vietnam. Overall, 13% of participants (about 11% in African sites, about 4% in India and Indonesia, and 25 and 29% in Vietnam and Bangladesh, respectively) rated their own health as bad or very bad.
There were no clear patterns in self-ratings for difficulty in functional ability in any of the domains across sites though it appeared that overall Bangladesh reported higher difficulties compared with other sites. The Asian sites (except Bangladesh) reported significantly lower difficulty in moving around compared to the African sites. This pattern was less apparent for self-ratings for difficulty with vigorous activity. Similarly, it also appeared that Bangladesh reported higher difficulty with memory compared to other Asian and African sites. This pattern however was less apparent for self-ratings for difficulty with learning. Based on objective measures, participants from South Africa were significantly less agile (normal walk speed) compared to Ghana and India. However, there was no significant difference in mobility across the three sites as measured by rapid walk speed (Table 1). Participants from Ghana were significantly stronger (grip strength) compared to South Africa and India. Participants from Ghana had significantly better scores (immediate verbal recall test) compared to South Africa and India. There was no significant difference in scores across sites for all other cognition tests (except significantly lower scores on verbal fluency for participants from South Africa compared to Ghana).
Overall, participants rated vignettes consistent with their order of severity in the mobility domain across all sites (Appendix 2). Similarly, there were no instances of incorrect ordering of vignettes in the cognition domain across all sites except in Kenya, where learning vignette severity level 4 was incorrectly rated lower than vignette severity level 3, and in India, where both memory and learning vignette severity level 5 were rated lower than vignette severity level 4 (Table 2). The proportion of ties between vignette pairs (especially for cognition vignette pair 4 and 5) was higher amongst Asian sites compared to Africa. However, there appeared to be no clear pattern of high proportion of ties between vignette pairs across sites either for mobility or for cognition.
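The kind of check summarized above, the share of respondents who tie or reverse a pair of vignettes, can be sketched in a few lines. This is a plain illustration with made-up ratings; it does not reproduce the R package used in the analysis, and the function name `pair_summary` is hypothetical.

```python
def pair_summary(ratings, lower_level, higher_level):
    """Share of respondents who tie or reverse a pair of vignettes.

    ratings: one list per respondent, ordered by intended vignette severity
    level 1..5; each entry is a rating from 1 (no difficulty) to
    5 (extreme difficulty).
    """
    lo, hi = lower_level - 1, higher_level - 1
    n = len(ratings)
    # A tie: both vignettes receive the same rating.
    ties = sum(1 for r in ratings if r[lo] == r[hi])
    # A reversal: the milder vignette is rated as more difficult.
    reversals = sum(1 for r in ratings if r[lo] > r[hi])
    return ties / n, reversals / n


# Made-up ratings from four respondents
sample = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 5, 5],   # ties vignettes 4 and 5
    [2, 2, 4, 3, 5],   # reverses vignettes 3 and 4
    [1, 3, 3, 5, 4],   # reverses vignettes 4 and 5
]
tie_share, reversal_share = pair_summary(sample, 4, 5)
print(tie_share, reversal_share)  # 0.25 0.25
```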

Testing VE assumption
The mean vignette difficulty ratings in the mobility domain increased with increasing severity level of the vignette across all sites (Table 2). This indicated that overall participants understood mobility dysfunction levels described by the vignettes in the same way across sites. This was also seen for the cognition vignettes for all sites except Kenya, where the mean rating for learning vignette severity level 4 was lower than that of severity level 3, and India, where the mean ratings for both memory and learning vignette severity level 5 were lower than that for severity level 4, though these differences were not significant.
The assumption of VE was formally tested in 9,375 and 8,788 individuals across the eight sites in the domains of mobility and cognition, respectively. The VE assumption was strongly violated across sites in both the mobility and cognition domains (Table 3). However, when the VE assumption was tested within each site, it was not violated in Ghana, Kenya, South Africa, and India for mobility (p-value for global test >0.05).
Siddhivinayak Hirve et al. Individual characteristics which influenced the differential understanding of mobility vignettes were: (i) age in Vietnam; (ii) age and/or education in Tanzania and Indonesia; and (iii) age and/or sex in Bangladesh (Table 4).
In the cognition domain, the pattern was less apparent. The assumption of VE was not violated in Kenya, South Africa, Tanzania, and India for the memory vignettes. However, it was violated in Ghana and South Africa and all the Asian sites except India for the learning vignettes, which were driven largely by age and education, respectively. The individual characteristics which drove the violation of the VE assumption were sex and education in Bangladesh, education in Indonesia, and age in Vietnam for cognition vignettes. However, a less stringent graphical way of testing the assumption of VE showed that there were minimal differences across sites in the predicted locations of each of the mobility vignettes (Fig. 1a and b). A consistent increasing trend in predicted location was also seen from vignette severity level 1 to vignette severity level 4 in reference to vignette severity level 5. In contrast, Tanzania and India had lower predicted locations for cognition vignettes compared to the other sites (Fig. 1c and d).

Testing RC assumption
The assumption of RC was tested in 293 (Navrongo, 148; Agincourt, 105; Vadu, 40) and 373 (Navrongo, 151; Agincourt, 110; Vadu, 112) individuals in the mobility and cognition domains, respectively, in the three sites that had administered mobility and cognition tests as part of the fuller version of SAGE. The assumption of RC was strongly violated across sites (Table 5) and within sites (Table 6) for both mobility and cognition, driven by age, sex, and education. Figure 2 compares the location of predicted thresholds used by the three sites for rating vignettes and for self-rating as derived from objective measures for mobility and cognition. There was a marked difference in the location of the predicted thresholds (test for equality of threshold locations) as identified from both models in all three sites for both mobility and cognition, which suggested that within each site participants used thresholds differently when rating vignettes and when self-rating, thereby violating the RC assumption. However, when trend lines for the thresholds used for vignette ratings and the thresholds used for self-rating derived from the objective measures model were compared (visual test for equality of distance between thresholds), their slopes were moderately similar for the moving around and learning domains for India, whereas the regression line slopes were markedly different for Ghana and South Africa. This suggested that the assumption of RC may not be violated for the mobility and learning domains in India if a less stringent test (equality of distance between thresholds) was used as compared to the more stringent test of equality of thresholds.
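The difference between the two criteria, equality of threshold locations versus equality of distance between thresholds, can be illustrated with a short sketch. The threshold values are hypothetical and the least-squares trend line is one simple way to summarize spacing; the exact fitting procedure behind the figure is not described in the text.

```python
def trend_slope(threshold_locations):
    """Least-squares slope of threshold location against threshold index."""
    n = len(threshold_locations)
    xs = range(1, n + 1)
    x_bar = sum(xs) / n
    y_bar = sum(threshold_locations) / n
    cov = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(xs, threshold_locations))
    var = sum((x - x_bar) ** 2 for x in xs)
    return cov / var


def spacings(threshold_locations):
    """Distances between adjacent thresholds."""
    return [b - a for a, b in zip(threshold_locations,
                                  threshold_locations[1:])]


# Hypothetical predicted thresholds from the two models
vignette_thresholds = [-1.0, 0.0, 1.0, 2.0]
objective_thresholds = [0.5, 1.5, 2.5, 3.5]

print(vignette_thresholds == objective_thresholds)          # False
print(trend_slope(vignette_thresholds),
      trend_slope(objective_thresholds))                    # 1.0 1.0
print(spacings(vignette_thresholds) == spacings(objective_thresholds))  # True
```

In this made-up case the stringent criterion fails because every threshold is shifted, while the less stringent criterion is met because the spacing, and hence the slope, is the same.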

Discussion
Our study provides evidence of violations of assumptions of RC and VE when anchoring  (27,28), while others have shown adherence to these assumptions (9,20,21). The lack of adherence to assumptions in our study could be because individuals or groups of individuals understood vignettes differently and/or used different thresholds in rating vignettes and their own disability in mobility and cognition. This in turn could be a function of the wording of the anchoring vignette and the rating question, the context in which it was understood, and the respondent's level of understanding of the five-point ordinal rating scale.
In this article, we analyzed vignettes in two distinct and dissimilar domains of physical and mental health, namely mobility and cognition. We showed that within a country context, older adults (mostly from Africa except Tanzania) understood mobility vignettes in the same way, while in some countries (mostly Asian except India), they understood them differently, with the variability driven by age, sex, and education. This pattern of similar or differential understanding of vignettes by countries was less apparent in the case of cognition vignettes. A less stringent way of testing the VE assumption by visual comparison of predicted locations of vignettes suggested that mobility (but cognition less so) vignettes were understood in the same way by older adults from all countries. Finally, there was evidence of violation of the assumption of RC both across countries and within countries. However, a less stringent visual comparison showed that the RC assumption may not be violated for mobility and cognition vignettes for India. Overall, our study showed a pattern that mobility vignettes are probably better understood by older adults than cognition vignettes. We evaluated the 'informativeness' of each possible set of vignettes by estimating the 'minimum entropy' function (results not shown). Both assumptions were still violated even with a smaller subset of vignettes. Collapsing the response categories from five to fewer categories may improve the possibility of the assumptions being met. However, this strategy would be valid only if adopted a priori, as the response category thresholds used by respondents on a four-point ordinal scale may not necessarily be the same as the thresholds derived by collapsing a five-category response to a four-category response post hoc. We also chose not to use non-parametric or parametric statistical models which required less strict assumptions (19,26) to ensure that the assumptions of VE and RC were met.
Our study was limited by the smaller samples available for testing the assumption of RC compared to VE and by the fact that this assumption could only be tested in Ghana, South Africa, and India. When we tested the RC assumption, that is, compared the model which predicted thresholds used for rating vignettes with the model which derived thresholds based on objective measures to see whether participants used the same thresholds for self-rating and vignette rating, we presumed, justifiably or otherwise, that the objective measures of mobility (normal walk speed, etc.) and cognition (verbal recall, etc.) would capture all the co-variation between the latent mobility and cognition, and the observable characteristics that may influence RH. If so, then any remaining systematic variation seen in self-rating after conditioning on these objective measures could be attributed to RH. We used vignettes adapted from the World Health Survey of 2003, which had been implemented in 70 countries; further research is needed to see if revising the contents and wording of the vignettes (especially for memory and learning function) improves the performance of vignettes both from the perspective of VE as well as RC.
Despite the time and effort, vignettes are important as they provide information on whether individuals or groups of individuals use different thresholds to rate health. Assuming that the health level described by a vignette is understood in the same way by individuals (VE), vignette ratings will identify RH; and assuming that individuals will use the same thresholds to rate vignettes as they rate their own health (RC), vignette ratings will allow the self-rating of their own health to be adjusted for RH. These are essential requirements before any self-rated health function can be compared between individuals or groups of individuals.