Psychometric properties of the PHQ-9 measure of depression among Brazilian older adults

Abstract
Objectives: To obtain evidence on the psychometric properties of the Patient Health Questionnaire-9 (PHQ-9, one of the most extensively used tools for assessing depression) in the Brazilian older population. Method: Data on 3,356 Brazilian adults aged 60+ years living in Guarulhos, São Paulo state were used. The factor structure of the questionnaire was analysed using a factor analysis approach. The questionnaire’s measurement equivalence was tested across gender, age, personal income, and education level groups. The scores were compared across groups based on the highest level of equivalence achieved. The questionnaire’s internal consistency was analysed considering its factor structure. Results: A one-factor solution was identified as the most adequate factor structure, with the factor explaining 57.6% of the items’ variance. The correlation of the resulting latent score with the overall raw sum score in the PHQ-9 was r = 0.96. Measurement equivalence regarding thresholds and loadings was achieved for all tested groups. On average, women, older, less educated, and poorer people had higher latent scores on the depression factor. The measure showed a good internal consistency with Revelle’s omega total ω_t = 0.92. Conclusion: The results suggest that, among Brazilian older adults living in Guarulhos, São Paulo state, the PHQ-9 measures depressive symptomatology equivalently across different sociodemographic subgroups. Moreover, it can be scored using the raw sum of the item scores to adequately reflect different levels of depressive symptomatology.


Introduction
Depression is one of the leading causes of disability and global burden worldwide (Vigo et al., 2016; World Health Organization, 2017). A recent meta-analysis (Moreno-Agostino et al., 2021) shows that there is a predominantly increasing trend in the prevalence of depression in population-based studies using equivalent sampling and depression assessment procedures across time points. This interacts with a global shift in demographics, characterised by an increase in the proportion of the older population worldwide (United Nations, 2017). In low- and middle-income countries (LMICs), evidence suggests that depression in older adulthood is both common and burdensome (Guerra et al., 2016). This is the case in Brazil, where the estimated prevalence of depression in adults aged 60-64 years is 11.1%, 9.9% in those aged 65-74, and 6.9% among those aged 75+ (Stopa et al., 2015).
Efforts are being made to develop feasible and scalable interventions to address late life depression in Brazil (Scazufca et al., 2019, 2020), considering the limited resources and reduced ratios of mental health professionals. Adequate measures of depression in older adulthood are needed to ensure the clinical relevance of the assessments (i.e. in relation to depressive symptomatology levels), as well as their comparability across relevant subgroups (e.g. different genders, age groups, and educational levels), both in interventional and observational designs.
The Patient Health Questionnaire-9 (PHQ-9) (Kroenke et al., 2001) is an extensively used tool for assessing depression. It can be employed as a screening tool for depression, and also to ascertain levels of and changes in depressive symptomatology (Kroenke et al., 2010). There is an extensive body of literature providing evidence on the psychometric properties of the PHQ-9. Some studies have focused on the diagnostic validity of the PHQ-9 (i.e. its ability to differentiate between people with and without depression), supporting the use of a score of 10+ as an adequate classification threshold for major depression (Kroenke et al., 2010; Manea et al., 2015). Other studies have analysed the PHQ-9 factor structure, showing the existence of a single underlying factor structure (Cameron et al., 2008; González-Blanch et al., 2018; Keum et al., 2018; Kocalevent et al., 2013; Titov et al., 2011) or a two-factor structure (Beard et al., 2016; Chilcot et al., 2013; Elhai et al., 2012; Krause et al., 2010; Richardson & Richards, 2008) with a somatic factor (typically accounting for sleep, energy, and appetite disturbances) and a cognitive/affective factor (usually accounting for the depressed mood, loss of interest in activities, feelings of worthlessness, and thoughts of death and self-harm).
Notwithstanding these studies, there is a paucity of evidence on the PHQ-9 psychometric properties in the Brazilian population and no evidence focused on older Brazilian adults. To our knowledge, the only two studies that have been conducted cover validity in the general population (Santos et al., 2013) and among women in primary care settings (de Lima Osorio et al., 2009), supporting the use of a 9+ and 10+ cut-off score respectively for assessing depression. Thus, there is no available evidence on the factor structure of the PHQ-9 and the degree to which it measures depression equivalently across different subgroups of Brazilian older adults.
This lack of evidence hinders the interpretation of the resulting PHQ-9 scores as an adequate reflection of the depression levels. Moreover, it precludes confidence in the comparability across older adults with different genders, age groups, or educational levels, since different scores may be distorted by differences in other aspects rather than in the underlying levels of depression (for instance, differences across groups in the relative relevance of the different items).
This study aims to provide evidence of the factor structure of the PHQ-9 questionnaire in a sample of Brazilian older adults, as well as to analyse potential measurement differences across older adults of different genders, age groups, and educational levels.

Sample and procedure
The sample included the 3,356 adults aged 60+ years old who participated in the screening stage of the first wave of the PROACTIVE cluster randomised controlled trial (Scazufca et al., 2020), interviewed between May 2019 and February 2020. Participants lived in Guarulhos, in the São Paulo metropolitan region. Among the 69 existing Unidades Básicas de Saúde (UBSs, Basic Health Units) in Guarulhos, the 24 containing at least four Family Health Teams were selected.
After randomly selecting four of them as reserve (two from each of the strata defined by the median percentage of individuals either with no education or only a literacy programme for adults), those aged 60+ years enrolled in the remaining 20 UBSs (10 in each of the intervention and control arms of the trial) were contacted using a randomly ordered list. In the first contact, a screening questionnaire including the PHQ-9, as well as basic sociodemographic information including age, gender, income, and educational level, was administered to all participants. The aim of this screening phase was to identify people with depression (operationalised as a PHQ-9 score of 10+) for the PROACTIVE trial. Those participants who fulfilled the inclusion criteria and did not meet any of the exclusion criteria (i.e. individuals whose partner was already included in the study, individuals with acute suicidal risk, hearing or vision loss, or unable to communicate, and individuals unable to engage in the trial for a 12-month period) were asked to participate in the trial. Additional information was then collected in a baseline questionnaire.
All participants provided either oral or written consent to participate. The study was approved by the Ethics Committee of the University of São Paulo Medical School (CEP FMUSP number 2.836.569), and authorised by the Guarulhos Health Secretary.

PHQ-9
All participants completed the PHQ-9 questionnaire (Kroenke et al., 2001), which is a development of the PRIME-MD patient health questionnaire (Spitzer et al., 2000). The version of the questionnaire used in this study was originally translated into Brazilian Portuguese by Fraguas et al. (2006) and is available online (https://www.phqscreeners.com/). The PHQ-9 comprises nine items, each of them covering one of the symptoms of depression according to the DSM-IV criteria (American Psychiatric Association, 1994). The participants were asked how often they experienced each of those symptoms over the last two weeks. The response options to these questions were 'Not at all' (0), 'Several days' (1), 'More than half the days' (2), and 'Nearly every day' (3). We used the number of days response set ('0-1 day', '2-6 days', '7-11 days', '12-14 days') whenever the participant had any difficulty understanding the standard verbal response options. As the level of literacy among older adults in Brazil is generally low, the research assistants read aloud the PHQ-9 questions to all study participants. These alternative ways of administering the PHQ-9 have been reported previously and have been found acceptable (Kroenke et al., 2010). The responses to all nine items are then summed, resulting in an overall score that ranges from 0 to 27. Scores of 10+ are commonly used as a screening criterion for depression.
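The scoring rule described above (nine items coded 0 to 3, summed to a 0-27 total, with 10+ as the screening criterion) can be sketched as follows. This is a minimal illustration rather than the code used in the study; the function names and the example responses are hypothetical.

```python
from typing import Sequence

PHQ9_ITEMS = 9          # number of items in the questionnaire
SCREENING_CUTOFF = 10   # the 10+ screening criterion mentioned in the text


def phq9_total(responses: Sequence[int]) -> int:
    """Return the raw PHQ-9 sum score (0-27) for one respondent."""
    if len(responses) != PHQ9_ITEMS:
        raise ValueError(f"expected {PHQ9_ITEMS} responses, got {len(responses)}")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each response must be coded 0, 1, 2, or 3")
    return sum(responses)


def screens_positive(responses: Sequence[int], cutoff: int = SCREENING_CUTOFF) -> bool:
    """Return True when the raw sum score reaches the screening cut-off."""
    return phq9_total(responses) >= cutoff


# Hypothetical respondent, for illustration only.
example = [1, 2, 0, 3, 1, 0, 2, 1, 0]
print(phq9_total(example))        # 10
print(screens_positive(example))  # True
```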
Sociodemographic variables. Information on age, gender, education, and personal income was collected as part of the screening questionnaire. Education was assessed as the grade studied and recoded into the following groups reflecting the number of years studied: 'None' (including those who did not study or just attended literacy for adults), '1-4 years' (including those who studied primary school up to the 4th grade), '5-8 years' (including those who attended primary school up to the 8th grade), and 'More than 8 years' (including those who attended secondary or technical school, or higher education). Personal income was assessed as the monthly number of minimum wages (corresponding to R$998.00 at the time of the study), and recoded into the following groups: 'Up to 1', 'Between >1 and 2', and 'More than 2'.
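The recoding of education and personal income into the groups listed above could be expressed as in the sketch below. It assumes that education is available as a whole number of years studied (with 0 standing for no schooling or adult literacy only) and that income is expressed in minimum wages; both input formats and the function names are assumptions made for illustration, not a description of the study's data files.

```python
MINIMUM_WAGE_BRL = 998.00  # monthly minimum wage at the time of the study (from the text)


def education_group(years_studied: int) -> str:
    """Map years studied to the four education groups (input format is assumed)."""
    if years_studied < 0:
        raise ValueError("years_studied must be non-negative")
    if years_studied == 0:
        return "None"
    if years_studied <= 4:
        return "1-4 years"
    if years_studied <= 8:
        return "5-8 years"
    return "More than 8 years"


def income_group(minimum_wages: float) -> str:
    """Map personal income, in minimum wages per month, to the three income groups."""
    if minimum_wages < 0:
        raise ValueError("minimum_wages must be non-negative")
    if minimum_wages <= 1:
        return "Up to 1"
    if minimum_wages <= 2:
        return "Between >1 and 2"
    return "More than 2"


# Hypothetical participant: 6 years of schooling, R$1,497.00 per month.
print(education_group(6))                         # 5-8 years
print(income_group(1497.00 / MINIMUM_WAGE_BRL))   # Between >1 and 2
```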

Statistical analyses
To allow for the use of a cross-validation approach, the overall sample was randomly split into two groups of approximately equal size. The first half of the sample was used to implement an exploratory factor analysis (EFA) approach. First, we explored the adequacy of the data to perform the EFA by means of the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy and Bartlett's test of sphericity, as well as the visual inspection of the correlation matrix among the items. Due to the nature of the PHQ-9 response options (ordered categories), the polychoric correlation matrix was used. Provided the preliminary results were satisfactory, a parallel analysis was performed using 50 simulated datasets, extracting the eigenvalues with a principal components analysis. We compared the 95th percentile of the simulated eigenvalues with the observed eigenvalues, retaining those factors whose observed eigenvalues were higher than the simulated ones (Hayton et al., 2004). A scree plot with the observed and simulated eigenvalues was also used to guide the decision of the optimal number of factors to initially extract, retaining the number of factors up to the point where the eigenvalues begin to level out.
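The parallel analysis step can be illustrated with the sketch below, which compares observed eigenvalues against the 95th percentile of eigenvalues from 50 simulated datasets. Two simplifications are assumed here and do not reflect the study's procedure: Pearson correlations stand in for the polychoric correlations, and the simulated data are standard normal noise. The function name and the synthetic one-factor data are hypothetical.

```python
import numpy as np


def parallel_analysis(data, n_sims=50, quantile=0.95, seed=0):
    """Return observed eigenvalues, simulated thresholds, and components to retain."""
    rng = np.random.default_rng(seed)
    n_obs, n_items = data.shape

    # Eigenvalues of the observed correlation matrix, largest first.
    # Assumption: Pearson correlations replace the polychoric matrix used in the study.
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]

    # Eigenvalues of correlation matrices computed from pure-noise datasets
    # with the same number of rows and columns as the observed data.
    simulated = np.empty((n_sims, n_items))
    for i in range(n_sims):
        noise = rng.standard_normal((n_obs, n_items))
        simulated[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    threshold = np.quantile(simulated, quantile, axis=0)

    # Retain the leading components whose observed eigenvalue exceeds the
    # simulated threshold, stopping at the first one that does not.
    exceeds = observed > threshold
    n_retain = n_items if exceeds.all() else int(np.argmin(exceeds))
    return observed, threshold, n_retain


# Synthetic data with a single common factor (illustration only).
rng = np.random.default_rng(1)
latent = rng.standard_normal((1000, 1))
items = 0.7 * latent + np.sqrt(1 - 0.7 ** 2) * rng.standard_normal((1000, 9))
observed, threshold, n_retain = parallel_analysis(items)
print(n_retain)
```

With data generated from one common factor, as in this example, the procedure is expected to retain a single component.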
Then, EFAs with increasing numbers of factors were computed over the polychoric correlation matrix, using a weighted least squares (WLS) extraction method and an oblique rotation method (PROMAX). The different solutions were then compared based on their fit indices [including χ², Root Mean Square Residuals (RMSR), Root Mean Square Error of Approximation (RMSEA), Tucker-Lewis Index (TLI), and Bayesian Information Criterion (BIC)], the meaningfulness of the factors (their interpretability and the percentage of items' variance accounted for) and the parsimony of the model. Models with lower χ², RMSR, and BIC values would reflect a better fit to the data, whereas RMSEA values below 0.060 and TLI values above 0.950 are often considered indicators of good fit based on commonly used guidelines (Hu & Bentler, 1999).
Once the most adequate factor solution was identified, it was cross-validated in the second half of the sample by means of a confirmatory factor analysis (CFA) approach. Again, due to the nature of the items, a Diagonally Weighted Least Squares (DWLS) estimation was used, obtaining mean and variance adjusted (WLSMV) fit indices. The goodness of fit of the CFA model was assessed by means of the rescaled fit indices [χ², Comparative Fit Index (CFI), TLI, and RMSEA], which were interpreted using the Hu and Bentler (1999) guidelines.
We then proceeded to analyse the measurement equivalence of the PHQ-9 across genders, age groups (60-69, 70-79, and 80+ years), and education and personal income levels. To achieve this, we used a measurement equivalence (measurement invariance) strategy implemented by means of multiple group CFA. This strategy is based on the comparison of models with increasing numbers of constraints to ascertain the extent to which the measures obtained across groups can be considered equivalent. Due to the ordered nature of the PHQ-9 items, we based our analytical strategy on the recent developments and recommendations on measurement equivalence testing with ordered categorical data (Svetina et al., 2020; Wu & Estabrook, 2016). Details on the procedure can be found in Appendix S1 (supplementary material).
The resulting factor scores were compared across categories of the different groups and interpreted considering the highest level of measurement equivalence achieved in the previous analyses. Finally, a measure of the internal consistency of the PHQ-9 questionnaire was obtained in the overall sample, considering the underlying factor structure found in the EFA analyses (Revelle & Zinbarg, 2009).

Results
The random subsample for the EFA comprised 1,623 participants (48.4% of the overall sample). Bartlett's sphericity test was χ²(36) = 9,192.35 (p < 0.001), whereas the KMO measure of sampling adequacy was 0.93. Both these values, along with the visual inspection of the polychoric correlation matrix (Table S1, supplementary material), which displayed positive correlations ranging from 0.40 to 0.76, suggested the adequacy of the data for performing EFA. The results of the parallel analysis are shown in Table S2 (supplementary material), and the resulting scree plot is shown in Figure S1 (supplementary material). A first component explained 62.14% of the variance, the ratio of the first to the second observed eigenvalues was 8.34, and the size of the observed eigenvalues dropped below the 95th percentiles of the simulated eigenvalues after the first component, plateauing towards the maximum number of components. According to these results, a single factor should be retained.
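For readers less familiar with Bartlett's test of sphericity, the sketch below shows the standard form of the statistic, χ² = −[(n − 1) − (2p + 5)/6] ln|R|, with p(p − 1)/2 degrees of freedom (36 for the nine PHQ-9 items, as reported above). The equicorrelated matrix used in the example is hypothetical and is not the polychoric matrix from Table S1, so the resulting statistic is not comparable with the reported value.

```python
import numpy as np


def bartlett_sphericity(corr, n_obs):
    """Return Bartlett's chi-square statistic and its degrees of freedom."""
    n_items = corr.shape[0]
    # slogdet gives the log-determinant in a numerically stable way.
    _sign, log_det = np.linalg.slogdet(corr)
    chi2 = -((n_obs - 1) - (2 * n_items + 5) / 6) * log_det
    df = n_items * (n_items - 1) // 2
    return chi2, df


# Hypothetical 9 x 9 correlation matrix with all off-diagonal values set to 0.5.
n_items = 9
corr = np.full((n_items, n_items), 0.5)
np.fill_diagonal(corr, 1.0)

chi2, df = bartlett_sphericity(corr, n_obs=1623)
print(df)  # 36
```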
We therefore extracted a single factor in the first EFA, and performed additional EFAs with increasing numbers of factors up to five. The fit indices of the EFA models are included in Table 1. Models with up to four factors showed increasing levels of fit to the data, with fit decreasing once a fifth factor was included. The models with three and four factors showed an adequate fit based on the usual guidelines (TLI higher than 0.950 and RMSEA lower than 0.060, although the RMSEA of the three-factor model was only slightly below this value). However, the resulting factors in both the three- and four-factor solutions were highly correlated with each other (all correlations above 0.73), and the RMSR was acceptable in all solutions. Moreover, the increased complexity of these models compared with the one-factor solution did not translate into a greater proportion of variance explained (57.6%, 48.1%, and 55.2% in the one-, three-, and four-factor solutions, respectively). Additionally, all factor loadings in the one-factor solution were substantively high, with a minimum loading of 0.64. Considering these statistics, and to use the most parsimonious and generalisable model, the one-factor model was selected as the optimal one.
The random sample for the CFA analyses included 1,733 participants. The unidimensional CFA model showed acceptable fit with the following mean- and variance-adjusted fit indices: χ²(27) = 202.73 (p < 0.001); CFI = 0.985; TLI = 0.980; RMSEA = 0.061 (p = 0.009). The correlation of the latent score with the overall raw sum score in the PHQ-9 in this subsample was r = 0.96.
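As a rough consistency check on the figures above, the textbook RMSEA formula, sqrt(max(χ² − df, 0) / (df × (N − 1))), can be applied to the reported χ² statistic, degrees of freedom, and subsample size. This is only a sketch: the reported index is mean- and variance-adjusted, and the exact formula and denominator (N or N − 1) used by the estimation software are assumptions here.

```python
from math import sqrt


def rmsea(chi2: float, df: int, n_obs: int) -> float:
    """Textbook RMSEA point estimate from a chi-square statistic."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n_obs - 1)))


# Values reported for the unidimensional CFA model.
value = rmsea(chi2=202.73, df=27, n_obs=1733)
print(round(value, 3))  # 0.061
```

Under these assumptions the result agrees with the reported RMSEA to three decimal places.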
The results of the multigroup CFA models performed to analyse measurement invariance of the PHQ-9 are shown in Table 2. Invariance of the thresholds and loadings was achieved in all cases (across different genders, age groups, and education and personal income levels), meaning that the resulting PHQ-9 scores were comparable across these groups. Additionally, the factor scores resulting from the main CFA model were compared across the groups in the CFA subsample in order to provide evidence of the known-groups validity (i.e. the extent to which scores differ across the groups as would be expected). The results of these comparisons are included in Table 3. On average, women, older, less educated, and poorer people had higher latent scores on the depression factor. Finally, considering the single factor structure of the PHQ-9 in the sample, Revelle's omega total (ω_t) was computed as a measure of internal consistency, which resulted in ω_t = 0.92.
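To illustrate how an omega coefficient relates to factor loadings, the sketch below uses the common single-factor simplification with standardised loadings λ: ω = (Σλ)² / [(Σλ)² + Σ(1 − λ²)]. This is an assumption made for illustration and is not necessarily the computation behind the reported ω_t; the loadings in the example are hypothetical, since the item-level loadings are not listed in the text.

```python
from typing import Sequence


def omega_one_factor(loadings: Sequence[float]) -> float:
    """Omega for a single-factor model with standardised loadings."""
    common = sum(loadings) ** 2                  # variance due to the common factor
    unique = sum(1 - l ** 2 for l in loadings)   # sum of the items' unique variances
    return common / (common + unique)


# Hypothetical loadings: nine items, all loading 0.70 on the single factor.
omega = omega_one_factor([0.70] * 9)
print(round(omega, 2))  # 0.9
```

Higher and more homogeneous loadings push the coefficient towards 1, which is why a minimum loading of 0.64 across nine items is compatible with a high omega.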

Discussion
This is the first study to provide evidence on the factor structure, measurement invariance, known-group validity, and internal consistency of the widely used PHQ-9 questionnaire among Brazilian older adults living in urban areas that are predominantly socioeconomically deprived.
Regarding the factor structure, our study suggests that, in this population, the questionnaire has an essentially unidimensional structure. This structure has been previously reported in other populations (Cameron et al., 2008; González-Blanch et al., 2018; Keum et al., 2018; Kocalevent et al., 2013; Titov et al., 2011). A two-factor structure reflecting a difference between somatic and non-somatic (more cognitive and affective) symptomatology has also been widely reported in the literature (Beard et al., 2016; Chilcot et al., 2013; Elhai et al., 2012; Krause et al., 2010; Richardson & Richards, 2008). However, we found that increasing the number of factors did not result in a substantially better explanation of the shared variance across the items, and thus we selected the simplest solution (one factor) for the sake of parsimony and generalisability of the model. Moreover, we found that the latent factor score resulting from considering a single depression dimension correlated very highly with the raw sum of item responses following the scoring instructions of the questionnaire (Spitzer et al., 2000). Altogether, our findings support the appropriateness of this latter (and simpler) scoring approach, in which all items reflect a single depression dimension and contribute equally to the resulting score.
Regarding the degree of measurement invariance of the PHQ-9 across different demographic and socioeconomic groups, we found that the questionnaire performed equivalently among Brazilian older adults regardless of their gender, education, age group, and personal income. Differences in the resulting scores across these subgroups are therefore due to differences in the underlying depression levels and not to other aspects (e.g. the relative importance of different items or the way in which different symptoms' intensities contribute to the underlying latent factor of depression). These findings are in line with previous studies showing similar invariance features in other populations (González-Blanch et al., 2018; Keum et al., 2018), providing further justification for using this measure to derive meaningful comparisons of the depression levels across these subgroups in the Brazilian older population.
Our study also provides evidence on the known-group validity of the PHQ-9 questionnaire by showing the expected differences in the depression levels across these different demographic and socioeconomic groups, once the potential existence of measurement differences has been ruled out. Our results are in line with previous studies in the general (Alonso et al., 2004; Lorant et al., 2003) and older (Cole & Dendukuri, 2003; Fiske et al., 2009) population by showing that female, less educated and poorer older adults show higher levels of depressive symptomatology than their male, more educated and higher income counterparts.

Strengths and limitations
The results of this study must be interpreted in the light of several strengths and limitations. Among the strengths, we used information on a large sample of Brazilian older adults living in a relatively socioeconomically deprived area of São Paulo state. By focusing our study on an LMIC population, we narrow the existing divide in research (and, more specifically, mental health research), which disproportionately over-represents populations from high-income countries, even though these account for less than 20% of the global population (Saxena et al., 2006; World Bank, 2021). Moreover, the random sampling procedures and the large sample size used make our results easier to generalise to the reference population. From a methodological point of view, we used a robust approach to the analysis of tests with ordered categorical items and updated guidelines for the analysis of the measurement invariance of such tests, overcoming the limitations of previous research using less appropriate analytical approaches developed for the analysis of continuous measures (Svetina et al., 2020; Wu & Estabrook, 2016).
Notwithstanding this, several limitations should be considered when interpreting our results. First, despite the strengths derived from the sampling procedures and resulting sample, the results may not be generalisable to the overall Brazilian older population, which also includes older adults living in different urban and rural areas, where the context characteristics (e.g. socioeconomic conditions or even access to mental health services) may be substantially different. Second, due to the available information in the administered questionnaires, we could not consider alternative economic variables (e.g. household income or household wealth) that may have more adequately reflected the economic conditions of older adults than personal income. Likewise, we could not perform the measurement invariance analysis by other relevant subgroups (e.g. ethnicity) nor include additional variables such as disability, anxiety, loneliness, or additional measures of depression that may have been relevant for the concurrent and discriminant validity analyses (Cole & Dendukuri, 2003; Hooker et al., 2019). Information on some of these variables was only available for a subset of the participants who presented at least moderate levels of depressive symptomatology (PHQ-9 ≥ 10) and who consented to take part in the randomised trial; the inevitable lack of generalisability to the wider population meant that these variables were therefore not analysed further for the purposes of this study. Nevertheless, the results found using personal income and other sociodemographic variables are in line with the previous literature, thus supporting the known-group validity of the PHQ-9 questionnaire.

Conclusions
Overall, our study provides evidence of the adequacy of the PHQ-9 questionnaire for assessing the depression levels of Brazilian older adults living in urban areas. Our results support the use of the widely used and relatively simple scoring procedures (the raw sum of item responses) to adequately reflect the depressive symptomatology levels in this population. Moreover, they suggest that the PHQ-9 measures depression equivalently across relevant sociodemographic subgroups, opening the way to the study of differences in the PHQ-9 scores after ruling out the potential effect of the lack of measurement invariance across these groups.