The assessment of psychopathology among traumatized refugees: measurement invariance of the Harvard Trauma Questionnaire and the Hopkins Symptom Checklist-25 across five linguistic groups

ABSTRACT Background: Questionnaires are widely used to assess the mental health status of refugees, but their construct validity remains largely unexplored. Objective: This study examined the construct validity of two widely used instruments for the assessment of PTSD symptoms (Harvard Trauma Questionnaire [HTQ]; 16 items) and symptoms of anxiety and depression (Hopkins Symptom Checklist-25 [HSCL-25]; 25 items) among Dutch and refugee patients with different linguistic backgrounds. Method: We applied exploratory factor analyses and measurement invariance analyses to test construct validity. Participants (n = 1256) were divided into five linguistic groups defined by language family, including four non-western linguistic groups (Indo-Iranian [n = 262], Niger-Congo [n = 134], Semitic [n = 288], and South Slavic languages [n = 199]) and one western linguistic group (Germanic languages; Dutch [n = 373]). Results: Exploratory factor analysis yielded a 3-factor structure of the HTQ and a 2-factor structure of the HSCL-25. Measurement invariance analyses on the HTQ showed strong measurement invariance across the groups of refugee patients. However, Dutch patients reported milder symptom severity on most items of the HTQ. Measurement invariance analyses on the HSCL-25 (not conducted in Dutch patients) indicated partial strong measurement invariance across refugee patients. Conclusion: We conclude that mental health constructs measured by the HTQ and the HSCL-25 are to a large extent interpreted in a similar way by refugee patients. This indicates that these instruments can be applied in non-western refugee patient populations, and that local idioms of distress and inherent response patterns may not play a major role when applying the HTQ and the HSCL-25 in these populations.
Yet, whereas meaningful comparisons of observed PTSD and depression scores between groups of refugee patients with different non-western linguistic backgrounds are feasible, comparisons between patients with a western and non-western linguistic background, as well as comparisons of anxiety scores, are likely to be biased. Future studies need to establish whether the commonly used cut-off scores of both questionnaires apply for refugee patients with non-western linguistic backgrounds.


Introduction
On the borders of Europe, the initial hope of the Arab democratic uprisings in 2011 (the Arab Spring) has faded away, and in some cases this political transformation has culminated in protracted civil wars, such as in Syria. The consequent influx of refugees into Europe (UNHCR, 2015) is currently dominating the news and the political debate. War experiences, persecution, hunger, loss of loved ones, a long and unsafe journey, and settlement in refugee camps all take their mental and physical toll (de Jong, Komproe, & Van Ommeren, 2003; Hassan, Ventevogel, Jefee-Bahloul, Barkil-Oteo, & Kirmayer, 2016). The resulting long- and short-term mental health and psychosocial consequences are many and varied, and a proportion of refugees seek health care for these mental health problems in their host country (de Jong, 2002; de Jong, Komproe, & Van Ommeren, 2003; Hassan et al., 2016).
To assess the impact of experiences among arriving refugees, tools such as mental health questionnaires are widely used (e.g. Buhman et al., 2014; Hollifield et al., 2013). At the individual level, assessment tools help clinicians to triage patients, target symptoms, and assess treatment outcomes (Rasmussen, Verkuilen, Ho, & Fan, 2015). At the group level, assessments provide information about subpopulations that need treatment resources, therapeutic modalities that are more effective, and mental health information about patient populations in general (see Narrow, Rae, Robins, & Regier, 2002; Rasmussen et al., 2015).
In a previous editorial of this journal, Olff (2015) highlighted that the selection of the appropriate instruments to assess psychological complaints among refugees is not an easy task. Within this selection process, clinicians also need to bear in mind whether an instrument is valid across different linguistic groups (see Fodor, Pozen, Ntaganira, Sezibera, & Neugebauer, 2015). Health professionals are using assessment tools in linguistic groups for whom these instruments were not originally developed (Miller, Kulkarni, & Kushner, 2007; Rasmussen et al., 2015).
Questionnaires are used to measure a variety of complaints and emotions, such as depression, anxiety, and post-traumatic stress symptoms. Despite evidence of underlying universality in the experience of these mental health complaints, differences in the salience, manifestation, and expression of symptoms may be substantial across various cultures (e.g. Sweetland, Belkin, & Verdeli, 2014). In many non-western contexts the words 'depression' or 'anxiety' do not have direct equivalents. A local language may use a number of expressions, metaphors, proverbs, or emotion words to express a complaint or an emotion that is quite different from western jargon (de Jong, 2002; Kaiser et al., 2015). The ways in which people label complaints or emotions are termed 'idioms of distress' (de Jong, 2002).
Differences in idioms of distress may simply refer to a different wording of the same mental health concept, but at worst these linguistic differences may reflect actual differences in mental health concepts (Miller et al., 2007; cf. Wind, Joshi, Kleber, & Komproe, 2014; Poortinga, 1975). In the latter scenario, the consequence could be that the covariance between the items of the questionnaire that refer to the latent mental health concept may be different across linguistic groups (see Dyer, Hanges, & Hall, 2005). Thus, differences in idioms of distress may ultimately undermine the validity of mental health questionnaires that were developed in western languages. We do not know which scenario holds true until we examine the validity of mental health questionnaires among linguistic groups.
Building upon previous research on global mental health in this journal (Hall & Olff, 2016; Purgato & Olff, 2015), the aim of this study was to examine whether two widely used mental health questionnaires, the Harvard Trauma Questionnaire (HTQ) and the Hopkins Symptom Checklist-25 (HSCL-25), assess the same mental health concepts across groups of refugees with different linguistic backgrounds. If the instruments measure the same mental health concepts across groups with different linguistic backgrounds, the mental health questionnaires are called 'measurement invariant' across these groups. If measurement invariance (MI) can be demonstrated, this implies that the items of the mental health questionnaires as well as the mental health concepts they are measuring are interpreted in the same way by individuals with different linguistic backgrounds (Horn & McArdle, 1992; Van De Schoot, Lugtig, & Hox, 2012). Only when MI holds for a mental health questionnaire are cross-group differences in scores on mental health constructs meaningful (Meredith, 1993; Steenkamp & Baumgartner, 1998; Van De Schoot et al., 2012). Methodologically, scholars have examined MI of mental health questionnaires using confirmatory factor analysis (CFA; Charak et al., 2014; Contractor et al., 2015; Fodor et al., 2015; Schnyder et al., 2015), a widely used technique for this purpose.
The wealth of factor analytic research on mental health questionnaires has not been evenly distributed across languages and cultures. Studies from non-Euro-American samples are exceedingly sparse. Fodor et al. (2015) examined the factor structure of PTSD within a group of Rwandan adults who experienced trauma during the 1994 genocide using the PTSD Checklist-Civilian Version (PCL-C). Their results suggest that the latent structure of PTSD found in this sample was comparable to Euro-American samples. Charak et al. (2014) found that the Dysphoric Arousal Model of PTSD assessed by the PCL-C was the best model in an Indian sample, although the fit indices of all PTSD models were fairly similar, which underlines the cross-cultural validity of PTSD symptomatology. In a study by Contractor et al. (2015), the structural invariance of a 5-factor PTSD model across Hispanic and Caucasian groups was supported.
Previously, Rasmussen and colleagues examined MI of the most widely used PTSD measure in refugee populations, the HTQ (Mollica et al., 1992). They showed that posttraumatic stress is conceptually comparable in a multinational and multilingual sample of asylum seekers from 81 countries of origin in 11 global regions, yet comparisons of mean and sum scores and of symptoms over time were not meaningful. These findings called into question the common practice of using standard cut-off scores on PTSD measures across culturally dissimilar refugee populations.
Thus, we examined MI of symptom severity of depression and anxiety as assessed by the HSCL-25 and of posttraumatic distress as assessed by the HTQ in a large sample of refugees across four non-western linguistic groups (Indo-Iranian languages, Niger-Congo languages, Semitic languages, and South Slavic languages) and one western linguistic group (Germanic languages).

Participants
Participants were Dutch and refugee patients referred for treatment at Foundation Centrum '45, a specialized Dutch centre for treatment and diagnosis of complex psychotrauma (i.e. PTSD with comorbid disorders). In 2001, Foundation Centrum '45 started to routinely monitor treatment outcomes by administering questionnaires to patients during treatment. For the present study, participants were included for whom intake data collected with the HTQ and/or the HSCL-25 were available. Because the HSCL-25 was administered only to refugee patients, no HSCL-25 data were available for Dutch patients.
Because refugees with a large variety of native languages participated, homogeneous groups were composed based on the language family to which the language of the refugee's country of origin belongs (Katzner, 2002). Language family can be defined as a group of languages which are related because they descend from a common ancestor. Languages within the same family have observable shared characteristics that are not attributed to contact or borrowing. Data on the HTQ and/or the HSCL-25 were available for 1717 participants. A total of 1256 (73%) participants were divided into five main linguistic groups defined by language family (Katzner, 2002): Indo-Iranian languages (included: Iran, Afghanistan); Niger-Congo languages (included: Angola, Burkina Faso, Burundi, Cameroon, Cote d'Ivoire, Gambia, Guinea, Kenya, Liberia, Nigeria, (Democratic) Republic of Congo, Rwanda, Sierra Leone, Togo, Uganda); Semitic languages (included: Algeria, Egypt, Eritrea, Ethiopia, Iraq, Kuwait, Lebanon, Libya, Palestine, Syria); South Slavic languages (included: Bosnia and Herzegovina, Croatia, Macedonia, former Yugoslavia); and Germanic languages (included: The Netherlands). Sample sizes of the linguistic groups of the remaining 461 (27%) participants were too small (N = 1-31) to make a fair comparison between linguistic groups. Therefore, these participants were excluded from the analyses. In the upper part of Table 1, sample sizes and demographic characteristics for the total sample, as well as the five linguistic groups, are presented. Participants with a Germanic (30%) and Niger-Congo (11%) linguistic background constituted the largest and smallest subsample respectively. Sample sizes of the linguistic groups for the MI analysis of the HTQ ranged between 132 and 373, and between 123 and 257 for the MI analysis of the HSCL-25. Participants were mostly male (71%) and had a mean age of 43.3 years.

Measures
The HTQ (Mollica et al., 1992) is a self-report questionnaire assessing traumatic experiences and PTSD symptom severity. The HTQ consists of three parts of which only the second part is used in the present study. In the second part, severity of DSM-IV PTSD-symptoms is assessed by asking participants how much they were bothered by 16 PTSD-symptoms during the past week, rated on a 4-point scale (not at all, a little bit, quite a bit, or extremely). PTSD symptom severity is computed by averaging responses on the list of 16 PTSD-symptoms (range: 1-4). The HTQ recommends a clinical cut-off score of 2.5 to identify clinically significant PTSD.
The HSCL-25 (Mollica et al., 1996) is a self-report questionnaire assessing symptom severity with regard to anxiety and depression. Participants are asked to indicate how much they were bothered by 10 symptoms of anxiety and 15 symptoms of depression during the past week, rated on a 4-point scale (not at all, a little bit, quite a bit, or extremely). Symptom severity with regard to anxiety and depression is computed by averaging responses on the anxiety and depression items (range: 1-4). The HSCL-25 recommends a cut-off score of 1.75 to indicate clinically significant anxiety or depression.
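The scoring rule described for both instruments (mean of item responses on the 1-4 scale, compared with a cut-off) can be sketched as follows. This is a minimal illustration, not the authors' scoring procedure: the function names, the example responses, and the choice to treat only scores strictly above the cut-off as symptomatic are assumptions, and missing responses are not handled.

```python
from statistics import mean

# Cut-off values quoted in the Measures section.
HTQ_CUTOFF = 2.5
HSCL_CUTOFF = 1.75


def scale_score(responses):
    """Average item responses coded 1 (not at all) to 4 (extremely)."""
    if not responses:
        raise ValueError("no item responses given")
    if any(r not in (1, 2, 3, 4) for r in responses):
        raise ValueError("responses must be coded 1-4")
    return mean(responses)


def is_symptomatic(responses, cutoff):
    """Flag a respondent whose mean score lies above the cut-off.

    Assumption: a score exactly equal to the cut-off is not flagged;
    the text does not state how ties are handled.
    """
    return scale_score(responses) > cutoff


# Hypothetical respondent: 16 HTQ symptom items.
htq_items = [3, 3, 4, 2, 3, 3, 2, 4, 3, 2, 3, 3, 2, 3, 4, 3]
htq_score = scale_score(htq_items)
htq_flag = is_symptomatic(htq_items, HTQ_CUTOFF)
```

For the HSCL-25, the same function would be applied separately to the 10 anxiety items and the 15 depression items with `HSCL_CUTOFF`.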
The HTQ and the HSCL-25 are widely used with refugees and are available in many different languages (e.g. Amharic, Dari, English, French, Portuguese, Somali, Spanish, and Turkish). In addition, both instruments were translated into the most common languages spoken by refugees referred for treatment at Foundation Centrum '45 (Arabic, Farsi, Serbo-Croatian, and Russian). Translations were carried out by certified translators or by bilingual staff members of Centrum '45 and were reviewed by other certified translators (see Kleijn, Hovens, & Rodenburg, 2001). For the majority of individuals in each linguistic group, previously translated questionnaires could be used. Translations were, however, not available for all languages within the linguistic groups, and interpreters were used for the minority of individuals in each linguistic group for whom no translated questionnaires were available.
Table 1. Demographic characteristics and symptom severity with regard to PTSD, anxiety and depression for the total sample and the five linguistic groups.

Measurement invariance
In the present study, MI is tested by a typical sequence of factor models with categorical factor indicators representing different levels of MI (for a detailed description see Millsap & Yun-Tein, 2004; Van Den Berg & Lance, 2000). The first level of MI, configural invariance, indicates that the construct under study is conceptualized in the same way by individuals from different groups (Steenkamp & Baumgartner, 1998). Configural invariance is met when the same factor structure holds across groups, but parameter estimates (i.e. factor loadings, thresholds, and residual variances) may vary across groups. When configural invariance is met, this does not mean that individuals respond in a similar way to the items, nor that cross-group comparisons of mean differences on the underlying construct are meaningful. These properties are captured by the second level of MI, strong measurement invariance, which indicates that the strength of the relations between the items and the underlying construct is similar across groups, i.e. that individuals in different groups attribute the same meaning to the construct under study (Steenkamp & Baumgartner, 1998). It also implies that cross-group comparisons of mean differences on the underlying construct are meaningful. Strong MI holds when factor loadings and thresholds are equal across groups (Steenkamp & Baumgartner, 1998). The third and most stringent level of MI, strict measurement invariance, indicates that the underlying construct is measured identically across groups. If this level of MI is not met, cross-group comparisons of mean differences on the underlying construct are still meaningful, although means on the underlying construct are measured with different amounts of error between groups (Steenkamp & Baumgartner, 1998; Van de Schoot, Lugtig, & Hox, 2012). Strict MI holds when factor loadings, thresholds, and residual variances are equal across groups (Steenkamp & Baumgartner, 1998).
If a certain level of MI does not hold, the sequence of model testing stops. Strong MI does not hold when one or more factor loadings or thresholds are not invariant across groups. It has been shown that when strong MI is not met, cross-group comparisons of latent (i.e. not observed) mean differences are still meaningful as long as strong MI holds for at least two items (Byrne, Shavelson, & Muthén, 1989). However, strong MI is necessary for cross-group comparisons of observed sum or mean scores on a scale (Van de Schoot, Lugtig, & Hox, 2012). If strong MI does not hold, it should be established whether there is partial MI. This can be done by scrutinizing parameter estimates and relaxing constraints on those factor loadings and thresholds that show substantive differences between groups (Steenkamp & Baumgartner, 1998; Van de Schoot, Lugtig, & Hox, 2012).
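The three levels of MI can be summarized in a compact formal sketch. The notation below is an assumption introduced for illustration (it does not appear in the original text) and follows the usual threshold formulation for ordinal items: for person $i$ in group $g$ and item $j$ with four response categories, a latent response $y^{*}_{ijg}$ is cut into observed categories by three thresholds.

```latex
% Latent response model for ordinal item j in group g (illustrative notation)
\begin{align}
  y^{*}_{ijg} &= \boldsymbol{\lambda}_{jg}^{\top}\boldsymbol{\eta}_{ig} + \varepsilon_{ijg},
  \qquad \varepsilon_{ijg} \sim N(0, \theta_{jg}), \\
  y_{ijg} &= c \quad \text{if } \tau_{jg,c-1} < y^{*}_{ijg} \le \tau_{jg,c},
  \qquad c = 1,\dots,4,\; \tau_{jg,0} = -\infty,\; \tau_{jg,4} = +\infty .
\end{align}

% Levels of measurement invariance as equality constraints across groups
\begin{align}
  \text{Configural:} \quad & \text{same pattern of zero and non-zero loadings in every group;} \\
  \text{Strong:}     \quad & \boldsymbol{\lambda}_{jg} = \boldsymbol{\lambda}_{j}, \quad
                             \tau_{jg,c} = \tau_{j,c} \quad \text{for all } g; \\
  \text{Strict:}     \quad & \text{strong invariance and } \theta_{jg} = \theta_{j} \quad \text{for all } g .
\end{align}
```

Partial strong MI corresponds to dropping the equality constraints on $\boldsymbol{\lambda}_{jg}$ and $\tau_{jg,c}$ for selected items or groups while retaining them for the rest.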

Statistical analyses
The software package MPlus Version 7.3 (Muthén & Muthén, 1998) was used to establish the factor structure of the items of the HTQ and the HSCL-25 in the study sample in an exploratory factor analysis (EFA) for ordinal data with weighted least squares means and variance adjusted (WLSMV) estimation. An underlying normal distribution was assumed for each item, where the four response categories were divided by three thresholds which were estimated from the data. Several models with different factor solutions were examined. The Kaiser criterion (i.e. eigenvalues of the factors need to be larger than 1.0) and the model fit statistics CFI, TLI, and RMSEA were used to assess the number of latent factors needed to adequately account for the correlation among item scores. The model with the best balance between model fit, parsimony, and interpretability was selected as the best factor model.
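The idea of three thresholds dividing an underlying normal distribution into four response categories can be illustrated with a small sketch. It is not the MPlus estimation routine: it simply converts cumulative response proportions for a single item into standard-normal quantiles, and the function name and the example data are hypothetical.

```python
from collections import Counter
from statistics import NormalDist


def estimate_thresholds(responses, categories=(1, 2, 3, 4)):
    """Return one threshold per boundary between adjacent categories.

    Each threshold is the standard-normal quantile of the cumulative
    proportion of responses at or below that category.
    """
    n = len(responses)
    if n == 0:
        raise ValueError("no responses given")
    counts = Counter(responses)
    thresholds = []
    cumulative = 0
    for category in categories[:-1]:  # the last category has no upper threshold
        cumulative += counts.get(category, 0)
        proportion = cumulative / n
        if not 0 < proportion < 1:
            raise ValueError("cumulative proportion of 0 or 1 gives no finite threshold")
        thresholds.append(NormalDist().inv_cdf(proportion))
    return thresholds


# Hypothetical item: 10% 'not at all', 40% 'a little bit',
# 40% 'quite a bit', 10% 'extremely'.
example = [1] * 10 + [2] * 40 + [3] * 40 + [4] * 10
taus = estimate_thresholds(example)
```

In this symmetric example the middle threshold is zero and the outer two are mirror images. A group that reports milder symptoms on an item would show thresholds shifted upwards, because a higher latent severity is needed to reach each response category.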
Subsequently, MPlus was used to conduct single and multigroup CFA to test different levels of MI of the HTQ and the HSCL-25 across linguistic groups. Configural invariance was tested by fitting the best factor models of the HTQ and the HSCL-25 from the EFA in a multigroup CFA of the total sample and in single-group CFAs for each of the linguistic groups separately. In the multigroup CFA, factor loadings and thresholds were freely estimated across linguistic groups, and residual variances were fixed at one in all groups. Strong MI was tested by fitting a multigroup CFA in which factor loadings and thresholds were constrained to be equal. Residual variances were fixed at one in the first group and freely estimated in the other groups. It was tested whether the fit of the model representing strong MI was better compared to the model representing configural invariance. Partial strong MI was tested by relaxing equality constraints on factor loadings and thresholds for those items that showed substantive cross-group differences with regard to factor loadings and/or thresholds. It was tested whether the fit of the model representing partial strong MI was better compared to the model representing configural invariance. Strict MI was assessed by fixing residual variances at one across those groups in which factor loadings and thresholds could be constrained to be equal. It was tested whether this model fit the data better compared to the model representing (partial) strong MI.
Single and multigroup CFAs with categorical factor indicators were estimated with the WLSMV estimator using the THETA parameterization. CFI, TLI, and RMSEA were used to assess model fit. For CFI and TLI, model fit is considered good when values are close to .95 (Hu & Bentler, 1999). It must be noted that TLI is sensitive to small sample sizes (Van de Schoot, Lugtig, & Hox, 2012). RMSEA is considered adequate when the value is < .08 and good when it is < .05 (Browne & Cudeck, 1993; Schermelleh-Engel & Moosbrugger, 2003). The difference in goodness-of-fit between nested MI models was evaluated by the χ² difference test and the difference in CFI between two nested models. The 'difftest' option in MPlus was used for appropriate χ² difference testing with the WLSMV estimator (Muthén & Muthén, 1998). The χ² difference test is highly sensitive to sample size such that even trivial differences between two nested models may be significant (Cheung & Rensvold, 2002). As an alternative, it has been suggested to interpret the χ² difference test by the ratio of the χ² value to its degrees of freedom (χ²/df). A χ²/df ratio of less than 3 indicates that the nested model does not fit worse than the more complex model (Schermelleh-Engel & Moosbrugger, 2003). A difference in CFI < 0.01 likewise indicates that the nested model does not fit worse than the more complex model (Cheung & Rensvold, 2002).
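The decision rules above can be written down as a small helper. This is an illustrative sketch only: the function names, the 'poor' label for RMSEA values of .08 or higher, and the input values in the usage example are assumptions, and the χ² difference statistic itself must come from the estimation software (e.g. the 'difftest' output), since it cannot be obtained by subtracting WLSMV χ² values.

```python
def rmsea_label(rmsea):
    """Classify RMSEA using the cut-offs quoted in the text."""
    if rmsea < 0.05:
        return "good"
    if rmsea < 0.08:
        return "adequate"
    return "poor"  # label for values >= .08 is an assumption


def compare_nested(delta_chi2, delta_df, cfi_complex, cfi_nested):
    """Apply the chi-square/df and delta-CFI rules to two nested models.

    delta_chi2 and delta_df are the statistic and degrees of freedom of the
    chi-square difference test; cfi_complex and cfi_nested are the CFI values
    of the less and more constrained model, respectively.
    """
    if delta_df <= 0:
        raise ValueError("delta_df must be positive")
    ratio = delta_chi2 / delta_df
    delta_cfi = cfi_complex - cfi_nested
    return {
        "chi2_df_ratio": ratio,
        "ratio_supports_nested": ratio < 3,
        "delta_cfi": delta_cfi,
        "cfi_supports_nested": delta_cfi < 0.01,
    }


# Hypothetical values, not taken from the study's tables.
result = compare_nested(delta_chi2=250.0, delta_df=100,
                        cfi_complex=0.970, cfi_nested=0.950)
```

In this hypothetical case the two criteria disagree (ratio of 2.5 below 3, but a CFI drop of about .02), which is the kind of pattern that prompts inspecting individual loadings and thresholds for partial invariance.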

Results
Symptom severity with regard to PTSD, anxiety, and depression for the total sample, as well as the five linguistic groups, is presented in the lower part of Table 1. Mean severity level of posttraumatic stress symptoms was 2.9 in the total sample, with 74% of participants being symptomatic for posttraumatic stress disorder (ranging between 49% in the linguistic group of Germanic languages and 89% in the linguistic group of Semitic languages). Mean symptom severity with regard to anxiety and depression was 2.9 in the total sample, with 94% and 96% being symptomatic for anxiety and depressive disorder respectively. Differences in the number of participants being symptomatic for anxiety and depressive disorder between the linguistic groups were small.

EFA and MI analysis of the HTQ
First, EFA was conducted on the total sample in order to establish the factor structure of the HTQ in the present sample. Based on model fit and eigenvalues, EFA yielded a 3-factor solution as a good fit for the 16 items of the HTQ. Table 2 presents the unstandardized Geomin rotated factor loadings and eigenvalues of the 3-factor solution. CFI (.980) and TLI (.968) indicated good model fit and RMSEA (.060) indicated adequate model fit. Eigenvalues of the three factors were larger than one whereas eigenvalues of the fourth to sixteenth factor were lower than one (i.e. ranging between .199 and .785). The items that cluster on the same factor suggest that the first factor (items 1, 2, 3, 8, and 16) reflects symptoms of intrusion, the second factor (items 4, 5, 6, 7, 9, 10, 13, and 14) hypervigilance, and the third factor (items 11, 12, and 15) avoidance.
Table 3 presents model fitting results of the MI analysis across five linguistic groups that was conducted on the 3-factor model for PTSD that resulted from the EFA on the items of the HTQ. In model 1, a multigroup 3-factor model of PTSD was tested. Factor loadings and thresholds were freely estimated in each of the linguistic groups. CFI and TLI indicated good model fit, and RMSEA indicated adequate model fit. Models 1a-1e tested the 3-factor model of PTSD for each of the linguistic groups separately. CFI and RMSEA indicated that the model fit in each of the linguistic groups was adequate to good. TLI indicated that the model fit was good for the Indo-Iranian, Niger-Congo, and Germanic language groups but not for the Semitic and South Slavic language groups, where it deviated substantially from .95. It must be noted that TLI is sensitive to small samples and the actual sample sizes of the individual linguistic groups are relatively small.
Because model fit indices of the multigroup 3-factor model for PTSD representing configural MI (model 1) were adequate to good and model fit indices of the 3-factor model for PTSD in each of the relatively small linguistic groups were mainly adequate, it was concluded that configural invariance holds for the HTQ across five linguistic groups.
Model 2 tested the multigroup 3-factor model of PTSD representing strong MI. Factor loadings and thresholds were constrained to be equal across five linguistic groups. CFI and TLI indicated good model fit and RMSEA indicated adequate model fit. The χ²/df ratio indicated that the fit of model 2 was not worse compared to model 1. The difference in CFI between model 1 and model 2 indicated a worse fit of model 2 compared to model 1.
Factor loadings and thresholds (see Supplementary Tables S1 and S2) were scrutinized to investigate possible differences between linguistic groups. Thresholds appeared to differ substantively between the Germanic language group and the other linguistic groups, indicating that participants in the Germanic language group systematically reported milder symptom severity on the items of the HTQ. No systematic differences with regard to factor loadings appeared. Model 3 tested a multigroup 3-factor model of PTSD representing partial strong MI. Factor loadings and thresholds were constrained to be equal in the Indo-Iranian, Niger-Congo, Semitic, and South Slavic language groups and were freely estimated in the Germanic language group. CFI and TLI indicated good model fit and RMSEA indicated adequate model fit. The χ²/df ratio, as well as the difference in CFI between model 1 and model 3, indicated that the fit of model 3 was not worse than model 1. Therefore, model 3 is preferred over model 1 and model 2 and it can be concluded that partial strong MI holds for the HTQ across linguistic groups. More specifically, strong MI held across the Indo-Iranian, Niger-Congo, Semitic, and South Slavic language groups, but not for the Germanic language group.
Model 4 tested a multigroup 3-factor model of PTSD representing partial strict MI. Factor loadings and thresholds were constrained to be equal, and residual variances were fixed at one in the Indo-Iranian, Niger-Congo, Semitic, and South Slavic language groups. In the Germanic language group, factor loadings and thresholds were freely estimated and residual variances were fixed at one. CFI and TLI indicated good model fit and RMSEA indicated adequate fit. The χ²/df ratio was slightly larger than the cut-off value of 3, indicating that the fit of model 4 was worse compared to model 3. The difference in CFI between model 3 and model 4 indicated that the fit of model 4 was not worse compared to model 3.
Based on the goodness-of-fit indexes model 3 was preferred over model 4. It was therefore concluded that partial strict MI does not hold across the Indo-Iranian, Niger-Congo, Semitic, and South Slavic language groups, and that model 3 representing partial strong MI fit the data best.

EFA and MI analysis of the HSCL-25
EFA on the total sample was first conducted to establish the factor structure of the HSCL-25 in the present sample. Based on model fit and eigenvalues, EFA yielded a 3-factor solution as a good fit for the 25 items of the HSCL-25. CFI (.973), TLI (.964), and RMSEA (.044) indicated good model fit. Eigenvalues of the three factors were larger than one, whereas eigenvalues of the fourth to twenty-fifth factor were lower than one (i.e. ranging between .202 and .968). Table 4 presents the unstandardized Geomin rotated factor loadings and eigenvalues of the 2-factor and 3-factor solution. With regard to the 3-factor solution, it can be seen that items clustering on the first and second factor highly overlap, indicating that the first and second factor are insufficiently distinctive. Therefore, the 2-factor solution was preferred over the 3-factor solution. CFI (.942), TLI (.931), and RMSEA (.062) indicated adequate model fit for the 2-factor solution, and eigenvalues were larger than one. A low factor loading (λ = .28) was observed for item 8 ('Headaches') on the first and second factor, indicating that this item does not add substantively to either factor. EFA was therefore rerun without item 8 and yielded a 2-factor solution with adequate fit (CFI = .944, TLI = .933, RMSEA = .062) and eigenvalues larger than one for the 24 remaining items of the HSCL-25. This model was selected as the best model. The items that cluster on the same factor suggest that the first factor (items 1-7 and 9-10) reflects symptoms of anxiety and the second factor (items 11-25) represents symptoms of depression.
Table 5 presents model fitting results of the MI analysis across four linguistic groups that was conducted on the 2-factor model of anxiety and depression that resulted from the EFA on the items of the HSCL-25. In model 1, a multigroup 2-factor model of anxiety and depression was tested. Factor loadings and thresholds were freely estimated in each of the linguistic groups. CFI, TLI, and RMSEA indicated adequate model fit. In models 1a-1d, the 2-factor model of anxiety and depression was tested for each of the linguistic groups separately. All fit indices indicated adequate model fit in each of the subsamples. Based on the model fitting results of model 1 and models 1a-1d, it can be concluded that configural invariance holds for the HSCL-25 across four linguistic groups. Model 2 tested the multigroup 2-factor model of anxiety and depression representing strong MI. Factor loadings and thresholds were constrained to be equal across groups. CFI, TLI, and RMSEA indicated adequate model fit. The χ²/df ratio indicated that the fit of model 2 was not worse compared to model 1. The difference in CFI between model 1 and model 2 indicated a worse fit of model 2 compared to model 1.
Note. Best fitting model is printed in bold; vs. = versus; χ², df = chi-square test statistic and degrees of freedom for model; Δχ², Δdf = chi-square test statistic and degrees of freedom for chi-square difference test between two nested models; χ²/df = ratio between χ² and degrees of freedom with regard to the chi-square difference test; ΔCFI = difference in CFI value between two nested models.
Factor loadings and thresholds (see Supplementary Tables S3 and S4) were subsequently scrutinized to investigate possible differences between linguistic groups. No systematic differences with regard to factor loadings and thresholds were observed. Differences between linguistic groups were generally small, with the exception of factor loadings and thresholds regarding item 4 ('Nervousness or shakiness inside'). Model 3 tested a multigroup 2-factor model of anxiety and depression representing partial strong MI. In this model, factor loadings and thresholds with regard to item 4 were freely estimated across linguistic groups whereas all other factor loadings and thresholds were constrained to be equal. CFI, TLI, and RMSEA indicated adequate model fit. The χ²/df ratio, as well as the difference in CFI between model 1 and model 3, indicated that the fit of model 3 was not worse than model 1. Therefore, model 3 is preferred over model 1 and model 2 and it can be concluded that partial strong MI holds for the HSCL-25 across linguistic groups.

Overview
This study investigated the factor structure and MI of two widely used instruments for the assessment of PTSD symptoms (HTQ) and symptoms of anxiety and depression (HSCL-25) among Dutch and refugee patients with different linguistic backgrounds. EFA yielded a 3- and a 2-factor structure for the items of the HTQ and the HSCL-25, respectively. In addition, MI analyses on the HTQ showed strong MI across the groups of refugee patients with Indo-Iranian, Niger-Congo, Semitic, and South Slavic language backgrounds. Strong MI could, however, not be demonstrated across the group of Dutch patients with a Germanic language background and the groups of refugee patients with non-western language backgrounds. MI analyses on the HSCL-25 indicated partial strong MI across the four non-western linguistic groups of refugee patients.
Table 4. Geomin rotated factor loadings and eigenvalues of the 3- and 2-factor model of the HSCL-25 as estimated by EFA.

Factor structure and MI of the HTQ
Armour and colleagues (2016) stated that consensus regarding the exact number and nature of factors is yet to be reached, despite numerous studies on the factor structure of PTSD. We found a 3-factor solution in which the HTQ items on intrusion symptoms were represented by the first factor, those on hypervigilance symptoms by the second factor, and those on avoidance symptoms by the third factor, in line with the DSM-IV criteria of PTSD. Armour and colleagues (2016) showed that 4-factor models received substantial support (e.g. Elklit & Shevlin, 2007; Palmieri, Marshall, & Schell, 2007), but that the 5-factor dysphoric arousal model demonstrated the best fit (e.g. Charak et al., 2014). Thus, in contrast to our results, most studies provided evidence for the recently proposed DSM-5 PTSD model (e.g. Fodor et al., 2015; Vindbjerg, Carlsson, Mortensen, Elklit, & Makransky, 2016).
In MI analyses, we showed that posttraumatic stress as measured by the HTQ is conceptualized in terms of symptoms of intrusion, hypervigilance, and avoidance by Dutch patients with a Germanic language background as well as by refugee patients with Indo-Iranian, Niger-Congo, Semitic, and South Slavic language backgrounds (i.e. configural invariance). Dutch patients reported milder symptom severity on most items of the HTQ. This result is consistent with previous findings that immigrants tend to report higher levels of complaints on questionnaires than the dominant group in the host country (He & Van De Vijver, 2013; Morren, Gelissen, & Vermunt, 2012). Differences in observed scale scores between Dutch patients and refugee patients with non-western language backgrounds either reflect measurement bias rather than true underlying differences in PTSD symptom severity (Meredith, 1993; Van de Schoot, Lugtig, & Hox, 2012) or reflect the notion that refugees score higher on PTSD as a result of experiencing more traumatic events (e.g. de Jong et al., 2001). We conclude that it is advisable to develop differentiated cut-off scores with regard to the HTQ for patients with a western language background and for refugee patient groups with non-western language backgrounds.
In contrast, strong MI of the HTQ was demonstrated across the groups of refugee patients with Indo-Iranian, Niger-Congo, Semitic, and South Slavic language backgrounds. This means that the items of the HTQ as well as the concepts they are measuring (i.e. the PTSD symptom dimensions of intrusion, hypervigilance, and avoidance) are interpreted in the same way by refugee patients with different non-western linguistic backgrounds (Horn & McArdle, 1992; Van de Schoot, Lugtig, & Hox, 2012). Therefore, meaningful comparisons of observed PTSD scale scores on the HTQ between refugee patients with different non-western linguistic backgrounds can be made. Likewise, the use of a single PTSD cut-off score with regard to the HTQ in groups of refugee patients with different non-western linguistic backgrounds seems feasible.
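The levels of invariance referred to in this section differ in which item parameters are held equal across groups. The following sketch encodes the standard textbook hierarchy for ordinal items, in which thresholds take the role of intercepts. It is an illustration under stated assumptions and does not reproduce the exact model specification of this study; `Invariance`, `equality_constraints`, and the parameter labels are hypothetical names.

```python
from enum import IntEnum
from typing import Dict, FrozenSet, Iterable


class Invariance(IntEnum):
    CONFIGURAL = 0  # same factor pattern, all item parameters free
    WEAK = 1        # equal factor loadings
    STRONG = 2      # equal factor loadings and thresholds
    STRICT = 3      # additionally equal residual variances


# Parameter types constrained to equality across groups at each level.
_CONSTRAINED = {
    Invariance.CONFIGURAL: frozenset(),
    Invariance.WEAK: frozenset({"loading"}),
    Invariance.STRONG: frozenset({"loading", "threshold"}),
    Invariance.STRICT: frozenset({"loading", "threshold", "residual_variance"}),
}


def equality_constraints(level: Invariance,
                         items: Iterable[int],
                         free_items: Iterable[int] = ()) -> Dict[int, FrozenSet[str]]:
    """Map each item number to the parameter types held equal across groups.

    Items listed in `free_items` are estimated freely in every group, which
    turns a full invariance model into a partial one.
    """
    free = set(free_items)
    return {item: (frozenset() if item in free else _CONSTRAINED[level])
            for item in items}
```

For example, `equality_constraints(Invariance.STRONG, range(1, 17))` describes full strong MI for the 16 HTQ items, whereas passing `range(1, 26)` with `free_items={4}` describes a partial strong MI model for the HSCL-25 in which item 4 is freed.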

Factor structure and MI of the HSCL-25
According to Al-Turkait and colleagues (2011), most scholars found evidence for the 2-factor model and the 3-factor model of the HSCL-25. The 2-factor model comprises symptoms specific to anxiety and symptoms specific to depression, and the 3-factor model additionally distinguishes nonspecific symptoms of general distress which the two disorders share (Al-Turkait et al., 2011). Glaesmer and colleagues (2013) concluded that, because of the high intercorrelations of the factors of the tripartite model, the 2-factor model is the preferable factor solution. Similarly, we found that the HSCL-25 was represented best by a 2-factor model comprising symptoms of anxiety and symptoms of depression. Although research showed that headaches are usually part of the anxiety scale (e.g. Al-Turkait et al., 2011; Glaesmer et al., 2013) or at least coincide with depression and anxiety (Juang, Wang, Fuh, Lu, & Su, 2000; Zwart et al., 2003), in our non-western refugee groups headache was part of neither the depression nor the anxiety scale. This suggests that among non-western refugee groups headache is not part of depression or anxiety.
In addition, MI analyses indicated that anxiety and depression items and the underlying constructs as measured with the HSCL-25 are interpreted in the same way by refugee patients with different non-western linguistic backgrounds, with the exception of one item of the anxiety construct (i.e. Nervousness or shakiness inside), to which the groups appeared to respond differently. Cross-group comparisons of observed anxiety scores are only meaningful when the non-invariant item is discarded. Yet, the question remains whether the commonly used cut-off score for anxiety with regard to the HSCL-25 applies to this shortened scale.
Previous studies have focused on configural invariance to examine whether screening outcomes can be compared across linguistic or cultural groups (e.g. Fodor et al., 2015), providing evidence for the conceptual similarity of mental health concepts. We conclude, in line with these previous findings, that depression, anxiety, and posttraumatic stress are conceptually similar across our groups under study. The strength of our study is that it is one of the very few that went beyond this traditional and common question of configural invariance and also examined strong and strict MI (see also Rasmussen et al., 2015). Based on our findings we conclude that mental health questionnaires do indeed help clinicians in their fundamental task to target symptoms and assess treatment outcomes among refugees. Since our results suggested that PTSD, anxiety, and depression are conceptualized in a similar way by groups of refugees with different non-western linguistic backgrounds, and that they interpret items with regard to these concepts in the same way, it can be concluded that local idioms of distress and inherent cultural response patterns may not play a major role when using the HTQ and the HSCL-25. Future studies need to examine whether the commonly used cut-off scores with regard to both questionnaires apply for refugee patients with non-western linguistic backgrounds. We add that ideally one must carefully make an inventory of the expression of distress in other languages before one can conclude that the way people perceive their problem may or may not overlap with DSM-IV or DSM-5 categories. Local categories of emotional distress help place the instruments within their proper cultural context (Bolton & Tang, 2002; de Jong, 2002).
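Discarding the non-invariant item before comparing observed anxiety scores could look as follows. This is a minimal sketch under assumptions: it takes items 1 to 10 as the conventional HSCL-25 anxiety subscale and scores the scale as the mean item response, whereas the factor solution of this study assigned items somewhat differently (headache loaded on neither factor). The function name and constants are hypothetical, and a different item set can be passed in.

```python
from statistics import mean
from typing import Iterable, Mapping

# Assumption: conventional HSCL-25 anxiety subscale (items 1-10).
ANXIETY_ITEMS = tuple(range(1, 11))
# Item 4 ("Nervousness or shakiness inside") was non-invariant across groups.
NON_INVARIANT_ITEMS = frozenset({4})


def anxiety_score(responses: Mapping[int, int],
                  items: Iterable[int] = ANXIETY_ITEMS,
                  drop_non_invariant: bool = True) -> float:
    """Mean item score on the anxiety scale, optionally without item 4.

    `responses` maps item numbers to the chosen response category.
    """
    used = [i for i in items
            if not (drop_non_invariant and i in NON_INVARIANT_ITEMS)]
    missing = [i for i in used if i not in responses]
    if missing:
        raise ValueError(f"missing responses for items: {missing}")
    return float(mean(responses[i] for i in used))
```

Because the shortened scale contains nine instead of ten items, a cut-off derived for the full scale does not automatically carry over to scores computed this way.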

Limitations
One limitation is that we did not administer the HSCL-25 to Dutch patients. Consequently, we could not compare the non-western groups of refugees with a western group on this instrument, even though our analyses of the HTQ showed different thresholds for the Dutch group compared to the non-western groups of refugees. Another limitation is that the linguistic groups differ in sample size, which may have biased the outcomes of our multigroup CFA. A simulation study by Meade and Bauer (2007) indicated that the precision of estimated factor loading differences is high for sample sizes of 400, but varied somewhat by condition at sample sizes of 100 and 200. Since the sample sizes of all linguistic groups were smaller than 400, this may have biased our outcomes.
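The sample-size concern can be made concrete with the group sizes reported in the abstract. The sketch below simply lists the groups that fall below a reference size; the function name is hypothetical, and the default of 400 is taken from the Meade and Bauer (2007) condition mentioned above rather than from a formal power analysis.

```python
from typing import List, Mapping

# Group sizes as reported in the abstract of this study.
GROUP_SIZES = {
    "Indo-Iranian": 262,
    "Niger-Congo": 134,
    "Semitic": 288,
    "South Slavic": 199,
    "Germanic (Dutch)": 373,
}


def groups_below(sizes: Mapping[str, int], reference_n: int = 400) -> List[str]:
    """Return group names with fewer than `reference_n` participants,
    ordered from the smallest to the largest group."""
    small = [name for name, n in sizes.items() if n < reference_n]
    return sorted(small, key=lambda name: sizes[name])
```

Calling `groups_below(GROUP_SIZES)` returns all five groups, with the Niger-Congo group (n = 134) first, which is the group closest to the sample sizes at which precision varied in the simulation study.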

Conclusion
Because of the large number of refugees who currently cross the European borders (UNHCR, 2015), most of whom are severely traumatized, there is a need to detect those who suffer from psychological complaints in order to meet their mental health needs. Our study results indicate that the HTQ and the HSCL-25 can be useful in this respect. They can be applied in non-western refugee patient populations. Local idioms of distress and response patterns may not play a major role when using the HTQ and the HSCL-25 among non-western refugee patients.
Future studies need to examine whether the commonly used cut-off scores with regard to both questionnaires need to be reconsidered for refugee patients with non-western linguistic backgrounds. Although meaningful comparisons of observed PTSD and depression scores between groups of refugee patients with different non-western linguistic backgrounds are feasible, comparisons between patients with a western and a non-western linguistic background, as well as comparisons of anxiety scores, are likely to be biased.
This study is one of the few to test different levels of MI and provide evidence for partial MI of mental health questionnaires among non-western refugees, yielding a differentiated answer to the question of the construct validity of mental health concepts among refugees. As such, this research is an invitation, and perhaps a roadmap, for future researchers to further test these findings.

Highlights
• We conclude that mental health constructs of PTSD, anxiety, and depression, as measured by the HTQ and the HSCL-25, are to a large extent interpreted in a similar way by refugee patients.
• Local idioms of distress and inherent response patterns may not play a major role when applying the HTQ and the HSCL-25 in non-western refugee patient populations.
• Our study is one of the few to provide evidence for (partial) strong measurement invariance of mental health screeners among refugees.

Disclosure statement
No potential conflict of interest was reported by the authors.