Does Acquiescence Disagree with Measurement Invariance Testing?

Measurement invariance (MI) is required for validly comparing latent constructs measured by multiple ordinal self-report items. Non-invariances may occur when disregarding (group differences in) an acquiescence response style (ARS; an agreeing tendency regardless of item content). If non-invariance results solely from neglecting ARS, one should not worry about scale inequivalences but model the ARS instead. In a simulation study, we investigated the effect of ARS on MI testing, both when including ARS as a factor in the measurement model and when not. For (semi-)balanced scales, disregarding a large ARS resulted in non-invariance already at the configural level. This was resolved by including an ARS factor for all groups. For unbalanced scales, disregarding ARS did not affect MI testing, and including an ARS factor often resulted in non-convergence. Implications and recommendations for applied research are discussed.


Introduction
Social and behavioral scientists are often interested in assessing whether groups of individuals differ regarding latent constructs (e.g., extraversion). These unobservable constructs are often measured by self-report scales. Commonly, these scales consist of questionnaire items, where, for each item, respondents rate their level of agreement by selecting one of a few ordered response options on a Likert scale (e.g., "disagree", "neutral", "agree").
To validly draw conclusions about group differences on latent constructs, scales must function equivalently across groups. Measurement invariance (MI) testing evaluates the tenability of this hypothesis by assessing whether the measurement model (MM) of the psychological construct is equivalent across groups. As an example of an inequivalence, one may think of differences in item interpretations that may lead one group to systematically pick lower/higher response options for some items, which can result in under/overestimation of sum-scores (Jeong & Lee, 2019), item means (Jones & Gallo, 2002), and regression parameters in structural equation models (Guenole & Brown, 2014). Thus, testing for MI is an essential precursor to investigating group differences (Borsboom, 2006; Meredith & Teresi, 2006) to avoid building on latent construct differences that are purely due to measurement discrepancies and thus invalid or "biased".
Measurement invariance is often tested with a latent variable approach, which models the relationship between unobserved psychological constructs (i.e., latent variables) and observable behaviors (i.e., items). Within the latent variable modeling framework, multiple group categorical confirmatory factor analysis (MG-CCFA) and multiple group item response theory (MG-IRT) are the most popular approaches to evaluate MI for models with ordinal data (i.e., items with too few response categories to treat them as continuous). Note that while equivalences can be drawn between MG-CCFA and MG-IRT models (Chang et al., 2017), some differences remain in the way MI is tested within each of these two approaches (D'Urso et al., 2022).
For instance, MG-CCFA primarily focuses on assessing MI at the scale level (i.e., for the complete set of items measuring a construct), whereas MG-IRT traditionally tests MI for each item separately. In this paper, we focus on an MG-CCFA-based MI testing approach (i.e., scale level) since it is more commonly used in practice (Putnick & Bornstein, 2016). Specifically, in an MG-CCFA-based MI testing approach, different levels of MI are assessed in a step-wise procedure. For each step, increasingly restrictive models are estimated by imposing equality constraints on specific MM parameters. Then, the fit of the more constrained model to the data is compared to that of the less constrained one to evaluate whether the equality constraints worsen the model fit significantly, thus indicating non-invariance of (at least some of) the constrained MM parameters.
Apart from measurement model differences or "non-invariances" such as differential item interpretations, bias in latent variable comparisons may also arise when responses to self-report item rating scales are affected by response tendencies or response styles in some groups but not in others (Cheung & Rensvold, 2000). Acquiescence, or agreeing, response style (ARS) is a well-known example, which represents a tendency to agree with items regardless of their content (Paulhus, 1991). Interestingly, various studies have indicated that different groups of individuals may have a more or less pronounced ARS depending on their education (Meisenberg & Williams, 2008), age (Weijters et al., 2010), gender (Austin et al., 2006), length of employment (Johnson et al., 2005) or culture (Bachman & O'Malley, 1984; Marin et al., 1992). ARS can inflate observed means (Van Vaerenbergh & Thomas, 2013) and affect the measurement model by introducing an additional factor (Billiet & McClendon, 2000; D'Urso et al., 2023) or changing the strength of the relationships between items and factors (i.e., factor loadings; Ferrando & Lorenzo-Seva, 2010). To control for ARS, previous research indicated that including ARS as an additional factor in the MM (Billiet & McClendon, 2000) effectively reduces bias in the recovery of MM parameters as well as in estimated factor scores (Savalei & Falk, 2014).
Though it is confirmed that not taking ARS into account affects the MM in single group studies, extensive investigations about the impact of disregarding ARS on MI testing are currently lacking. Existing studies in the literature have either focused on assessing the effect of other response styles (RSs) on MI, such as extreme response style (ERS; Liu et al., 2017) and non-effortful responding (NER; Arias et al., 2020; Rios, 2021), or on evaluating the impact of including an additional ARS factor when assessing MI using empirical data (Aichholzer, 2015; Welkenhuysen-Gybels et al., 2003). In the case of empirical data, however, the "true" MM is unknown. Thus, in this paper, we thoroughly assess the effects of ARS on MI testing in a simulation study. Identifying in which conditions and to what extent ARS distorts measurement invariance conclusions may give us clues on correcting for bias in latent mean differences due to differential response tendencies across groups. For instance, when the influence of ARS is disregarded, researchers may wrongly conclude that there is non-invariance. Furthermore, this may introduce bias in the latent means of the groups and thus mislead conclusions about between-group differences in latent means. Taking ARS into account when testing for MI likely facilitates distinguishing non-invariance of the scale itself from (amendable) non-invariance due to disregarding ARS. Indeed, if non-invariance results from not taking ARS into account, one needs to correct for the ARS instead of worrying about the scale being inequivalent across groups. In addition to evaluating the effect of ARS on multigroup factor models rather than single-group ones, we expand the existing literature (e.g., Savalei & Falk, 2014) by (i) evaluating models for ordinal data, (ii) including multi-dimensional factor models, and (iii) considering balanced, semi-balanced and unbalanced scales.
The remainder of this paper proceeds as follows: In Section 2, we elaborate on MG-CCFA, MI testing, and how it may be affected by ARS. Then, in Section 3, we present a simulation study that evaluates the effect of ARS on MI both when (i) ARS is disregarded, and (ii) ARS is taken into account by including ARS as an additional factor in the MM. Finally, in Section 4, we discuss recommendations based on the simulation study results, limitations of our investigation, and potential future research directions.

Measurement Invariance Testing and the Potential Effects of ARS
In this section, we introduce MG-CCFA, describe the standard MI testing framework including the identification constraints proposed by Wu and Estabrook (2016) to assess MI with ordinal data, and discuss the potential effects of ARS on MI testing.

Multiple Group Categorical Confirmatory Factor Analysis
Consider data on $J$ items for $N$ subjects, and a grouping variable (e.g., nationality) that divides the $N$ subjects into $G$ groups. Let $x_j$ be the polytomously scored response on item $j$, which can take on $C$ possible values $c \in \{0, 1, 2, \ldots, C-1\}$. MG-CCFA assumes that each of the $C$ possible observed responses is obtained by discretizing a continuous unobserved response variable $x_j^*$ through a set of threshold parameters $\tau_{j,c}^{(g)}$, which indicate the cut-off points between the response categories (e.g., the division between scoring a 1 or a 2) for group $g$. Note that the first and last thresholds are defined as $\tau_{j,0}^{(g)} = -\infty$ and $\tau_{j,C}^{(g)} = +\infty$, respectively. Then, formally,
$$x_j = c \quad \text{if} \quad \tau_{j,c}^{(g)} < x_j^* \leq \tau_{j,c+1}^{(g)}, \qquad c = 0, 1, \ldots, C-1. \tag{1}$$
A factor-analytical model for the vector of latent response variables $x^* = (x_1^*, \ldots, x_J^*)'$ is obtained as:
$$x^* = \nu^{(g)} + \Lambda^{(g)} \eta^{(g)} + \varepsilon^{(g)}, \tag{2}$$
where $\nu^{(g)}$ is a $J$-dimensional vector of latent intercepts (i.e., intercepts of the unobserved response variables in $x^*$), $\Lambda^{(g)}$ is a $J \times Q$ matrix of factor loadings, $\eta^{(g)}$ is a $Q$-dimensional vector of scores on the $Q$ factors, and $\varepsilon^{(g)}$ is a $J$-dimensional vector of residuals. Note that the latent intercepts $\nu_j^{(g)}$, thresholds $\tau_{j,c}^{(g)}$ and loadings $\lambda_j^{(g)}$ in $\Lambda^{(g)}$ are group specific, and that, within each group $g$, the factors $\eta$ and the item-specific residual components $\varepsilon$ are mutually independent and normally distributed, with:
$$\eta^{(g)} \sim MVN\left(\kappa^{(g)}, \Phi^{(g)}\right)$$
and
$$\varepsilon^{(g)} \sim MVN\left(0, \Psi^{(g)}\right),$$
where $\kappa^{(g)}$ are the group-specific factor means, $\Phi^{(g)}$ is the group-specific factor variance-covariance matrix, and $\Psi^{(g)}$ is a diagonal matrix containing the group-specific unique variances of the items. Further, within each group, the model-implied mean vector $\mu$ and covariance matrix $\Sigma$ are obtained as:
$$\mu^{(g)} = \nu^{(g)} + \Lambda^{(g)} \kappa^{(g)}, \qquad \Sigma^{(g)} = \Lambda^{(g)} \Phi^{(g)} \Lambda^{(g)\prime} + \Psi^{(g)}.$$

Measurement Invariance Testing Procedure
In MG-CCFA, MI is commonly evaluated by testing, for all items, equality of a set of MM parameters (e.g., loadings) across groups in a step-wise fashion. The starting point is to identify the MG-CCFA model, which requires: (i) setting the scale for the latent variable $\eta$, (ii) setting the scale for the unobserved response variable $x_j^*$, and (iii) aligning the scale of the latent variable $\eta$ across groups. Note that the latter is necessary to make the groups comparable. Then, in addition to the identification constraints, equivalence constraints are imposed on MM parameters (e.g., thresholds) in a step-wise fashion to evaluate their invariance. Therefore, a new, more constrained model is estimated for each step, and its fit to the data is evaluated to conclude whether these new constraints significantly worsen the fit. Below, we first discuss the main steps and identification constraints to test MI for ordinal data following the recommendations by Wu and Estabrook (2016), summarized in Table 1, and then elaborate on standard goodness-of-fit criteria that are used to draw MI conclusions.

Configural Invariance
Configural invariance is usually the first invariance level tested, where the goal is to test the equivalence of the number of factors and of the loadings pattern (i.e., which factors are measured by which items) across groups. In this step, following Wu and Estabrook (2016), the baseline model is identified by fixing, for all groups, the latent intercepts $\nu$ to 0 and the latent response variances (i.e., diagonal elements of $\Sigma$) to 1, which is commonly known as the delta parameterization (Muthén & Muthén, 2009). Similarly, the latent factor means $\kappa^{(g)}$ and variances $\phi^{(g)}$ (i.e., diagonal elements of $\Phi^{(g)}$) are also fixed to 0 and 1, respectively.
After specifying and estimating this factor model for all groups, conclusions on configural invariance are drawn following the examination of goodness-of-fit measures. If supported, configural equivalence indicates that the shape of the model (i.e., number of factors and pattern of zero and non-zero loadings) is the same across groups.

Thresholds Invariance
If configural invariance holds, the invariance of thresholds is tested next. Here, the baseline model is identified by setting, for all groups, the latent content factor means $\kappa^{(g)}$ and variances $\phi^{(g)}$ to 0 and 1, respectively. Additionally, for the reference group $r$, the vector of latent intercepts $\nu^{(r)}$ and the latent response variable variances in $\Sigma^{(r)}$ are set to 0 and 1, respectively. On top of these identification constraints, the thresholds $\tau_{j,c}$ are equated across groups and, after model estimation, the hypothesis of thresholds invariance is evaluated by assessing the change in model fit between the configural model and the thresholds invariant model.

Loadings Invariance
If thresholds invariance holds, invariance of loadings is assessed. To identify the baseline model, for all groups, the latent content factor means $\kappa^{(g)}$ are set to 0, while, for the reference group $r$, the factor variances $\phi^{(r)}$ are set to 1, the latent intercepts $\nu^{(r)}$ to 0 and the variances in $\Sigma^{(r)}$ to 1. In addition to these identification constraints, both thresholds $\tau_{j,c}$ and loadings $\Lambda$ are constrained to be equal across groups. Again, the model is estimated and the hypothesis of loadings invariance is evaluated by assessing the change in model fit between the thresholds invariant model and the loadings invariant model. Note that, if the hypothesis of thresholds and loadings invariance holds, factor variances can be validly compared across groups.

Intercepts Invariance
Finally, if loadings invariance holds, invariance of latent intercepts is assessed. To identify the baseline model, for the reference group, the latent content factor means $\kappa^{(r)}$ and variances $\phi^{(r)}$ are set to 0 and 1, respectively. Additionally, building on the previous equality constraints on thresholds and loadings, the latent intercepts $\nu$ are set to 0 and equated across groups. To assess the hypothesis of latent intercepts invariance, the model is estimated and its fit is compared to the loadings invariant model. Following non-rejection of latent intercepts invariance, the factor means can be validly compared across groups.
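The step-wise logic described in the preceding subsections can be summarized in a short Python sketch. It is only an illustration of the ordering of the steps and of which parameters are equated across groups at each step; the identification constraints are omitted, and the names `MI_STEPS` and `highest_supported_level` are hypothetical helpers introduced here, not part of any package.

```python
# Ordered MI steps and the parameters equated across groups at each step,
# as described in the text (identification constraints are not encoded here).
MI_STEPS = [
    ("configural", []),
    ("thresholds", ["thresholds"]),
    ("loadings", ["thresholds", "loadings"]),
    ("intercepts", ["thresholds", "loadings", "intercepts"]),
]


def highest_supported_level(step_holds):
    """Return the last step that holds before the first rejection, or None.

    `step_holds` maps a step name to True (invariance not rejected) or False.
    Testing stops at the first rejected step, mirroring the step-wise procedure.
    """
    supported = None
    for name, _equated in MI_STEPS:
        if not step_holds.get(name, False):
            break
        supported = name
    return supported


# Hypothetical outcome in which loadings invariance is rejected.
outcome = {"configural": True, "thresholds": True, "loadings": False, "intercepts": True}
print(highest_supported_level(outcome))  # thresholds
```

In this hypothetical outcome, the later intercepts result is ignored because the procedure does not proceed past a rejected step.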

Criteria to Assess Model Fit
Goodness-of-fit indices are commonly used as criteria to assess the tenability of MI hypotheses. This commonly entails evaluating the fit of the baseline model (i.e., configural model) and then the change in fit for the more restrictive models. To decide whether the (change in) fit supports a certain level of invariance (e.g., thresholds), various criteria are inspected, each with its own proposed cut-off value determined via extensive simulation studies. Classically, only the chi-squared ($\chi^2$) test was used as a criterion to assess the significance of change for two nested models (Putnick & Bornstein, 2016), but multiple studies have shown that relying solely on this statistic is sub-optimal because it is extremely sensitive to negligible MM differences in large samples (Bentler, 1990; French & Finch, 2006, 2008). Therefore, in practice, MI decisions are based on multiple criteria (Putnick & Bornstein, 2016), and, among them, two of the most commonly used are the root mean square error of approximation (RMSEA; Browne & Cudeck, 1993) and the comparative fit index (CFI; Bentler, 1990). Configural invariance is concluded if RMSEA $\leq$ 0.06 and/or CFI $\geq$ 0.95 (Brown, 2015). For the more restrictive models, the change in fit (e.g., $\Delta$RMSEA) is assessed to conclude whether the additional constraints worsen the fit significantly. Cheung and Rensvold (2002)
Note. LV: latent variable; LRV: latent response variable.
STRUCTURAL EQUATION MODELING: A MULTIDISCIPLINARY JOURNAL

From Single Group to Multiple Groups: The Potential Effects of (Not) Correcting for ARS
In the literature, the bias resulting from disregarding ARS for single-group analyses is well-known, but it is not yet clear to what extent it may generalize to multiple-group analyses, such as MG-CCFA. Response tendencies, such as ARS, represent sources of systematic response bias that may or may not appear as violations of measurement invariance (i.e., measurement non-invariance). In fact, ARS is often viewed as a factor with weak to moderate loadings (Danner et al., 2015; Ferrando et al., 2004), which may be insufficient to result in significant violations of MI (i.e., rejection of MI). When ARS affects individuals' responses in one of the groups, not taking into account this tendency towards acquiescence likely results in systematic differences in the responses across groups that are not purely due to the intended-to-be-measured (i.e., content) factors and, eventually, may lead to the rejection of measurement invariance. For instance, in single group studies, one well-known consequence of not accounting for ARS is that it may result in an additional factor (Billiet & McClendon, 2000; D'Urso et al., 2023). Therefore, it is reasonable to expect that researchers unaware of the (potential) influence of ARS would disregard it and reject configural invariance, which would lead them to conclude that the content factor(s) cannot be validly compared across groups since they (seem to) qualitatively differ. Additionally, single group studies showed that ARS can bias item (latent) intercepts (Cheung & Rensvold, 2000) and factor loadings (D'Urso et al., 2023; Ferrando & Lorenzo-Seva, 2010; Savalei & Falk, 2014). Again, neglecting this agreeing response tendency may result in non-equivalence (i.e., non-invariance) of intercepts and/or loadings, and lead researchers to conclude that the MM is non-invariant and, potentially, to allow some parameters to freely vary across groups (i.e., partial invariance) to reach an acceptable level of invariance before investigating differences in the content factor(s). Finally, even if invariance is tenable, ARS may still bias latent mean differences and thus lead researchers to conclude that the mean of the targeted latent variable differs across groups, while this may be a byproduct of neglecting an agreeing tendency in one of the groups.
Another important aspect to consider is that the performance of psychometric approaches developed to correct for ARS has not been thoroughly investigated in the context of MI testing. Savalei and Falk (2014) discussed some of the main factor-analytical approaches to correct for ARS and their underlying assumptions, and compared their performance for single group analyses through a simulation study. The results showed that the classical CFA-based approach (Billiet & McClendon, 2000), where ARS is specified as an additional factor orthogonal to the content factor(s) with all loadings set to 1 (i.e., the influence of ARS does not vary across items), outperformed the remaining ones (see footnote 1), even when some of its main assumptions (e.g., equal ARS loadings) are violated. Thus, based on the authors' results and recommendations, and on its straightforward implementation in MG-CCFA, we will mainly focus on this CFA-based approach in the remainder of this paper. Specifically, following this CFA-based approach, an additional ARS factor is added to the MM in all groups and all factor loadings on this additional factor are fixed to 1, which allows the ARS factor variance to be freely estimated for all groups. Then, between-group differences in the amount of ARS are captured by differences in the ARS factor means, and within-group differences in the strength of ARS are captured by the ARS factor variances.
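To make the CFA-based specification concrete, the factor model can be written out with the extra ARS factor. The block below is a sketch under the assumptions stated in the text (ARS factor orthogonal to the content factors, all ARS loadings fixed to 1). The symbols $\eta_{ARS}$, $\kappa_{ARS}$ and $\phi_{ARS}$ (ARS factor score, mean and variance) and $\mathbf{1}_J$ (a $J$-vector of ones) are notation introduced here for illustration; the remaining symbols denote the latent intercepts ($\nu$), content loadings ($\Lambda$), content factor scores, means and covariance matrix ($\eta$, $\kappa$, $\Phi$), residuals ($\varepsilon$) and unique variances ($\Psi$) of the MG-CCFA model.

```latex
\begin{aligned}
x^{*} &= \nu^{(g)} + \Lambda^{(g)} \eta^{(g)} + \mathbf{1}_J \, \eta_{ARS}^{(g)} + \varepsilon^{(g)}, \\
\mu^{(g)} &= \nu^{(g)} + \Lambda^{(g)} \kappa^{(g)} + \mathbf{1}_J \, \kappa_{ARS}^{(g)}, \\
\Sigma^{(g)} &= \Lambda^{(g)} \Phi^{(g)} \Lambda^{(g)\prime} + \phi_{ARS}^{(g)} \, \mathbf{1}_J \mathbf{1}_J^{\prime} + \Psi^{(g)}.
\end{aligned}
```

Under these assumptions, the ARS factor adds the same constant $\phi_{ARS}^{(g)}$ to every element of the implied covariance matrix and the same shift $\kappa_{ARS}^{(g)}$ to every implied mean, regardless of item keying.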

Simulation Study
To assess the effect of ARS on MI testing, both when including ARS as an additional factor in the measurement model (MM) and when not including it, we conducted a simulation study where individual responses in one group were affected by an ARS. Our goal is to solely focus on whether the bias introduced by disregarding ARS results in measurement non-invariance, and whether this is rectified by including an additional ARS factor in the MM for all groups. Therefore, we did not simulate other sources of non-invariance (e.g., differences in factor loadings). Furthermore, a null scenario was simulated, where invariance holds and ARS is not at play in either group, which only served as a comparison for evaluating the performance of MG-CCFA approaches. Note that we report these latter results in the Online Supplementary (Tables A1-A3).
1 The other approaches discussed by Savalei and Falk (2014) are the Chan and Bentler (1993) approach and the EFA-based approach (Ferrando et al., 2004). In the former, data must first be mean-centered within person (i.e., ipsatized). Then, a residual structure must be specified by adding a linear combination of the original residual components for each of the ipsatized variables. In the latter, an additional factor is first extracted, and then a rotation is performed to a partially specified target, which allows estimating both content and ARS factor loadings.
The following factors were manipulated:
• The number of subjects N within each group, at 2 levels: 250, 1000;
• The type of scale, at 3 levels: balanced, semi-balanced and unbalanced;
• The number of content factors Q, at 2 levels: 1, 2;
• The scale length (i.e., total number of items), at 2 levels: 12, 24;
• The overall strength of the ARS factor, at 2 levels: medium and large;
• The difference in strength of the ARS across items, at 2 levels: equal, unequal.
For the minimum sample size within each group, we followed the recommendations from previous research, which indicated that for obtaining precise factor loading estimates a sample size of 250 is sufficient when item communalities are moderate (Fabrigar et al., 1999; MacCallum et al., 1999). Furthermore, we varied (a) the number of factors to simulate both unidimensional and multidimensional scales, (b) the total number of items to simulate scales that measure the psychological construct to a varying degree of accuracy, and (c) the type of scale (e.g., semi-balanced) to emulate scales that allow disentangling the content factor(s) from the ARS factor to a different extent (de la Fuente & Abad, 2020; Savalei & Falk, 2014). Negatively keyed items may be more difficult to understand for some groups and thus elicit agreeing responses more than positively keyed items. To simulate this, we include conditions where the ARS loading size was equal across all items (i.e., "equal ARS") as well as conditions where, for the (semi-)balanced scales, negatively keyed items had larger loadings on the ARS factor compared to positively keyed ones (i.e., "unequal ARS"). For unbalanced scales, half of the items had larger loadings on the ARS factor in the "unequal ARS" conditions.
In terms of the performance of MI testing, we hypothesize the following: violations of MI (i.e., non-invariance) will likely be detected when ARS is large and ignored (i.e., not included as an additional factor for both groups). Specifically, we expect that, for balanced and semi-balanced scales, disregarding a large ARS will result in non-invariance at all levels. For unbalanced scales, violations of MI may not be detected since ARS will likely affect structural rather than measurement parameters, like the covariance among content factors in case of a multidimensional scale (D'Urso et al., 2023), or factor variances, especially in the conditions with unidimensional scales (Ferrando & Lorenzo-Seva, 2010). In addition, we expect that including the additional ARS factor for both groups will allow for MI to be established in the case of balanced and semi-balanced scales. Finally, in the conditions with unbalanced scales, we hypothesize that including an additional ARS factor may result in model estimation issues since ARS cannot be easily disentangled from the content factor(s).
A full-factorial design was used with 2 (number of subjects) × 3 (type of scale) × 2 (number of content factors) × 2 (number of items) × 2 (strength of ARS) = 48 conditions. For each condition, 100 replications were generated, resulting in 4,800 data sets.
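The count of conditions stated above can be reproduced by enumerating the crossed design. The following Python sketch uses the five factors that enter the stated product; the dictionary keys are illustrative labels chosen here, not names from the original analysis scripts.

```python
from itertools import product

# Design factors and levels entering the stated 2 x 3 x 2 x 2 x 2 product.
design = {
    "n_per_group": [250, 1000],
    "scale_type": ["balanced", "semi-balanced", "unbalanced"],
    "n_content_factors": [1, 2],
    "n_items": [12, 24],
    "ars_strength": ["medium", "large"],
}

# Every combination of levels is one cell of the full-factorial design.
conditions = [dict(zip(design, levels)) for levels in product(*design.values())]

n_replications = 100
print(len(conditions), len(conditions) * n_replications)  # 48 4800
```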

Data Generation
Data were generated from a factor model with one or two factors and two groups, and the model parameters are displayed in Table 2. To simulate balanced scales, for the content factor(s), half of the loadings were positive (i.e., indicative items) and the other half were negative (i.e., contra-indicative items), whereas 33% and none of the loadings were negative for semi-balanced scales and unbalanced scales, respectively. Note that, for both groups, 0 and 1 were used as generating values for the content factor(s) means and variances, respectively. As displayed in Table 2, we simulated ordinal items with 5 categories, and the distance between the first threshold of the easiest and the most difficult item was 2 standard deviations. To avoid estimation issues (e.g., non-convergence), we only retained data sets where each category for each item contained at least a single observation. In the rare cases where, for a specific item, a category was not observed among the generated scores, we repeated the data generation process until all response categories were observed.
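The two mechanics described in this paragraph, discretizing a continuous latent response with thresholds and regenerating data until every category is observed, can be sketched as follows. This is a simplified single-item, single-factor illustration in Python, not the generating code of the study: the loading, the thresholds, the residual variance of 1 minus the squared loading, and all function names are assumptions made here for the example, and the study regenerated whole data sets rather than single items.

```python
import random
from bisect import bisect_left


def discretize(latent_response, thresholds):
    """Map a continuous latent response to a category 0..C-1.

    `thresholds` holds the C-1 finite, increasing cut-offs. Category c is
    returned when threshold c < response <= threshold c+1, with the outer
    thresholds (-inf, +inf) implicit.
    """
    return bisect_left(thresholds, latent_response)


def simulate_item(n, loading, thresholds, rng):
    """Generate n ordinal responses for one item under a one-factor model."""
    residual_sd = (1.0 - loading ** 2) ** 0.5  # assumed: unit-variance latent response
    responses = []
    for _ in range(n):
        eta = rng.gauss(0.0, 1.0)          # content factor score (mean 0, variance 1)
        eps = rng.gauss(0.0, residual_sd)  # item-specific residual
        responses.append(discretize(loading * eta + eps, thresholds))
    return responses


def simulate_until_all_observed(n, loading, thresholds, rng, max_tries=1000):
    """Repeat the generation until every response category occurs at least once."""
    n_categories = len(thresholds) + 1
    for _ in range(max_tries):
        responses = simulate_item(n, loading, thresholds, rng)
        if len(set(responses)) == n_categories:
            return responses
    raise RuntimeError("not all categories observed within max_tries")


rng = random.Random(2024)
taus = [-1.5, -0.5, 0.5, 1.5]  # hypothetical thresholds for a 5-category item
item = simulate_until_all_observed(250, 0.6, taus, rng)
print(sorted(set(item)))
```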
We sampled the ARS factor scores from a right-censored normal distribution to match an agreeing tendency closely. Employing this distribution, we only simulated subjects who did or did not show an ARS (i.e., have a positive or zero factor score on the ARS dimension) without allowing for scores to represent a disagreeing tendency (i.e., a negative factor score). For simulating the effect of ARS on the item responses, we used loading values of 0.3 and 0.6 for the medium and large ARS scenario, respectively (see footnote 2). Note that, for the reference group, the ARS factor scores were simulated to be 0 for all subjects (i.e., ARS did not affect the item responses). To simulate between-item-type differences in ARS loadings, in the "unequal ARS" conditions, we decreased the size of the loadings on positively keyed items and increased those on the negatively keyed ones, so that the average ARS loadings remained the same across groups (Table 2). Similarly, for unbalanced scales, the loadings were decreased for half of the items and increased for the other half.

2 The variance of a right-censored normal distribution is smaller than the identification restrictions imposed to set a scale for the variance of the ARS factor (i.e., fixing all its loadings to 1). In Table 2 we report the value of the original loadings on the ARS factor multiplied by the standard deviation of a right-censored normal distribution, which is ≈0.583. This results in loadings on the ARS factor of 0.175 and 0.350 for medium and large ARS conditions, respectively. Note that these values match those used in previous studies to simulate ARS factor loadings that can be realistically expected in well-designed measures (Danner et al., 2015; Ferrando & Lorenzo-Seva, 2010).
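The rescaling described in this footnote can be checked numerically. The Python sketch below assumes that the censoring replaces negative standard-normal draws with zero, so that an ARS score is max(Z, 0) with Z standard normal; this reading is an assumption made here, based on the statement that ARS scores were either positive or zero. Under it, the standard deviation has a closed form, which is compared against a Monte Carlo estimate.

```python
import math
import random

# Closed form for X = max(Z, 0) with Z ~ N(0, 1):
#   E[X] = 1 / sqrt(2 * pi)  and  E[X^2] = 1 / 2.
mean_x = 1.0 / math.sqrt(2.0 * math.pi)
sd_x = math.sqrt(0.5 - mean_x ** 2)

# Monte Carlo check of the same quantity (seed chosen arbitrarily).
rng = random.Random(1)
draws = [max(rng.gauss(0.0, 1.0), 0.0) for _ in range(200_000)]
mc_mean = sum(draws) / len(draws)
sd_mc = math.sqrt(sum((d - mc_mean) ** 2 for d in draws) / (len(draws) - 1))

# Rescale the original ARS loadings (0.3 and 0.6) by the censored-normal SD.
original = {"medium": 0.3, "large": 0.6}
rescaled = {label: round(loading * sd_x, 3) for label, loading in original.items()}

print(round(sd_x, 3), round(sd_mc, 3), rescaled)
```

Under this assumption the closed-form standard deviation is about 0.584, close to the value reported in the footnote, and the rescaled loadings round to 0.175 and 0.350.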

Data Analysis
To simulate the effect of ignoring or including ARS when testing for MI, we considered two different MMs (and, thus, performed two different MG-CCFA analyses) for each replication, that is, with or without an additional ARS factor. For the model with the ARS factor, we used the standard CFA-based approach proposed by Billiet and McClendon (2000), where an additional ARS factor is specified with all loadings on this factor fixed to 1. Note that no additional constraints are imposed on the content factor loadings nor on the variance of the ARS factor under this model. To identify the MG-CCFA models, we followed the Wu and Estabrook (2016) identification constraints for MI testing described in Section 2 for both the model with and without the ARS factor.
All MG-CCFA models were estimated using diagonally weighted least squares (DWLS), but the full weight matrix was used to compute the mean-and-variance-adjusted test statistics (default in lavaan; Rosseel, 2012). DWLS is a two-step estimation procedure, where the thresholds and polychoric correlation matrices for the groups are estimated in the first step, and, in the second step, the remaining parameters are estimated using the polychoric correlation matrices from the previous step.

Outcome Measures
After fitting the models, we evaluated both the convergence rate (CR) and the performance of different model fit criteria. For the latter, we recorded the results obtained from the $\chi^2$ test, the root mean square error of approximation (RMSEA; Browne & Cudeck, 1993) and the comparative fit index (CFI; Bentler, 1990), and we averaged them across replications within each cell of the factorial design. In empirical practice, decisions about MI results are often dichotomous (i.e., invariant or not). Thus, we also calculated the false positive rate (FPR) for the different goodness-of-fit criteria, which is here defined as flagging the scale as non-invariant (see footnote 3). Specifically, configural non-invariance was concluded if: the $\chi^2$ test was significant ($\alpha = 0.05$), RMSEA >0.06, CFI <0.95. In addition, since common guidelines suggest basing invariance decisions on different goodness-of-fit indices, we created a combined criterion and concluded configural invariance if both a significant $\chi^2$-difference test and at least one of RMSEA <0.06 and CFI >0.95 was observed (Putnick & Bornstein, 2016). We compared the fit between the configural and the thresholds invariant models for thresholds invariance, between the thresholds and loadings invariant models for loadings invariance, and between the loadings and intercepts invariant models for intercepts invariance. For all these comparisons, non-invariance was concluded if: the $\chi^2$-difference test was significant ($\alpha = 0.05$), $\Delta$RMSEA >0.01, $\Delta$CFI < −0.01. Finally, for the combined criterion, non-invariance was concluded if we observed both a significant $\chi^2$-difference test and at least one of $\Delta$RMSEA >0.01 and $\Delta$CFI < −0.01. Since we deem the results on the values of or differences in the fit indices to be more informative than these dichotomized results, we only display the latter in the Online Supplementary (Tables A10-A21). In addition, we examined the potential bias in latent mean differences when acquiescence is not accounted for in the measurement model. To achieve this goal, for each factor, we averaged the estimated latent variable mean $\kappa^{(f)}$ for the focal (i.e., non-reference) group (see footnote 4) across replications in the intercepts invariance model (see 2.2.4). Note that this average latent mean is a direct indication of bias, since we simulated it to be zero in the data generating model.
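The dichotomous decision rules for the single criteria and for the combined criterion in nested-model comparisons can be written compactly. The Python sketch below encodes only the cut-offs stated in this paragraph; the function names and the example fit values are hypothetical and serve illustration only.

```python
def flags_configural(chisq_p, rmsea, cfi, alpha=0.05):
    """Per-criterion flags for configural non-invariance."""
    return {
        "chisq": chisq_p < alpha,  # significant chi-squared test
        "rmsea": rmsea > 0.06,
        "cfi": cfi < 0.95,
    }


def flags_nested(chisq_diff_p, delta_rmsea, delta_cfi, alpha=0.05):
    """Per-criterion flags for non-invariance when comparing nested models."""
    return {
        "chisq_diff": chisq_diff_p < alpha,  # significant chi-squared difference test
        "delta_rmsea": delta_rmsea > 0.01,
        "delta_cfi": delta_cfi < -0.01,
    }


def combined_nested(flags):
    """Combined criterion: significant difference test AND at least one index flag."""
    return flags["chisq_diff"] and (flags["delta_rmsea"] or flags["delta_cfi"])


# Hypothetical fit changes between two nested models.
example = flags_nested(chisq_diff_p=0.001, delta_rmsea=0.004, delta_cfi=-0.015)
print(example, combined_nested(example))
```

With these hypothetical values, the difference test and the change in CFI flag non-invariance while the change in RMSEA does not, so the combined criterion flags non-invariance.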

Data Simulation, Software and Packages
The data were simulated and analyzed using R (R Core Team, 2013). Specifically, for estimating MG-CCFA models and obtaining fit measures, we used the R package lavaan (Rosseel, 2012), while for specifying the MG-CCFA models we used the semTools package (Jorgensen et al., 2022).
Note. Loadings in bold indicate those that were negative for semi-balanced scales. The original ARS factor loadings values were 0.3/0.6 but here we display the rescaled ARS loadings obtained by multiplying the original ARS loadings values by the standard deviation of a right-censored normal distribution, which was used to generate the ARS factor scores. For ease of readability, the unequal ARS factor columns refer to the unidimensional factor model only.
3 Note that, outside of the null condition, this is not formally a FPR. In fact, only when considering the ARS factor can we really say that the scale is invariant, whereas disregarding it results in between the scales.
4 Note that we only considered the latent mean estimates for the focal group since the reference group latent means are constrained to 0 for model identification.

Without ARS Factor
Convergence.
In the Online Supplementary, Tables A4 and A5 display the convergence results when ARS is not included as an additional factor (i.e., disregarded) for equal and unequal ARS conditions, respectively. The convergence rate was always 100% across conditions for all MG-CCFA models and for both unidimensional and multidimensional scales. Therefore, disregarding the influence of ARS when assessing different levels of MI does not seem to affect model convergence.

MI Testing.
The average fit measure results obtained when evaluating MI for unidimensional and multidimensional scales are displayed in Tables 3-4 and Tables 5-6, respectively. The results for the "unequal" ARS conditions are reported in Tables A6-A9 of the Online Supplementary, since they largely overlap with those for the "equal" ARS conditions. The results indicate that the ARS strength and the type of scale were the most relevant design factors affecting the MI testing results. In fact, for both unidimensional and multidimensional scales, ignoring the influence of ARS deteriorated model fit at all MI levels, especially in the conditions with large ARS and balanced or semi-balanced scales. In these conditions, the RMSEA was often >0.10 and the CFI <0.90, which, in empirical practice, are commonly interpreted as indicating poor fit. Note that, when the influence of ARS was small (i.e., λ_ARS = 0.175), model fit was often good (i.e., RMSEA <0.06 and CFI >0.95), which is in line with previous research indicating that when loadings on the ARS factor are small (i.e., approximately 0.1), ignoring ARS does not seem to strongly affect the recovery of the MM parameters (Savalei & Falk, 2014). The fit measure values were good (i.e., RMSEA <0.06 and CFI >0.95) in the conditions with unbalanced scales regardless of the strength of the ARS factor and the MI level tested. Again, these results partially overlap with previous studies, indicating that the ARS factor gets absorbed by the content factors for unbalanced scales. Therefore, for unbalanced scales, one may conclude that MI holds even when one group has a strong agreeing tendency. Bear in mind that, for these scales, the bias introduced by an ARS does not seem to affect MI testing results, but it may affect factor scores or factor covariances for some groups (e.g., see Savalei & Falk, 2014), and thus (potentially) substantive conclusions. Concerning the dichotomized results (Tables A10-A12 in the Online Supplementary), almost all the considered criteria resulted in a close-to-one FPR when testing configural and intercepts invariance for balanced and semi-balanced scales, whereas for unbalanced scales the FPR was often close to 0.
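As a reading aid for the fit tables, the cut-offs quoted above can be written as a small classifier. This is only a minimal sketch in Python: the function name `classify_fit`, the "intermediate" label, and the choice to call a model "poor" as soon as either index crosses its threshold are illustrative assumptions, not part of the simulation code.

```python
def classify_fit(rmsea: float, cfi: float) -> str:
    """Label a fitted model using the conventional cut-offs quoted in the text.

    "good":  RMSEA < 0.06 and CFI > 0.95
    "poor":  RMSEA > 0.10 or  CFI < 0.90  (assumption: either index suffices)
    anything in between is labelled "intermediate" (our own label).
    """
    if rmsea < 0.06 and cfi > 0.95:
        return "good"
    if rmsea > 0.10 or cfi < 0.90:
        return "poor"
    return "intermediate"


# Hypothetical fit values, for illustration only (not taken from the tables).
for rmsea, cfi in [(0.04, 0.97), (0.12, 0.85), (0.08, 0.93)]:
    print(f"RMSEA={rmsea:.2f}, CFI={cfi:.2f} -> {classify_fit(rmsea, cfi)}")
```

Such fixed cut-offs are a convention rather than a test, so the labels should be read as descriptive summaries of the averages in the tables.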

Latent mean differences.
The average bias in the estimated latent mean difference when ARS is ignored is displayed in Table 7 as a function of the different conditions.
Overall, the bias is especially large in the conditions with unbalanced scales (approximately 0.19 and 0.35 for small and large ARS, respectively). This is likely due to the fact that, in these conditions, all items introduce bias in the same direction (i.e., positive) since all items were positively keyed. In contrast, in the conditions with balanced scales and equal ARS, positively and negatively keyed items bias the latent mean in opposite directions, thus canceling out one another and resulting in nearly unbiased estimates of the latent means.
In the unequal ARS conditions, with larger ARS loadings for negatively keyed items, the biases of the positively and negatively keyed items do not completely cancel out, resulting in negatively biased latent means for the balanced scales. This also explains why, for semi-balanced scales (with only one-third of negatively keyed items), the bias is positive in the equal ARS conditions and nearly zero in the unequal ARS conditions: the bias resulting from the positively keyed items is only canceled out completely when the (fewer) negatively keyed items get larger loadings. Finally, note that, given that latent variables are standardized, the latent mean for the focal group can be interpreted as Cohen's d, thus indicating that disregarding ARS erroneously leads to small to moderate standardized mean differences across groups.
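The cancellation argument above can be illustrated with a few lines of arithmetic. The sketch below is a deliberately simplified linear approximation, not the ordinal MG-CCFA model used in the simulation: it assumes that ARS shifts each item by its ARS loading and that the net effect on the content scale is the keyed average of those shifts. The function name `net_ars_shift`, the 12-item scales, and the loadings 0.3 and 0.6 are hypothetical values chosen for illustration.

```python
def net_ars_shift(n_pos, n_neg, ars_pos, ars_neg):
    """Average ARS-induced shift per item on the content scale.

    ARS pushes every raw response towards agreement. Once negatively keyed
    items are reverse-coded, that push counts against the construct, so it
    enters with a minus sign.
    """
    return (n_pos * ars_pos - n_neg * ars_neg) / (n_pos + n_neg)


# Hypothetical 12-item scales: (positively keyed, negatively keyed) item counts.
SCALES = {"balanced": (6, 6), "semi-balanced": (8, 4), "unbalanced": (12, 0)}

# Hypothetical ARS loadings: equal across items, or doubled for negative items.
ARS = {"equal": (0.3, 0.3), "unequal": (0.3, 0.6)}

for ars_label, (ars_pos, ars_neg) in ARS.items():
    for scale_label, (n_pos, n_neg) in SCALES.items():
        shift = net_ars_shift(n_pos, n_neg, ars_pos, ars_neg)
        print(f"{ars_label:8s} {scale_label:14s} net shift = {shift:+.2f}")
```

With these made-up numbers, the balanced scale shows no net shift under equal ARS and a negative one under unequal ARS, the semi-balanced scale shows a positive shift under equal ARS and none under unequal ARS, and the unbalanced scale is shifted by the full ARS loading in both cases, which mirrors the qualitative pattern described in the text.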

With ARS Factor

Convergence.
Table 8 (see footnote 5) displays the model convergence results when including an additional ARS factor in the MM. Convergence was strongly affected by the type of scale. In fact, for unbalanced scales, the convergence rate was lower than in the conditions with (semi-)balanced scales and especially low when testing for configural invariance. Therefore, in empirical practice, one may often fail to evaluate configural invariance when including ARS for an unbalanced scale. Note that this is likely caused by the fact that the ARS factor cannot be distinguished from the content factor(s), which is corroborated by previous research indicating that, in EFA, the additional ARS factor for unbalanced scales is not captured when selecting the number of factors (D'Urso et al., 2023; Ferrando & Lorenzo-Seva, 2010). The convergence rate is considerably higher for the higher levels of invariance, however, which leaves possibilities to scrutinize measurement (non-)invariance at these levels.

MI Testing.
Tables 9-10 and Tables 11-12 display the MI testing results when an additional ARS factor is included in the MM for unidimensional and multidimensional scales, respectively. We display the results for the unequal ARS conditions in the Online Supplementary in Tables A15-A18 as they largely overlap with those for the equal ARS conditions. The average fit measure results indicate that, for both unidimensional and multidimensional scales and for all MI levels tested, including the additional ARS factor yields good to perfect fit according to all fit measures regardless of the other design factors. For the dichotomized results (Tables A19-A22 in the Online Supplementary), almost all the considered criteria resulted in a close-to-zero FPR when testing MI at all levels.
5. The results for the unequal ARS conditions largely overlap with those displayed in Table 8; thus, we report them in Table A14 of the Online Supplementary.

Latent Mean Differences.
Table 13 displays the bias in the estimated latent mean difference when ARS is included as an additional factor, as a function of the simulated conditions. The results show that the bias is negligible for (semi-)balanced scales, which adds to the benefits of working with (semi-)balanced scales. For the unbalanced scales, the latent mean difference appeared to be highly distorted.
These distortions were mostly due to inadmissible solutions with negative factor variances, signaling model identification issues. Thus, we calculated the bias only for those models that did not result in improper solutions. The results showed not only that many models resulted in improper solutions, but also that, for the ones with admissible solutions, the bias in the latent mean difference was not negligible. Hence, for unbalanced scales, ARS still distorts conclusions for latent mean differences even when it is explicitly modeled, and (semi-)balanced scales should be preferred to accurately recover latent mean differences in content factors across groups.
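The screening step described above (dropping improper solutions before averaging the bias) can be sketched as follows. This is a hypothetical illustration in Python rather than the code used for the study: the record layout (`mean_diff`, `factor_variances`), the function name `bias_over_admissible`, and the default true difference of zero are all assumptions made for the example.

```python
from statistics import mean


def bias_over_admissible(replications, true_diff=0.0):
    """Average bias in the latent mean difference over admissible solutions.

    A replication counts as improper when any estimated factor variance is
    negative. Returns (bias, share of admissible replications); bias is None
    when no replication is admissible.
    """
    admissible = [
        r for r in replications
        if all(v >= 0 for v in r["factor_variances"])
    ]
    share = len(admissible) / len(replications)
    if not admissible:
        return None, share
    bias = mean(r["mean_diff"] - true_diff for r in admissible)
    return bias, share


# Hypothetical replications; the second one has a negative factor variance.
reps = [
    {"mean_diff": 0.20, "factor_variances": [1.0, 0.5]},
    {"mean_diff": 0.90, "factor_variances": [1.0, -0.1]},
    {"mean_diff": 0.40, "factor_variances": [0.8, 0.3]},
]
bias, share = bias_over_admissible(reps)
print(f"bias = {bias:.2f}, admissible share = {share:.2f}")
```

Reporting the admissible share next to the bias matters here, because a bias computed on a small surviving subset of replications may not be representative of the condition as a whole.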

Conclusions
The simulation study assessed the effect of disregarding and including an additional ARS factor on MI testing when responses in one group are affected by an ARS. The results showed that not taking a strong (unequal) ARS into account resulted not only in wrongly concluding that there is measurement non-invariance for balanced and semi-balanced scales but also in biased estimated latent means. In fact, for these scales, model fit heavily deteriorated for all MI levels, and thus one may conclude that the MM of the content factor(s) differs across groups while this is purely due to a strong agreeing tendency in one group. Note that this result is important for empirical practice, where researchers who follow the standard CFA-based MI testing approach may conclude that configural invariance does not hold and try to modify the MM to be able to compare the groups or, in the most extreme case, refrain from further analyses. In the balanced and semi-balanced scale conditions, this issue was solved by including, for all groups, an additional ARS factor with all its loadings fixed to 1, which resulted in concluding that MI held at all levels and in an accurate recovery of latent mean differences. For unbalanced scales, disregarding ARS (i.e., not including ARS as an additional factor) did not affect the MI testing results, which indicated that MI held at all levels. This latter result partially overlaps with previous research, showing that ARS gets absorbed by the content factor(s) in unbalanced scales (D'Urso et al., 2023; Ferrando & Lorenzo-Seva, 2010). However, for these scales, the disregarded ARS resulted in considerable bias in estimated latent mean differences. Hence, even though ARS does not influence MI testing in the case of unbalanced scales, it may still lead to wrongly concluding that latent means differ across groups, while this is purely due to a disregarded ARS. Thus, for unbalanced scales, this is a possibility that one should take into account. Including an additional factor to capture ARS is not a solution for unbalanced scales since this often led to model non-convergence, especially when testing for configural invariance, likely due to the indistinguishability of the ARS factor from the content factor(s). Further, it either resulted in improper solutions when testing for intercepts invariance or did not allow for an accurate recovery of latent means, thus indicating that, when ARS is at play, group differences may be heavily misjudged.
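For reference, the correction discussed above (an additional ARS factor with all its loadings fixed to 1) can be written compactly. The equation below is a sketch in generic categorical factor-analysis notation, not a reproduction of the paper's own model statement: the symbols (latent response, intercept, content loadings, thresholds) and the indexing are assumptions introduced here for illustration, and items are taken on their original, not reverse-coded, scale.

```latex
% Assumed notation: i = respondent, j = item, g = group, c = response category.
% x^{*}: latent response underlying the observed ordinal response x.
% \boldsymbol{\eta}: content factor(s); \eta^{ARS}: ARS factor with loading fixed to 1.
x^{*}_{ijg} = \nu_{jg}
            + \boldsymbol{\lambda}_{jg}^{\top}\boldsymbol{\eta}_{ig}
            + 1 \cdot \eta^{\mathrm{ARS}}_{ig}
            + \varepsilon_{ijg},
\qquad
x_{ijg} = c \;\text{ if }\; \tau_{jg,c} < x^{*}_{ijg} \le \tau_{jg,c+1}.
```

Under this reading, the content loadings change sign between positively and negatively keyed items while the ARS loading does not, which is what allows the two factors to be separated for (semi-)balanced scales and makes them hard to distinguish when all items are keyed in the same direction.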

Discussion
In psychological science, self-report scales are widely used to compare targeted latent constructs (e.g., depression) across groups. To draw valid and unbiased conclusions concerning latent construct differences, one must ensure that the self-report scales used to measure these constructs function equivalently across groups. The latter is often assessed through measurement invariance (MI) testing, which evaluates the tenability of the hypothesis of measurement model (MM) equivalence. For scales composed of ordinal items, MI is often tested through multiple group categorical confirmatory factor analysis (MG-CCFA), which allows evaluating the equivalence of MM parameters across groups in a stepwise fashion. In addition to the scale itself being inequivalent, non-invariances may emerge when disregarding the influence of an agreeing response style (ARS), which represents a tendency to agree with items regardless of their content (Paulhus, 1991). Though it is known that certain groups may be particularly prone to ARS (see Van Vaerenbergh & Thomas, 2013, for a review), and that such a response tendency can bias MM parameters in single-group studies (Ferrando & Lorenzo-Seva, 2010), this is the first paper thoroughly evaluating the effects of ARS on MI testing. Determining whether disregarding ARS can appear as measurement non-invariance may help to ascertain how to correct for bias in latent construct differences due to differential response tendencies across groups. In fact, it is superfluous to look for scale-specific causes of non-invariance if these are entirely due to ARS. Instead, including an extra factor to model the ARS corrects for the bias. In this paper, we conducted a simulation study to evaluate in what conditions and to what extent an ARS affecting the individual responses in one of the groups is detected as measurement non-invariance, both when disregarding ARS and when including it as an additional factor in the MM with all its loadings fixed to 1.
One of the more significant findings from this study is that ignoring a large ARS resulted in measurement non-invariance at all levels and biased latent mean differences for balanced and semi-balanced scales, which was solved by including an additional factor capturing ARS. Therefore, when using (semi-)balanced scales, researchers should bear in mind that configural non-invariance and artificial differences in latent means may result from disregarding a large ARS, and that including an additional ARS factor in the MM for all groups is an effective way to correct for this. In this way, researchers can ascertain that there is no need to look for or remedy inequivalences that pertain to the scale. For unbalanced scales, disregarding an ARS did not affect MI testing results, and including an additional ARS factor was not advantageous since it often led to model non-convergence. This is likely due to the fact that, for unbalanced scales, the intended-to-be-measured (i.e., content) factors cannot be easily distinguished from the ARS factor. Nevertheless, for these scales, one should not conclude that ignoring an ARS is harmless, because estimated latent mean differences are heavily biased (potentially in addition to bias in the factor correlations; Savalei & Falk, 2014; de la Fuente & Abad, 2020; D'Urso et al., 2023), which affects substantive conclusions. Unfortunately, in practice, the bias due to neglecting an ARS (e.g., in factor correlations and factor scores) may not be detected when testing for MI using unbalanced scales, and the correction proposed by Billiet and McClendon (2000) is not a solution. Therefore, in settings where ARS could be present, using unbalanced scales is inherently problematic as they do not allow one to correct for this response tendency.
Taken together, these results indicate that ARS is a serious threat to MI testing results and that (semi-)balanced scales should be preferred when suspecting that an ARS may be at play for specific groups. Using balanced or semi-balanced scales is not always straightforward, however. For instance, negatively worded items often require higher reading levels or intellectual capacity that cannot be assumed for certain (e.g., clinical) populations (Chyung et al., 2018). In these cases, one may consider using specific "marker" items or scales tailored to measure ARS, but further research is needed to evaluate the feasibility of this approach in the context of multidimensional scales and multiple group models (Ferrando et al., 2016). Alternatively, model non-convergence and improper solutions may be addressed by bounded estimation (De Jonckere & Rosseel, 2022), where data-driven upper and lower bounds for model parameters are specified prior to estimating the model (e.g., setting the lower bound of the ARS factor variance to a non-negative number). This is a promising solution that has not yet been evaluated for MG-CCFA.
Our simulation study is subject to a few limitations that are worth noting. First, ARS was the only considered source of bias that, when disregarded, led us to conclude that there is non-invariance. However, in practice, it is reasonable to expect that other, scale-specific sources of non-invariance, such as differential item interpretation, may affect individual responses in some (or all) groups. In the future, it would be interesting to extend the current simulation study to evaluate whether non-invariance due to disregarding an ARS may be disentangled from other non-invariances such as specific factor loading differences for the content factors. However, this CFA-based approach can also fall short (i) when items load on more than one factor at the same time (i.e., cross-loadings) and (ii) when researchers are interested in assessing (between-group differences in) the ARS factor loadings. One may consider using multiple group exploratory factor analysis (MG-EFA; Jöreskog, 1970) to overcome these limitations. MG-EFA does not impose an assumed structure on the factor loadings and thus can easily capture cross-loadings. Furthermore, in MG-EFA, one does not need to assume that the influence of ARS is equal across items (i.e., the loadings on the ARS factor do not need to be constrained to be 1). In fact, one may estimate the loadings of the additional ARS factor by using a (semi-)specified target rotation, that is, by specifying (part of) the rotation target according to a priori expectations on the MM while leaving the ARS factor loadings unspecified (D'Urso et al., 2023).
Second, ARS was assumed to affect the responses in only one of the groups, while, in practice, the responses in all groups may be influenced by ARS but to a different extent (e.g., ARS loadings may be higher in one group). In those cases, the CFA-based approach discussed can also be applied to test for MI, and its outcome (i.e., rejecting MI or not) will depend on the differences in ARS across groups.
Third, we considered simulation scenarios with only two groups. MI testing has become increasingly relevant for cross-cultural and cross-national research, where large data sets with many groups are the norm (Rutkowski & Svetina, 2017). Hence, future research should evaluate the extent to which disregarding or modeling ARS affects MI testing when this agreeing bias influences responses only for a subset of groups or when it gradually differs across all groups.
Fourth, we followed a scale-based MI testing framework, but alternative approaches, such as item-based analyses (e.g., multiple group item response theory; D'Urso et al., 2022), may also be of interest, especially when the bias caused by an ARS affects some items more than others or when trying to distinguish ARS from specific differences in the content factor loadings.
Fifth, we used standard cut-off values for describing our results based on known guidelines (e.g., Cheung & Rensvold, 2002). However, these cut-off values have been criticized due to their lack of generalizability beyond the models used to determine these values in the first place. Therefore, alternative approaches to determine model-specific cut-off values have been proposed, such as (1) dynamic fit indices (McNeish & Wolf, 2023) and (2) equivalence testing procedures (Finch & French, 2018; Marcoulides & Yuan, 2017; Yuan et al., 2016). However, these alternatives are not yet readily applicable to the conditions evaluated in this paper, since the former is still limited in its generalization to MI testing, while the latter is limited to continuous, normally distributed items. In the future, once these limitations are mitigated, it may be interesting to re-assess the generalizability of our conclusions to alternative cut-off values.
Nevertheless, the present study is the only thorough investigation of the effect of ARS on MI testing. We showed that correcting for agreeing bias when testing for MI allows one to determine that the scale is otherwise invariant. We expect this outcome to be valuable in empirical practice, as it avoids unnecessary worries about and investigations of scale non-equivalence (e.g., looking for non-invariant items).

Table 1.
Identification and MI constraints of Wu and Estabrook (2016) for MI testing with MG-CCFA.

Table 2.
Population values for the simulation study.

Table 3.
Average fit value for MI testing when the ARS factor is not included for unidimensional scales as a function of the simulated conditions with equal ARS.
Note. ARS: ARS factor loadings; N: sample size within each group; Bal: balanced scale; Semi: semi-balanced scale; Unbal: unbalanced scale; J: number of items. Values in bold indicate conditions where the average model fit results were below commonly accepted cut-off values for "good" fit.

Table 4.
Average fit value for MI testing when the ARS factor is not included for unidimensional scales as a function of the simulated conditions with equal ARS.

Table 5.
Average fit value for MI testing when the ARS factor is not included for multidimensional scales as a function of the simulated conditions with equal ARS.
Note. ARS: ARS factor loadings; N: sample size within each group; Bal: balanced scale; Semi: semi-balanced scale; Unbal: unbalanced scale; J: number of items. Values in bold indicate conditions where the average model fit results were below commonly accepted cut-off values for "good" fit.

Table 6.
Average fit value for MI testing when the ARS factor is not included for multidimensional scales as a function of the simulated conditions with equal ARS.
Values in bold indicate conditions where the average model fit results were below commonly accepted cut-off values for "good" fit.

Table 9.
Average fit value for MI testing when the ARS factor is included for unidimensional scales as a function of the simulated conditions with equal ARS.
Note. ARS: ARS factor loadings; N: sample size within each group; Bal: balanced scale; Semi: semi-balanced scale; Unbal: unbalanced scale; J: number of items.

Table 10.
Average fit value for MI testing when the ARS factor is included for unidimensional scales as a function of the simulated conditions with equal ARS.

Table 11.
Average fit value for MI testing when the ARS factor is included for multidimensional scales as a function of the simulated conditions with equal ARS.
Note. ARS: ARS factor loadings; N: sample size within each group; Bal: balanced scale; Semi: semi-balanced scale; Unbal: unbalanced scale; J: number of items.

Table 12.
Average fit value for MI testing when the ARS factor is included for multidimensional scales as a function of the simulated conditions with equal ARS.