Validating measurement tools for mentalization, emotion regulation difficulties and identity diffusion among Finnish adolescents

Abstract Mentalization, emotion regulation, and identity diffusion are theoretically and clinically important transdiagnostic psychological constructs that contribute to mental health. In order to advance meaningful empirical research on these constructs, we need measures that are well tested. In this study, we used confirmatory factor analysis to assess the reliability and construct validity of the Mentalization Questionnaire (MZQ), different versions of the Difficulties in Emotion Regulation Scale (DERS), and the Assessment of Identity Development and Identity Diffusion in Adolescence (AIDA) with data from a general population of Finnish adolescents (N = 360). For MZQ, the factor structure and validity of the subscales were not confirmed. For DERS, a short version that did not include the lack of emotional awareness subscale was the most coherent and recommendable version of the measure, with a good degree of reliability and a reasonable indication of convergent and discriminant validity between the subscales. For AIDA, the factor structure was confirmed, but when using this measure for research purposes, it should be taken into account that reverse-coded items may affect the factor structure by creating a method factor. The reliability of the AIDA was acceptable, but some of the subscales showed poor convergent and discriminant validity.


Measures and their validity
In previous research, mentalization has been measured using the observer-rated reflective functioning (RF) scale (Fonagy et al., 1998). Unfortunately, the RF scale is a complex, time-consuming instrument and is, therefore, difficult to use in everyday clinical practice or for research purposes with larger datasets. To address this issue, Hausberg et al. (2012) developed a self-report questionnaire for assessing mentalization, the Mentalization Questionnaire (MZQ). The authors noted that using a self-report measure to assess a skill like mentalization is somewhat problematic, but that an individual's attitude toward mentalization might be directly related to their actual ability.
The MZQ has not yet been validated against the more common observer-rated measures, but Hausberg et al. (2012) reported that patients with suicide attempts, self-injurious behavior, a diagnosis of BPD, or multiple diagnoses score lower on the MZQ than other patients, and that patients with a secure attachment pattern score higher than other patients. They also described the MZQ as having satisfactory internal consistency (.81 for the full scale and .54 to .72 for the subscales). A recent study found that reduced mentalization in adolescents, measured with the MZQ, was associated with depression and risk behaviors such as binge drinking, and that the individual's level of mentalization ability mediated the association between childhood traumatic experiences and depression (Belvederi Murri et al., 2017). The Finnish version (Keinänen et al., 2019) of the MZQ was used successfully to assess changes in mentalization during mentalization-based group therapy among university students with pervasive ER difficulties (Keinänen et al., 2017).
ER has a range of different well-established measures. In particular, the Difficulties in Emotion Regulation Scale (DERS; Gratz & Roemer, 2004) is widely used, has good psychometric qualities, and is reliable and valid in adolescents (Neumann et al., 2010). A Finnish translation (Tapola et al., 2010) has been used with adults, but normative data for adolescents is not available.
The original DERS includes 36 items, with six subscales: lack of emotional awareness, lack of emotional clarity, difficulties controlling impulsive behaviors when distressed, difficulties engaging in goal-directed behavior when distressed, nonacceptance of negative emotional responses, and limited access to effective ER strategies. However, Bardeen et al. (2012) found that the lack of emotional awareness subscale does not correlate well with the other scales and, thus, suggested a new version of the measure without that subscale (DERS-R). In order to reduce the burden for respondents, the DERS short form (DERS-SF), including half of the original items, was recently introduced. Kaufman et al. (2016) reported that the shorter version shows excellent psychometric properties together with the original structure, and that the factor structure of the measure was more coherent in the short form than in the original version.

Goth et al. (2012) developed a measure to assess identity diffusion in adolescence, the Assessment of Identity Development and Identity Diffusion in Adolescence (AIDA). AIDA is meant to identify the pathological features in adolescent identity that are considered central in personality disorders. The AIDA model differentiates two dimensions of identity: continuity and coherence. The continuity dimension has three subdimensions: stability in attributes, stability in relations, and positive emotional self-reflection. The coherence dimension also has three subdimensions: consistent self-image, autonomy, and positive cognitive self-reflection. Goth et al. (2012) described very good scale reliabilities for AIDA (.94 for the total scale, .86 and .92 for the main dimensions, and from .76 to .86 for the subdimensions). Jung et al. (2013) have also demonstrated empirically that AIDA can differentiate adolescents with personality disorder from the general population, as well as from adolescents with other types of psychiatric problems.
They argue that identity diffusion, as defined in AIDA, is a distinguishing mark of personality disorder, not only psychiatric impairment in general.
When measuring psychological constructs such as mentalization, ER, and identity, the construct validity of the measures should be addressed. Construct validity is seen as an overarching term that includes all other forms of validity. It refers to the extent to which a measure assesses the construct it is supposed to assess (Strauss & Smith, 2009). Campbell and Fiske (1959) introduced the terms convergent and discriminant construct validity. Convergent validity refers to the degree to which two concepts that theoretically should be connected are actually related. Discriminant validity is the degree to which constructs that are theoretically distinct are, in fact, unrelated. Nowadays, an often-used method for examining the construct validity of a measure is confirmatory factor analysis (CFA). CFA allows a comparison of alternative theoretical measurement models at the latent factor level and helps reduce measurement error. It can, therefore, be used to confirm the substructure of the measure (Atkinson et al., 2011; John & Benet-Martínez, 2000; McArdle, 1996).

Aim of the study
This study aimed to evaluate the construct validity and reliability of DERS, MZQ, and AIDA when used with a normative sample of Finnish adolescents. As alternative versions of DERS exist, the study evaluated the construct validity and reliability of the different versions.

Mentalization
MZQ (Hausberg et al., 2012) is a 15-item self-report measure consisting of four subscales: refusing self-reflection, emotional awareness, psychic equivalence mode, and regulation of affect. Each subscale consists of four items, except the regulation of affect subscale, which has three. For each item, participants indicate on a five-point Likert scale how much they agree or disagree with the statement. Hausberg et al. (2012) ascribe the MZQ satisfactory internal consistency and good construct validity but recommend using the total score instead of the subscales before further validation.

Emotion regulation
DERS (Gratz & Roemer, 2004) is a self-report measure comprising 36 items and six subscales, labeled lack of emotional awareness (AWARENESS), lack of emotional clarity (CLARITY), difficulties controlling impulsive behaviors when distressed (IMPULSE), difficulties engaging in goal-directed behavior when distressed (GOALS), nonacceptance of negative emotional responses (NONACCEPTANCE), and limited access to effective ER strategies (STRATEGIES). Participants rate on a five-point Likert scale how often each item applies to them. The number of items per subscale ranges from five to eight. The Finnish translation was conducted by Tapola et al. (2010).

Identity diffusion
AIDA (Goth et al., 2012) is a 58-item self-report measure that considers two dimensions of identity: continuity and coherence. The main dimensions comprise 27 and 31 items, respectively. The continuity dimension has three subdimensions: stability in attributes, stability in relations, and positive emotional self-reflection. The coherence dimension also has three subdimensions: consistent self-image, autonomy, and positive cognitive self-reflection. The subdimensions consist of 7 to 12 individual items. For each item, respondents report on a five-point scale how well it describes them. The translation of the AIDA was conducted as a part of this study. According to the requirements of the measure's developers, it was translated into Finnish, back-translated into English, and approved by the AIDA research group.

Participants and procedure
The participants in this study were 402 high school students from four schools in three cities in Finland. During regular school hours, the teacher instructed students to complete all three measures anonymously using an internet-based form. Participation was voluntary, and all parents received an electronic information letter before their child participated. Students who did not want to participate were instructed to submit an empty form. The form also included a question of conscience at the end, where participants were asked whether they had answered the questions honestly. Before any statistical analyses, all participants who had sent an empty form (four) or who answered the question of conscience negatively (38) were removed from the data. The final sample included 360 students (14-21 years, M = 16.39). Of these participants, 228 (63.3%) were female and 132 (36.7%) were male.
For CFA, a minimum sample size of 100-200 has been suggested, but adequate sample size has been shown to be highly context-dependent (Wolf et al., 2013). We tried to maximize our limited resources for data collection and collect as much data as possible.

Statistical analyses
The Statistical Package for the Social Sciences (SPSS 26 for Windows) was used for the analysis of descriptive statistics. We used independent-samples t-tests to assess gender differences in the means of the different scales and subscales. The factor structure of the measures was assessed using CFA. To handle variable non-normality, a robust maximum likelihood estimator was used. To assess the missing data mechanism, we first applied Little's MCAR test. We then examined the missing data mechanism variable by variable, using t-tests to assess whether respondents with missing data differed from respondents without missing data. The full information maximum likelihood method (FIML), without auxiliary variables, was used to handle the missing data, as suggested by Little et al. (2014) when the proportion of missing data is not very high. All models were tested using the R environment for statistical computing (R Development Core Team, 2008) and the lavaan package (Rosseel, 2012).
For MZQ, we assessed two different measurement models. The authors of the measure (Hausberg et al., 2012) introduced a four-factor model but suggested using a one-factor MZQ before further validation. In line with their suggestion, we first used the one-factor model, in which all the items were set to load on a single factor. The second model included four factors. Because the two models were nested, we tested the chi-square difference of the models to compare their fit to the data.
For DERS, we assessed four different measurement models. First, we considered the original full-length six-factor model suggested by Gratz and Roemer (2004). Second, we assessed the full-length five-factor DERS without the AWARENESS scale (DERS-R), a model suggested by Bardeen et al. (2012). Third, we looked at the shortened version of the original DERS (DERS-SF), which included six factors and was developed by Kaufman et al. (2016). Fourth, we evaluated a measurement model that included combined alterations to the original DERS; it was a shortened five-factor version of the DERS without the AWARENESS scale (DERS-R-SF).
For AIDA, we applied a higher order CFA using a measurement model consisting of two main scales (identity continuity vs. discontinuity and identity coherence vs. incoherence). Both higher order factors included three subscales. Continuity vs. discontinuity included the subscales stability in attributes, stability in relations, and positive emotional self-reflection. Coherence vs. incoherence included the subscales of consistent self-image, autonomy, and positive cognitive self-reflection. This model was the original AIDA model developed by Goth et al. (2012).
Because incremental fit indices compare the user's measurement model against a supposedly poorly fitting baseline model, they might be misleading if the baseline model fits the data exceptionally well. It has been suggested that if the RMSEA of the baseline model is smaller than .158, the incremental goodness-of-fit indices may not be reliable (Kenny, 2015; Kenny et al., 2015). To address the mixed results among the goodness-of-fit indices of the AIDA models, we checked the root mean square error of approximation (RMSEA) of the baseline null model.
Four commonly recommended fit indices were used to evaluate model fit: the Tucker-Lewis Index (TLI), the Comparative Fit Index (CFI), the RMSEA, and the Standardized Root Mean Square Residual (SRMR). The following guidelines were considered an indication of good fit: TLI close to .95, CFI close to .95, RMSEA close to .06 with a 90% confidence interval upper limit of .08, and SRMR close to .08 (Hooper et al., 2008; Hu & Bentler, 1999).
Scale reliability was evaluated using the composite reliability coefficient omega (McDonald's ω), which is considered to provide a more accurate approximation of scale reliability than other measures (Revelle & Zinbarg, 2009). As an estimate of convergent validity, the value of the average variance extracted (AVE) was considered. To indicate acceptable convergent validity, the AVE value should exceed .50. To assess latent variable discriminant validity, we compared the AVE value to the shared variance of the latent variables: for any two constructs to show discriminant validity, the AVE value of both constructs should be larger than the shared variance (squared correlation) of the two constructs (Fornell & Larcker, 1981).

Descriptive statistics
The means and standard deviations, Cronbach's alphas, and ranges of corrected item-total correlations for MZQ, DERS, and DERS-SF total scales and subscales are summarized in Table 1. The mean scores for MZQ were lower for girls (indicating more reported mentalization difficulties) on all the scales except refusing self-reflection. For DERS, the mean scores were significantly higher for girls (indicating more reported ER difficulties) on all scales except AWARENESS.
For AIDA, the proportion of missing data was 3.1% in total, ranging from 0.6% to 7.5% across individual variables. For DERS, the range of missing data on single variables was 2.2-8.9%, and the total proportion of missing data was 6.0%. For MZQ, data were missing in 8.4% of the data points, ranging from 5.3% to 10.3% on individual variables. Little's MCAR test was nonsignificant for AIDA, giving us no reason to assume the AIDA data were not missing completely at random (MCAR). For DERS and MZQ, Little's MCAR test showed that the missing data were not MCAR. However, the t-tests showed that for almost all DERS and MZQ variables, respondents with missing data did not differ from respondents without missing data. The only exceptions (p < .05) were items 8 and 11 for MZQ, items 35 and 36 for DERS, and item 8 for AIDA. Therefore, for DERS and MZQ, the data were assumed to be missing at random.
For DERS-SF, the AWARENESS and IMPULSE scores for boys and girls did not differ significantly; for all other scales, the girls had significantly higher scores than the boys. Correlations for DERS and the corresponding DERS-SF scales were all significant, with .84 for the total scale, .88 for AWARENESS, .91 for CLARITY, .96 for IMPULSE, .96 for GOALS, .97 for NONACCEPTANCE, and .93 for STRATEGIES.
The means and standard deviations, Cronbach's alphas, and ranges of corrected item total correlations for the AIDA main scales and the subscales are summarized in Table 2. For AIDA, the girls' mean scores were higher (indicating more identity problems) than the boys, but for the stability in relations subscale, the difference was not significant.

One-factor model
The one-factor measurement model for MZQ provided a poor fit to the data: χ²(90) = 317.69, p < .001, CFI = .829, TLI = .801, RMSEA = .092 (90% confidence interval .082-.104), SRMR = .062. SRMR met the expected guideline, but the other goodness-of-fit indices were quite far from the guidelines. The composite reliability estimate for the single factor was acceptable (ω = .87), but the AVE value failed to indicate adequate convergent validity (AVE = .32).

Four-factor model
The original four-factor measurement model suggested by Hausberg et al. (2012) provided a significantly better fit to the data than the one-factor model: Δχ²(6) = 35.96, p < .001. However, even though the four-factor model fit the data better than the one-factor model, the overall fit of the four-factor model was not adequate: χ²(84) = 271.85, p < .001, CFI = .865, TLI = .831, RMSEA = .085 (90% confidence interval .074-.096), SRMR = .063. SRMR met the expected guideline, and RMSEA was close to it, but CFI and TLI did not indicate a good fit.
Composite reliability was acceptable for all factors except refusing self-reflection. AVE values, however, did not indicate adequate convergent validity for any factor or for the overall scale (see Table 3). Comparisons between squared correlations and AVE values failed to indicate discriminant validity between factors, as all the AVE values were lower than the lowest shared variance (squared correlation) between the constructs.

Exploratory model
As neither model was adequately confirmed, we continued using CFA with modification indices, allowing cross-loadings to explore an alternative measurement model that would fit the data better. We examined from the modification indices which parameters seemed to cause the largest model misfit and allowed one item at a time to load on two factors: the factor suggested by the original model and the factor suggested by the modification indices. When an item clearly loaded better on the alternative factor, we re-specified the model accordingly, checked the model fit and the modification indices after the modification, and repeated these steps until the model fit was acceptable. This step-by-step approach led us to move item 9 from refusing self-reflection to the emotional awareness factor, item 15 from emotional awareness to the regulation of affect factor, and item 2 from regulation of affect, items 4 and 7 from psychic equivalence mode, and item 8 from emotional awareness to the refusing self-reflection factor. We renamed emotional awareness as F1, psychic equivalence mode as F2, regulation of affect as F3, and refusing self-reflection as F4 (see Table 5). The alternative four-factor exploratory model fit the data well: χ²(84) = 171.22, p < .001, CFI = .950, TLI = .930, RMSEA = .057 (90% confidence interval .045-.070), and SRMR = .043. All goodness-of-fit indices met or were close to the expected guidelines.
All factors showed acceptable reliability. The AVE values indicated convergent validity for factors F1, F2, and F3 but were lower than expected for F4. Total AVE was slightly lower than expected (see Table 4). Factor correlations were all significant and ranged from .56 to .82. Comparisons between AVE values and squared correlations indicated discriminant validity for F1 from F2, and for F2 from F3. Single-item factor loadings ranged from .55 to .86 for F1, from .74 to .84 for F2, from .75 to .78 for F3, and from .50 to .66 for F4 (see Table 5).

Full-length DERS
The full-length DERS model with six factors provided a somewhat acceptable, but not good, fit to the data, χ²(579). RMSEA was close to the guideline, but the other goodness-of-fit indices failed to show adequate fit to the data. All factors showed acceptable reliability. The AVE values failed to indicate convergent validity for AWARENESS and CLARITY, but the values for IMPULSE, GOALS, NONACCEPTANCE, STRATEGIES, and total AVE were acceptable (see Table 6). Factor correlations ranged from −.05 to .82. AWARENESS correlated significantly only with the CLARITY factor; for the other factors, intercorrelations were high. Comparisons between AVE values and squared correlations between factors indicated discriminant validity for AWARENESS from all factors, for CLARITY from GOALS, for IMPULSE from GOALS and NONACCEPTANCE, and for GOALS from NONACCEPTANCE. Single-item factor loadings ranged from .44 to .76 for AWARENESS, from .53 to .70 for CLARITY, from .33 to .86 for IMPULSE, from .42 to .87 for GOALS, from .66 to .84 for NONACCEPTANCE, and from .36 to .80 for STRATEGIES (see Table 7).
Full-length DERS without the AWARENESS factor (DERS-R)
The indication of overall reliability was good (ω = .96). Convergent validity was indicated for the overall scale and for all factors except CLARITY (see Table 6). Correlations between factors ranged from .57 to .82, and comparisons of AVE values and squared correlations between factors indicated discriminant validity for GOALS from CLARITY and for IMPULSE from NONACCEPTANCE. Factor loadings for single items ranged from .48 to .73 for CLARITY, from .33 to .86 for IMPULSE, from .43 to .86 for GOALS, from .65 to .84 for NONACCEPTANCE, and from .36 to .80 for STRATEGIES (see Table 7).

Short version of DERS (DERS-SF)
The measurement model for the short version of DERS provided a good fit to the data: χ²(120) = 186.62, p < .001, CFI = .974, TLI = .966, RMSEA = .044 (90% confidence interval .031-.056), SRMR = .036. All goodness-of-fit indices met the expected guidelines. The DERS-SF AWARENESS factor failed to show adequate reliability, but composite reliability was acceptable for all other factors (see Table 6). AVE values indicated convergent validity for CLARITY, IMPULSE, GOALS, NONACCEPTANCE, and STRATEGIES (see Tables 6 and 7).

Short version of DERS without AWARENESS-factor (DERS-R-SF)
A measurement model for DERS-R-SF fit the data well: χ²(80) = 144.40, p < .001, CFI = .971, TLI = .962, RMSEA = .055 (90% confidence interval .041-.070), SRMR = .034. All goodness-of-fit indices met the expected guidelines. All factors indicated acceptable reliability, and AVE values indicated convergent validity for all factors (see Table 6). Correlations between factors ranged from .52 to .79, and comparisons between AVE values and squared correlations indicated discriminant validity for CLARITY from GOALS and NONACCEPTANCE, for IMPULSE from GOALS, NONACCEPTANCE, and STRATEGIES, and for GOALS from NONACCEPTANCE and STRATEGIES. Factor loadings for all items were above .71 (see Table 7).

Identity integration
Original AIDA model
The RMSEA of the baseline null model fell below the .158 threshold, suggesting that the incremental fit indices would not be useful; the absolute indices were more credible, and by those the model fit the data. All factors showed acceptable reliability, but none of the AVE values indicated convergent validity for the subscales (see Table 8). The correlation between the two higher-order scales was significant (.96). The correlations between the lower-order subfactors were all significant, ranging from .56 to .82. Comparisons between AVE values and squared correlations failed to indicate discriminant validity between the subscales. The standardized factor loadings for the lower-order subscales ranged from .10 to .67 for stability in attributes, from .34 to .80 for stability in relations, from .51 to .77 for positive emotional self-reflection, from .34 to .79 for consistent self-image, from .37 to .75 for autonomy, and from .50 to .70 for positive cognitive self-reflection. The standardized loadings of the second-order factors ranged from .62 to .97 for discontinuity and from .93 to .98 for incoherence.

Explorative model with a method factor
Even though the original model fit the data, the fact that the first subscale (stability in attributes) consisted of many of the reverse-coded items raised the possibility that the subscale had features of a method factor. To evaluate possible method factor effects, we constructed an explorative alternative model. Like the original model, the explorative model included two main scales, but the first subscale (stability in attributes) was removed, and the corresponding items were divided between the two remaining subscales of the continuity main scale. This was done by allowing individual items to cross-load on both subscales and reviewing, from the factor loadings and modification indices, which subscale seemed more appropriate for each item. Since the AIDA model is a higher-order model, in which all the items of the stability in attributes subscale also belong to the higher-order main scale (continuity), we found it theoretically justifiable to assume the model could be specified with these items on different subscales of the continuity main scale. The high correlations between the subscales also support this. The explorative model also incorporated a method factor that included all reverse-coded items; the correlation of the method factor with the other factors was set to zero. Details of the individual items on the different scales and subscales of the explorative model are shown in Table 9. We tested the chi-square difference between the two models to compare their fit to the data. Goodness-of-fit indices showed mixed results for the explorative model, χ²(1577).

Discussion
This study aimed to evaluate the construct validity and reliability of measures for mentalization, ER difficulties, and identity diffusion when used in a normative sample of Finnish adolescents. Finding adequate tools to measure these transdiagnostic constructs has great clinical importance. The paradigm for assessing and diagnosing personality disorders has lately been shifting from a categorical to a dimensional approach (Mulder & Tyrer, 2019). BPD has been suggested to capture impairment in personality functioning more generally (Goth et al., 2012; Mulder & Tyrer, 2019). Reliable and valid measures of concepts central to different treatments for BPD therefore have potential for use in clinical practice: they could assess features of personality functioning beyond symptoms and categorical diagnoses, improve assessment in treatment planning and monitoring, and bring scientific and clinical, as well as psychological and psychiatric, points of view closer together.

Mentalization
The original validation for MZQ was conducted in an adult clinical sample, but our results were similar to those from a nonclinical sample of Italian adolescents (Belvederi Murri et al., 2017); the mean scores and standard deviations in the two studies were similar. In our sample, the female participants reported more difficulty with mentalization overall and on the subscales than the male participants. Previous studies have not reported data separately for males and females, so we cannot compare our data. The gender differences might reflect that the MZQ captures difficulties in mentalization more typical for females than for males, for example momentary failures resulting from intensive emotions, and that difficulties more typical for males are not as well recognized by the MZQ. Since the females in our sample, as in previous studies, reported more difficulty in ER and identity, it seems plausible that the MZQ captures in particular those difficulties in mentalization that correlate with problems in ER and identity. Previous general-population studies using observer-rated measures, such as the RF scale, have reported that adult women seem to have higher levels of mentalization than men (Jessee et al., 2016). Gender differences among adolescents have not been found (Chow et al., 2017). These findings suggest either that the MZQ captures different aspects of mentalization than the RF scale, or that the girls in our sample were more aware of their problems with mentalization than the boys. This highlights the need for MZQ validation studies against observer-rated measures, as suggested by the scale authors in their original validation study (Hausberg et al., 2012). Hausberg et al. (2012) suggested using the whole MZQ scale before further validation of the measure.
In our results, however, the one-factor model had a poor fit, and the AVE value of the one-factor model indicated that most of the variance explained by the model was due to measurement error. These results contrast with the use of the total score of the measure as a valid indication of mentalization ability.
The four-factor measurement model of MZQ did not provide a very good fit to the data. The reliability of the overall scale, as well as of the subscales other than refusing self-reflection, was acceptable. Our findings regarding subscale reliability are in line with Hausberg et al. (2012). Regarding the factor structure of the measure, SRMR showed similar results, but the RMSEA indicated a poorer fit in our CFA. In contrast to Hausberg et al. (2012), we used incremental fit indices in addition to absolute fit indices to assess model fit, thus obtaining a more multifaceted view of the fit.
In the original validation study, Hausberg et al. (2012) described indications of convergent and discriminant validity for the MZQ total score, assessed via correlations with measures of symptom severity. We used AVE values to assess the convergent validity of the subscales, and we also assessed the discriminant validity of the subscales by comparing AVE values and factor correlations. In our findings, lower-than-expected AVE values suggested problems with the convergent validity of the subscales. AVE represents the average amount of variance that the construct explains in its indicator variables relative to the overall variance of those indicators. None of the AVE values in our MZQ measurement models reached a satisfactory level, implying that most of the variance explained by the latent variables of the measurement models reflects measurement error. Also, given the low AVE values and high correlations between factors, our data suggest that the discriminant validity of the factors proposed by Hausberg et al. (2012) is low. Consistent with these indications of unsatisfactory convergent and discriminant validity, many items had quite low (<.50) factor loadings, implying item cross-loadings.
It is possible that the factor structure and validity of the measure was somewhat different in our data than in the original validation sample because we used a population sample instead of a clinical sample. It is also possible that our results reflect an unsuccessful Finnish translation of the MZQ and that the items in the English-language version capture the true mentalization ability with less measurement error. However, there is a need for further studies regarding MZQ before recommending it as a valid measure of adolescent mentalization ability among Finnish adolescents.
Finally, after finding that the initial model did not fit the data well, we used modification indices to find an alternative exploratory model for the MZQ. In our model, F1 can be considered a factor measuring ER and the understanding of one's own emotions. F2 can be considered a factor related to perceiving relationships and being criticized as a major threat. F3 seems to be related to perceiving emotions as a threat and an uncontrollable force, a feature that could be related to the risk of mentalization failures caused by intense emotions. F4 consists of items reflecting rigid and concrete thinking and a nonmentalizing attitude toward one's own mind in general. However, because modification indices can lead to model overfitting, it is important to note that this exploratory model may fit only these data; using these results in a more general way requires replication in other datasets.

Emotion regulation
We assessed four different measurement models for the versions of the DERS. Our findings are in line with Kaufman et al. (2016), indicating that the short version of the DERS shows similar, and in part stronger, psychometric properties compared with the full version, at least among adolescents, and our data give more weight to recommendations favoring the shorter version of the measure. Reliability was very similar between the short version (without items with weak factor loadings) and the full version, but the AVE values (an indication of convergent validity) were generally stronger in the short version. Bardeen et al. (2012) pointed out that the AWARENESS subscale does not correlate well with the other subscales and does not contribute much to the general DERS factor. They suggested that the AWARENESS subscale does not belong to the same higher-order construct as the other subscales and recommended using the DERS without the AWARENESS subscale as a more cohesive, unified measure. Our data are in line with their findings. AWARENESS is also the only DERS subscale consisting mostly of reversed items, and these items may create a method factor that reflects properties of the measure rather than the measured psychological construct (DiStefano & Motl, 2006).
There have been other suggestions for handling the problems of the AWARENESS subscale. Cho and Hong (2013) suggested that the AWARENESS and CLARITY subscales could be combined into a single understanding emotions construct with a controlled method factor for reverse-scored items. However, in our data, the AWARENESS subscale was psychometrically the weakest, with the lowest reliability, convergent validity (AVE), and item factor loadings in both the short and full versions of the measure. This suggests that removing the subscale from the measure would be the simplest and, perhaps, most recommendable way of using the DERS.
In our sample, girls reported more overall ER difficulties than boys. On the AWARENESS subscale, however, the mean score for boys was higher than for girls, although the difference was not statistically significant. In the short version of the scale, the results were otherwise similar, but on the IMPULSE subscale the difference between the genders was not statistically significant. In a previous general population validation study with Dutch adolescents (Neumann et al., 2010), girls scored significantly higher than boys on the total score and on most subscales; on AWARENESS, boys had higher scores than girls, and on the IMPULSE subscale there were no gender differences. Overall, the gender differences in our sample were very similar to those reported by Neumann et al. (2010), even though there were slight differences between the samples.

Identity integration
We obtained mixed CFA results concerning the fit of the AIDA model to the data. The absolute goodness-of-fit indices (RMSEA, SRMR, and χ²/df) indicated acceptable fit, but the incremental goodness-of-fit indices (CFI and TLI) were quite far from the expected guidelines, suggesting a poor fit. However, because incremental fit indices compare the user-specified measurement model against a supposedly poorly fitting baseline model, they can be misleading if the baseline model fits the data exceptionally well (Kenny, 2015; Kenny et al., 2015). In our study, this was the case: the RMSEA of the baseline model suggested that the absolute goodness-of-fit indices were more credible and that the AIDA measurement model developed by Goth et al. (2012) was consistent with our data.
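The dependence of incremental indices on the baseline model can be seen directly from the standard definition of the CFI, where $M$ denotes the fitted model and $B$ the baseline (independence) model:

\[
\mathrm{CFI} = 1 - \frac{\max\left(\chi^2_{M} - df_{M},\, 0\right)}{\max\left(\chi^2_{B} - df_{B},\, 0\right)}
\]

When the baseline model already fits well, $\chi^2_{B} - df_{B}$ is small, so the denominator shrinks and the CFI is deflated even for a measurement model with little absolute misfit, which is the situation described above.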
The composite reliability of the AIDA subscales was acceptable, and the reliability estimates were very similar to those in Goth et al.'s (2012) original validation study. The AVE values of the subscales, however, were lower than expected, implying that the AIDA produces a significant amount of measurement error and that the items within the subscales do not necessarily measure the same construct. Low AVE values also resulted in a lack of discriminant validity between the subscales. One possible way to improve the measure could be to remove some items with low factor loadings and examine whether the coherence of the subscales improves without jeopardizing reliability.
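The pattern of acceptable composite reliability alongside low AVE follows from the standard formulas for a congeneric scale with $k$ items, standardized loadings $\lambda_i$, and error variances $\theta_i = 1 - \lambda_i^2$:

\[
\omega = \frac{\left(\sum_{i=1}^{k} \lambda_i\right)^{2}}{\left(\sum_{i=1}^{k} \lambda_i\right)^{2} + \sum_{i=1}^{k} \theta_i},
\qquad
\mathrm{AVE} = \frac{1}{k}\sum_{i=1}^{k} \lambda_i^{2}
\]

Because $\omega$ sums the loadings before squaring, it increases with the number of items, whereas AVE simply averages the squared loadings; a subscale with many moderately loading items can therefore reach acceptable $\omega$ while its AVE remains below the conventional .50 benchmark.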
Another possible problem with the measure is that many reverse-coded items cluster on the first factor (stability in attributes). This suggests that the stability in attributes subscale has features of a method factor, reflecting properties of the measure and differences or biases in respondents' answering styles rather than differences in the underlying psychological construct. To evaluate this issue, we built a model in which all reversed items were controlled as a method factor and all items on the first factor were collapsed into the other subfactors under the same main factor. This model fitted the data better than the original model, suggesting that the reverse-coded items affected the factor structure of the measure. However, further confirmatory studies with different datasets are needed.
The bias from reverse-coded items should not have a great impact on the clinical usefulness of the measure, because the higher-order factor structure of the measure seems to be confirmed and all the factors correlate with each other quite strongly. In research use, however, it is important to be aware of this issue, because the method factor might affect, for example, the fit of more complex structural equation models. The debate regarding the pros and cons of reverse-coded items is ongoing: some researchers suggest that they should not be used at all, while others see them as useful despite possible bias in factor structures. Weijters et al. (2013) have argued that the same biases affecting reverse-coded items also partly affect direct items, and that using reversed items at least brings these biases out for us to acknowledge.
In our sample, girls reported more problems with identity than boys overall. The differences between the genders were also significant on the subscales, except for stability in relations. Our findings regarding gender are quite similar to the original validation study of the measure (Goth et al., 2012), in which girls had slightly higher mean scores than boys on all scales, but on the stability in attributes subscale the gender difference was not significant.

Limitations
This study has some limitations. Our data do not include any criterion variables, such as alternative measures of mentalization, ER, or identity, or measures of mental health variables. This prevents us from addressing the criterion validity of the measures or the convergent validity of the full measures. Our scope is therefore limited to assessing the factorial structure of the measures, the convergent and discriminative validity, and the coherence of the different subscales. Our study used a general adolescent population sample, which limits the generalizability of our findings to specific patient populations. Data were collected from four schools in three different cities. The schools were selected by contacting a larger number of schools and including those that were willing to participate. This type of selection can bias results, but because the schools included, and Finnish schools in general, are quite homogeneous, such bias is unlikely.
An adequate sample size in CFA is highly contextual (Wolf et al., 2013), and the best way to determine it would be to conduct simulations prior to data collection. We were unable to do this, and our sample size is based on more uncertain rules of thumb. Therefore, our sample size might be problematic, mainly for the AIDA models, which have a relatively high number of items and parameters to be estimated. However, in our results for the AIDA models, there were no straightforward indications of limited statistical power, such as nonsignificant parameter estimates.

Conclusions
The results of the present study indicate that there is still work to do before valid measurement of the three core transdiagnostic constructs of mentalization, ER, and identity diffusion is achieved. First, the factor structure of the Finnish translation of the MZQ was not confirmed for use among adolescents, and the measure seemed to produce quite a significant amount of measurement error; further studies of the structure and construct validity of the MZQ are needed. Second, the results of this study imply that some modification may be necessary to obtain the most reliable results from the DERS and the AIDA. In particular, we recommend using the short version of the DERS without the lack of emotional awareness subscale. Concerning the AIDA, our results confirmed the overall factor structure, but the stability in attributes subscale might have some features of a method factor, and this should be taken into account when using the AIDA for research. In addition, the subscales show a somewhat problematic amount of measurement error, affecting their convergent and discriminative validity.
All procedures performed in studies involving human participants were in accordance with the ethical standards of the Ethics Committee of the Tampere region (31/2017), and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. Informed consent was obtained from all individual participants included in the study. The authors declare that they have no conflict of interest.