Observer-based and computerized measures of the patient’s mentalization in psychotherapy: A scoping review

Abstract
Objective: In recent decades, mentalizing has found its permanent place both in therapeutic practice and in psychotherapy research. Inconsistent results and null results are often found. Therefore, the different methodological approaches should be examined in more detail. A scoping review was conducted to provide an overview of the approaches that measure the patient's mentalizing ability based on therapy sessions or in the course of psychotherapy.
Method: For the scoping review, a literature search was conducted in four databases. A total of 3217 records were identified.
Results: We included 84 publications from 43 independent studies. Most studies used the Reflective Functioning Scale and applied the scale to therapy sessions or the Adult Attachment Interview. The other identified approaches used a computerized text analysis measure or clinician-report measures. Mostly good psychometric properties of the measures were reported. The Reflective Functioning Scale applied to the Adult Attachment Interview was the only measure that proved to be sensitive to change.
Conclusion: More economical variants to the time-consuming Reflective Functioning Scale applied to the Adult Attachment Interview are being developed continuously. In some cases, there is no standardized approach, or the measures are used only sporadically and require further and more comprehensive psychometric evaluations.


Dimensions of mentalizing
Mentalizing is a multidimensional construct with four different dimensions with different underlying neural systems (Fonagy & Luyten, 2009). Each dimension distinguishes between two poles. Mentalizing can be (a) related to oneself or other people (Fonagy & Bateman, 2019). The process can be (b) automatic (implicit) or controlled (explicit). Automatic mentalizing is faster and requires less effort and awareness than controlled mentalizing but is also more reflexive and tends to be overly simplistic. The basis for inferences can be (c) external features such as facial expressions and gestures or considerations of a person's internal experiences. Mentalizing can be (d) cognitive and refer to understanding mental states, or it can be affective and refer to feeling the mental state. Effective mentalizing is characterized by a balance across all dimensions. Appropriate mentalizing is moreover dependent on the situation (Bateman & Fonagy, 2016).

Prementalizing modes
If mentalization is ineffective, a distinction can be made between the three different prementalizing modes: teleological mode, psychic equivalence mode and pretend mode (Luyten et al., 2020). These are similar to the way children behave before their mentalizing ability is fully developed (Fonagy & Bateman, 2019). In the teleological mode, mental states and their meaning are recognized only in the context of observable behavior. This is also reflected in an imbalance toward external mentalizing. In the psychic equivalence mode, one's feelings and thoughts are experienced as the only reality, and there is little room for alternative explanations. In the clinical context, this is also referred to as the concreteness of thoughts (Bateman & Fonagy, 2016). External, affective mentalizing about the self dominates in this mode (Luyten et al., 2020). In the pretend mode, mental states are perceived separately from reality. There appears to be an imbalance toward implicit mentalizing. Hypermentalizing may occur, in which much is said about mental states, but the reference to genuine experiences is missing, and the inferences seem groundless (Fonagy & Bateman, 2019).

Mentalizing in psychotherapy research
In a clinical context, mentalizing is particularly prominent in association with borderline personality disorder (BPD). It is used for developmental models of BPD and treatment (Bateman & Fonagy, 2004; Fonagy, 1991; Fonagy & Bateman, 2007). The mentalizing profile for BPD typically shows an imbalance toward automatic, external, and affective mentalizing (Bateman et al., 2019). Bateman and Fonagy (2004) argue that promoting mentalizing is a common factor in BPD treatment. In empirical studies, patients with BPD show low overall reflective functioning (RF), which is the operationalization of mentalizing (Fischer-Kern et al., 2010; Gullestad et al., 2013; Levy et al., 2006), and this capacity is lower than that of patients with other or without personality disorders and lower than that of healthy controls (Fonagy et al., 1996). Comparably low scores are sometimes also found in patients with depression (Fischer-Kern & Tmej, 2019) and eating disorders (Fonagy et al., 1996; Ward et al., 2001). This supports the view that mentalizing is applicable as a transdiagnostic concept (Luyten et al., 2020) and leads to an interest in researching the role of mentalization in the psychotherapeutic process. Allen et al. (2008) suggest that mentalizing is relevant in all psychotherapies and impacts treatment outcomes. Systematic reviews have reached inconclusive results regarding mentalizing as a predictor of therapy outcome (Katznelson, 2014), change in mentalizing ability as a result of therapy, and the association between a change in mentalizing and a change in outcome (Lüdemann et al., 2021). These results can be reconsidered with regard to the underlying theoretical assumptions. For example, there seem to be more results consistent with the hypothesis for psychodynamic therapies (Katznelson, 2014). However, there is also a need for more in-depth investigations of the validity of existing instruments (Luyten et al., 2019) and for more refined assessments of mentalizing that capture, for example, the dimensions of mentalizing (Katznelson, 2014).
This scoping review aims to provide an overview of the available approaches to measure the patient's mentalization during psychotherapy. The specific questions to be investigated are (1) which observer-based and computerized measures are used at which measurement time points, (2) what psychometric properties the measures have, and (3) how sensitive the measures are to change.

Methods
This scoping review was conducted in accordance with the guidelines of the Joanna Briggs Institute (Peters et al., 2020) and the PRISMA-ScR checklist (Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews, Tricco et al., 2018). The review protocol is available at http://osf.io/vaqck/.

Inclusion Criteria
The inclusion criteria are based on the PCC framework (population, concept and context).
The concept of interest for this scoping review is observer-based and computerized measures that assess the mentalizing ability of individuals receiving psychotherapy. The only other restriction for the population is that they must be adults (i.e., at least 18 years old). Studies were included that either used a measure to assess mentalization or that developed or validated such a measure. Only measures that relate to Fonagy's definition of mentalization are considered in order to keep the heterogeneity low. Measures based on related constructs (i.e., theory of mind, metacognition) are excluded. Measures that assess parental RF are also excluded. The data collection should take place in the context of psychotherapy. To ensure this, the following conditions are set: (1) If the measure is used, it must take place either at a minimum of two different time points during psychotherapy or with a specific therapy session as an evaluation basis, and (2) if the measure is developed, it must be specific for use in psychotherapy. The first condition was changed during the screening process from measurement at the beginning and end to measurements at a minimum of two different time points.

Search Strategy
An electronic search was performed in four databases (APA PsycArticles Full Text, Ovid MEDLINE(R) ALL 1946 to February 19, 2023; APA PsycInfo 1806 to February Week 2 2023; PSYNDEXplus Literature and Audiovisual Media, 1977 to January 2023; and PSYNDEXplus Tests, 1945 to November 2022). The search strategy includes terms for mentalizing (reflective function*, mentali#ation, mentali#ing, mentalisieren) and psychotherapy (psychother*, therap*, treatment*). To narrow down the results, some words (parental, maternal, paternal, adolescen*, jugend*) were not allowed to appear in the title or abstract. The search included records from 1991 to February 2023. The time limit was chosen because Fonagy and colleagues promoted their concept of mentalizing in 1991 (Fonagy, 1991; Fonagy et al., 1991). No languages were specified as inclusion criteria in advance, and no study had to be excluded only because of its language. However, the focus was on English and German texts, and the search terms were in these languages. Sources of evidence included published work, such as journal articles or books, and unpublished work, particularly dissertations. Reviews were not considered.
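The logic of the search terms can be illustrated with a minimal Python sketch. It is not the database query that was actually run: it assumes that "*" denotes open-ended truncation and "#" exactly one character, and that the inclusion terms are matched against title and abstract (the text above states this only for the excluded words). The function name `matches_search` is hypothetical.

```python
import re

# Assumed regex translations of the search terms listed above:
# "*" (truncation) becomes \w*, "#" (one-character wildcard) becomes \w.
MENTALIZING = re.compile(
    r"\b(reflective function\w*|mentali\wation|mentali\wing|mentalisieren)\b", re.I
)
THERAPY = re.compile(r"\b(psychother\w*|therap\w*|treatment\w*)\b", re.I)
EXCLUDED = re.compile(r"\b(parental|maternal|paternal|adolescen\w*|jugend\w*)\b", re.I)


def matches_search(title: str, abstract: str) -> bool:
    """Return True if a record would pass this simplified version of the search."""
    text = f"{title} {abstract}"
    has_terms = bool(MENTALIZING.search(text)) and bool(THERAPY.search(text))
    return has_terms and not EXCLUDED.search(text)


# Hypothetical records, for illustration only.
print(matches_search("Reflective functioning in psychotherapy", ""))
print(matches_search("Parental mentalization and treatment", ""))
```

A sketch like this only approximates database behavior; field restrictions, stemming, and thesaurus terms of the real interfaces are not modeled.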

Source of Evidence Screening and Selection
The screening process was carried out in four stages. In a pilot phase, the clarity of the inclusion criteria was checked. For this purpose, the titles and abstracts of 25 randomly selected records were screened. Reviewers agreed in 92% of the cases. In the next step, all records were screened based on the title and abstract. The results were discussed at regular intervals to consider any necessary modifications to the inclusion criteria. Then, a decision was made based on a full-text examination. Finally, studies citing the included studies were identified. These studies and the reference lists of the included studies were screened for more eligible studies. Two reviewers (L.L. and L.H.) conducted the entire screening process independently. Disagreements were resolved by consensus. Data extraction was also performed by both reviewers.
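The 92% agreement in the pilot phase corresponds to 23 matching decisions out of 25 records. The following sketch shows how such a percent agreement can be computed, together with Cohen's kappa as a chance-corrected complement. Kappa is not reported in this review, and the individual screening decisions below are hypothetical; only the 23-of-25 agreement count is taken from the text.

```python
def percent_agreement(a, b):
    """Share of records on which two reviewers made the same decision."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Chance-corrected agreement between two reviewers."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    labels = set(a) | set(b)
    p_expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if p_expected == 1:  # both reviewers used a single identical label throughout
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical decisions for 25 pilot records (1 = include, 0 = exclude),
# constructed so that the reviewers disagree on exactly two records.
reviewer_1 = [1] * 5 + [0] * 20
reviewer_2 = [1] * 4 + [0] + [1] + [0] * 19

print(percent_agreement(reviewer_1, reviewer_2))
print(cohens_kappa(reviewer_1, reviewer_2))
```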

Data Extraction
The following data were extracted: author(s), title, year of publication, source of evidence, independent study, study design, sample characteristics (age, sex, diagnosis, sample size), therapy characteristics (therapy, treatment duration, frequency), measure(s) (measure, evaluation basis, measurement points, interrater reliability, other psychometric properties) and mentalization-related results (descriptive data: mean score (m) and standard deviation (sd), if available).

Analysis
The extracted data were grouped according to different content aspects, and the results are presented in a descriptive and tabular form.
To test the change sensitivity of the RF measures, pre-post effect sizes were determined. Hedges' g was used as the effect size and was calculated in such a way that a positive effect size represents an improvement in RF. An overall effect size across all instruments was determined, as well as for each instrument separately. In the case that two instruments were used to measure RF in a study, they were combined for the general analysis by calculating the mean effect size for both instruments. Different treatment arms of a study were included separately in the analysis. A random-effects model was used, and studies were weighted by inverse variance using DerSimonian and Laird's method (DerSimonian & Laird, 1986). For the evaluation of the effect sizes, the guidelines of Cohen (1988) are used, according to which effects > 0.2 are considered small, effects > 0.5 are considered medium, and effects > 0.8 are considered large.
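The analysis steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the computation used in this review: the exact Hedges' g variant is not specified in the text, so the sketch treats pre and post scores like two groups of equal size n and ignores the pre-post correlation. All function names and the example numbers are hypothetical.

```python
import math


def hedges_g(m_pre, sd_pre, m_post, sd_post, n):
    """Pre-post Hedges' g and its variance; positive g means RF improved.

    Assumption: pre and post are treated like two groups of size n,
    because the pre-post correlation is not available here.
    """
    sd_pooled = math.sqrt((sd_pre**2 + sd_post**2) / 2)
    d = (m_post - m_pre) / sd_pooled
    df = 2 * n - 2
    j = 1 - 3 / (4 * df - 1)        # small-sample correction factor
    var_d = 2 / n + d**2 / (4 * n)  # two-group variance with n1 = n2 = n
    return j * d, j**2 * var_d


def dersimonian_laird(effects, variances):
    """Random-effects pooling; returns (pooled, ci_low, ci_high, tau2)."""
    k = len(effects)
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_re = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se, tau2


def cohen_label(g):
    """Verbal label following the thresholds of Cohen (1988) cited above."""
    size = abs(g)
    if size > 0.8:
        return "large"
    if size > 0.5:
        return "medium"
    if size > 0.2:
        return "small"
    return "none"


# Hypothetical treatment arms: (m_pre, sd_pre, m_post, sd_post, n).
arms = [(3.0, 1.0, 3.5, 1.0, 20), (2.8, 1.2, 2.9, 1.1, 35)]
results = [hedges_g(*arm) for arm in arms]
pooled, low, high, tau2 = dersimonian_laird(
    [g for g, _ in results], [v for _, v in results]
)
print(cohen_label(pooled), round(low, 2), round(high, 2))
```

If the pre-post correlation were known, a repeated-measures variance formula would be the more appropriate choice; the two-group approximation is used here only because the text does not provide that information.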

Study selection
A total of 3217 records were screened, and 180 full texts were checked for eligibility. Of these, 84 publications belonging to 43 independent studies could be included. The reasons for exclusion are shown in Figure 1.

Study Characteristics
The main characteristics of the included studies are summarized in Table S1. Studies were conducted in Europe (i = 25), North America (i = 16) and South America (i = 2). Most studies have a naturalistic design, thirteen studies originate from randomized controlled trials (RCTs), and another eight studies are single case studies. The sample size varies between n = 1 and n = 400. Many studies investigate samples with personality disorders (i = 10) and, in particular, borderline personality disorder (i = 6). Other studies focus on samples with depressive symptoms (i = 4), eating disorders (i = 6), panic disorder (i = 2), or posttraumatic stress disorder (i = 2). In 35 studies, psychoanalytic and psychodynamic-oriented therapies are used for treatment. Therapies with a cognitive-behavioral background are used in 13 studies.

Measures and Measurement Time Points
Five different measures were identified: an observer-based measure, a computer program, and three clinician-report measures.
The Reflective Functioning Scale (RFS; Fonagy et al., 1998) is an observer-based measure that is applied to a narrative. Originally the RFS manual was developed to evaluate the Adult Attachment Interview (AAI; George et al., 1985). For this purpose, each question of the AAI is rated on a scale from −1 to 9. The range of −1 to 3 covers negative, absent, or questionable reflective functioning. Beginning with a value of 4, evidence of explicit reflections can be found. These can be assigned to the four qualitative markers: (a) awareness of the nature of mental states, (b) explicit effort to tease out mental states underlying behavior, (c) recognizing developmental aspects of mental states and (d) mental states in relation to the interviewer (Fonagy et al., 1998). The more elaborate and sophisticated (i.e., taking into account the mental states of different persons) a statement is, the higher it is rated. A distinction is made between demand and permit questions. Demand questions are expected to stimulate RF and are included in the overall score. Permit questions, on the other hand, are considered only if they contain explicit reflections or a negative RF. For evaluation with the RFS, the AAI is divided into eight demand and fifteen permit questions. Based on these separate scores, a total score is assigned, which can also vary between −1 and 9. When assigning the total score, the interview is considered as a whole, and the best fitting score is selected. For this purpose, there are rules specifying how often a score must be achieved and which scores (mostly negative RF) may not occur to achieve a certain total score (Fonagy et al., 1998). In the included studies, the RFS was applied to either a semistructured interview or a therapy session. Among the interviews used are the AAI as well as interviews based on the AAI that are either shorter (eight to eleven questions; Brief Reflective Functioning Interview, BRFI; Rudden et al., 2005; brief RF interview; Rudden et al., 2008) or rephrase the questions so that the therapist is the attachment figure rather than the parents (Patient-Therapist Adult Attachment Interview; Diamond et al., 1999; Patient Relationship Interview at Termination; Origlieri, 2017; Therapist Attachment Transference Interview; Szecsödy, 2008). One study used the Object Relations Inventory (ORI; Blatt et al., 1979). These narratives are used to determine a general RF. Other interviews focus specifically on a symptom, and the score obtained accordingly represents only the RF related to the symptom. This is called symptom-specific reflective functioning (SSRF). In the included publications, SSRF interviews were used for panic (Rudden et al., 2006), posttraumatic stress disorder (Rudden et al., 2009) or depression (Ekeblad et al., 2016). SSRF interviews consist of approximately three to five questions.
The application of the RFS to therapy transcripts is also referred to as in-session RF. To code in-session RF, small adjustments are usually made to the RFS. This includes changing the D marker to mental states in relation to the therapist and extending the evaluation to interactions outside the primary attachment figures. In contrast to the semistructured interviews in which the questions specify the assessment units, this structure is missing in the therapy sessions. This leads to different approaches that are followed to form the assessment units. The most common is a division into blocks of 150 words (i = 7). Other studies use the complete session (i = 3), thirds of the session (i = 1), three-minute segments (i = 2), or each patient statement or talk-turn (i = 5), or do not describe the approach comprehensibly (i = 3). Another difference is the way the total score is aggregated. Commonly used are the mean score, the highest score, or an algorithm similar to the evaluation of the AAI described in the RFS manual (see Table I).
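The most common segmentation and two of the aggregation variants can be sketched as follows. This is a simplified illustration: whether therapist words count toward the 150-word blocks is not specified above, so the sketch simply counts all words of the text it receives. The manual-based algorithm is not implemented because its rules are not reproduced here, and the block ratings are hypothetical.

```python
from statistics import mean


def split_into_blocks(text, size=150):
    """Divide a transcript into consecutive blocks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def aggregate(block_scores, method="mean"):
    """Form a session total score from the RFS ratings of the blocks."""
    if method == "mean":
        return mean(block_scores)
    if method == "highest":
        return max(block_scores)
    raise ValueError(f"unknown aggregation method: {method}")


# Hypothetical RFS ratings (scale from -1 to 9) for five blocks of one session.
ratings = [3, 4, 2, 5, 3]
print(aggregate(ratings, "mean"), aggregate(ratings, "highest"))
```

Because the highest block rating can never fall below the mean of the block ratings, total scores produced by the two variants are not directly comparable across studies.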
The next measure is a computerized text analysis measure of RF or, for short, computerized RF (CRF; Fertuck et al., 2012). CRF is based on evaluations with the RFS. For this purpose, 18 AAIs were analyzed, and words characteristic of a high or low RF value were identified. The resulting dictionaries can now be used to evaluate new narratives. The narratives used are AAIs or therapy sessions.
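The general idea of a dictionary-based text measure can be illustrated with a short sketch. It does not reproduce the CRF: the word lists below are invented placeholders, not the published dictionaries, and the scoring rule (share of high-RF marker words minus share of low-RF marker words) is an assumption made purely for illustration.

```python
import re

# Placeholder word lists; the actual CRF dictionaries are NOT reproduced here.
HIGH_RF_WORDS = {"perhaps", "realize", "wonder"}
LOW_RF_WORDS = {"always", "never"}


def dictionary_score(text):
    """Relative frequency of high-RF markers minus that of low-RF markers."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    high = sum(token in HIGH_RF_WORDS for token in tokens)
    low = sum(token in LOW_RF_WORDS for token in tokens)
    return (high - low) / len(tokens)


print(dictionary_score("I wonder, perhaps she never knew"))
```

The appeal of such an approach is that, once the dictionaries exist, new transcripts can be scored without trained raters.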
We identified three clinician-report measures. The Reflective Function Rating Scale (RFRS; Meehan et al., 2009) consists of 50 items based on the RFS manual. The items can be completed by therapists or observers using a 5-point Likert scale. Factor analysis resulted in the three factors: defensive/distorted, awareness of mental states and developmental. The Mentalization Imbalances Scale (MIS; Gagliardini et al., 2018) focuses on the dimensions of mentalizing and has the six subscales: cognitive, affective, others, self, automatic and external. The MIS has 22 items that are rated on a 6-point Likert scale.
The Modes of Mentalization Scale (MMS; Gagliardini & Colli, 2019) has 24 items that are also rated on a 6-point Likert scale. The MMS was developed to assess prementalizing modes and has the five factors: excessive certainty, concrete thinking, good mentalization, teleological thought, and intrusive pseudomentalization. One study tested whether the MIS and MMS could also be used as an observer-based instrument on the basis of therapy transcripts (Gagliardini et al., 2020a).
The RFS was used in 83.72% of all studies. The RFS is most often applied either to therapy sessions or AAIs (see Table II). Other variations of the RFS, as well as the CRF and clinician-report measures, are used in only a few studies. The in-session RF is measured at 1 to 36 measurement time points (therapy sessions). The RFS applied to the AAI (RFS-AAI for short) is often assessed at the start and end of therapy. The sample size of studies using the in-session RF was on average lower than that of studies using the RFS-AAI (m = 25.52, sd = 38.31 vs. m = 42.06, sd = 36.61). It is noticeable that more data are collected with shorter interviews, such as the BRFI or SSRF, and with the CRF (see Table II), considering the amount of data collected, defined as the total number of interviews, sessions, or questionnaires analyzed in a study.

Figure 1. Flow diagram adopted from the PRISMA statement (Moher et al., 2009).

Psychometric properties
Interrater reliability. Interrater reliability is reported for 58.21% of the RF measures, and an intraclass correlation coefficient (ICC) is used for this purpose in most cases (see Table S1). The reliabilities can all be considered good to excellent, according to Cicchetti (1994). Only for one scale of the MIS is a lower value of ICC = .56 reported (Gagliardini et al., 2020a). However, another study yielded a higher interrater reliability for the same subscale (ICC = .85; Gagliardini et al., 2020a).
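How an ICC is obtained and classified can be sketched as follows. The included studies do not all state which ICC form they used, so the two-way random-effects, absolute-agreement, single-rater form, ICC(2,1), is an assumption of this sketch. The cut-offs follow the commonly cited guideline of Cicchetti (1994): below .40 poor, .40 to .59 fair, .60 to .74 good, and .75 or higher excellent. The ratings in the example are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per rated unit and one column per rater."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def cicchetti_label(icc):
    """Verbal label following the guideline of Cicchetti (1994)."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# Hypothetical RF ratings of five interviews by two raters.
example = [[3, 4], [5, 5], [2, 3], [6, 5], [4, 4]]
print(cicchetti_label(icc_2_1(example)))
print(cicchetti_label(0.56), cicchetti_label(0.85))
```

The function assumes a complete rating table with some variation in the scores; identical ratings throughout would lead to a division by zero.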
Internal structure. Three studies examined the internal consistency and factor structure of the three clinician-report measures. The RFRS is examined by two studies. The independent factor analyses yield three factors each, but with clear differences in terms of item assignment (see Table III). Overall, the factor loadings of all items are sufficient to satisfactory except for one item of the factor blocked mentalizing. The internal consistency can be considered acceptable to good for most factors. Exceptions are the factors intrusive pseudomentalization and nonmentalizing behavior.
Convergent validity. In 21 studies, two or three instruments were used to measure RF. Seven publications reported correlations between the RFS-AAI and another measure. Three more publications report associations between the RFS-BRFI (or the more abbreviated brief RF interview) and other instruments. The correlations represent medium to high effect sizes (see Figure 2). An exception is the association between the RFS-AAI and the third factor of the RFRS.
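The internal consistency of a factor, as referred to under internal structure above, is typically quantified with Cronbach's alpha. The text does not name the coefficient used in the included studies, so alpha is an assumption of this sketch, and the item responses are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a table with one row per respondent, one column per item."""
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings of four patients on a three-item factor.
responses = [[4, 5, 4], [2, 3, 3], [5, 5, 4], [1, 2, 2]]
print(round(cronbach_alpha(responses), 2))
```

Alpha rises with the number of items and with their intercorrelations, which is one reason why short factors can show low internal consistency.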

Sensitivity to Change
In 56 publications, RF is assessed at two measurement time points, and the descriptive data are reported. After excluding single-case studies and double-reported data in different publications from the same project, 25 publications from 24 studies could be included in the analyses. One study (Compare et al., 2018) was a noticeable outlier with an effect size of g = 6.19, so this study was excluded from the analyses. The other studies report effect sizes between g = −0.82 and g = 1.15 (see Figure 3). When all instruments are examined together, there is no effect for sensitivity to change (g = 0.08; 95% CI −0.06-0.21). Studies using the RFS-AAI showed a small effect (g = 0.27; 95% CI 0.10-0.44). No effect could be found for the RFS applied to all other interviews (g = 0.07; 95% CI −0.15-0.29) or the in-session RF (g = −0.17; 95% CI −0.43-0.09).
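The reported confidence intervals can be read as follows: a pooled effect differs from zero at the 5% level only if its 95% CI excludes zero. The sketch below applies this check to the four pooled estimates reported above and recovers approximate standard errors, assuming symmetric normal-theory intervals (z = 1.96). The helper names are hypothetical.

```python
def se_from_ci(low, high, z=1.96):
    """Approximate standard error implied by a symmetric confidence interval."""
    return (high - low) / (2 * z)


def excludes_zero(low, high):
    """True if the interval lies entirely above or entirely below zero."""
    return low > 0 or high < 0


# Pooled estimates as reported above: (g, lower CI bound, upper CI bound).
pooled = {
    "all instruments": (0.08, -0.06, 0.21),
    "RFS-AAI": (0.27, 0.10, 0.44),
    "RFS, other interviews": (0.07, -0.15, 0.29),
    "in-session RF": (-0.17, -0.43, 0.09),
}

for name, (g, low, high) in pooled.items():
    print(name, round(se_from_ci(low, high), 3), excludes_zero(low, high))
```

Only the interval for the RFS-AAI lies entirely above zero, which matches the conclusion that this is the only measure showing sensitivity to change.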

Discussion
The aim of this scoping review was to present the different diagnostic approaches to measure the patient's mentalizing ability based on therapy sessions or in the course of psychotherapy.
In the review, 84 publications from 43 independent studies were included. We identified five instruments that explicitly refer to Fonagy's concept of mentalization: an observer-based measure (RFS), a computer program (CRF) and three clinician-report measures (RFRS, MIS, MMS). The RFS was used most frequently. The narratives used are mainly therapy sessions and the AAI. Among the studies that use the RFS-AAI, there are many secondary analyses. This may give the false impression that the RFS is used predominantly with the AAI. Comparing the two most common approaches, an advantage of the in-session RF over the RFS-AAI is the possibility to examine almost any number of measurement time points (depending on the length of the therapy). However, research shows that the sample sizes tend to decrease with increasing measurement time points.
For observer-based instruments, interrater reliability is an important quality criterion. In this aspect, all instruments perform mostly well. This is an expected result for the RFS, since good interrater reliability has already been shown in several studies (Fonagy et al., 1998; Taubner et al., 2013). In addition, ratings with the RFS were often performed by raters who had undergone training and obtained certification. However, this is a time-consuming process. In contrast, clinician-report measures do not require training. The developers of the MIS (Gagliardini et al., 2018) and MMS (Gagliardini & Colli, 2019) build on the finding that therapists make reliable judgments. This is supported by the good interrater reliabilities of the therapists, which are higher than the interrater reliabilities of graduate students after training (Gagliardini et al., 2020a). One limitation that must be mentioned is the very small sample of therapists. Few studies have examined convergent validity between different RF measures. Especially for clinician-report measures, further research is still lacking. More results are available on the comparison of the RFS applied to different narratives. The moderate correlations support the use of the RFS on narratives other than the AAI. However, the reported coefficients in the included studies are noticeably lower than the results on the relationship between the RFS-AAI and RFS-BRFI (r = .71-.88; Andreas et al., 2021; Rutimann & Meehan, 2012). This is not surprising, as the BRFI was developed as a short version of the AAI, whereas, e.g., the symptom-specific reflective functioning (SSRF) interviews and therapy sessions have a different thematic focus and may assess a different aspect of mentalizing.
Another issue that limits the direct comparability of RF scores measured with the RFS applied to different narratives is the previously mentioned variety of approaches to the division into assessment units and the generation of a total score. As a result, the descriptive data of different studies are comparable with each other only to a limited extent or not at all. The mean score of a session will nearly always be lower than the highest score of the same session. Using an algorithm as with the AAI will probably produce a value between the mean and highest scores. To the best of our knowledge, a systematic examination of the strengths and weaknesses of the different approaches to determine the in-session RF is not available.
Regarding the sensitivity to change, a significant result could be found only for the RFS-AAI. According to Cohen, there was a small effect for an improvement of the RF scores after therapy. Mentalizing is considered to have both trait and state aspects (Luyten et al., 2020), with the RFS-AAI capturing more of the stable trait (Hörz-Sagstetter et al., 2015). This consideration is supported by the small effect found in this review. No effect for sensitivity to change was found for the in-session RF. Possible reasons for this could be methodological shortcomings of the in-session RF. Although the RFS is slightly adapted when applied to therapy sessions, it is conceivable that more extensive adaptations are needed. As already noted, an AAI and a therapy session can be quite different in their topics, so other or additional qualitative markers might be needed to comprehensively identify RF in the context of a therapy session. This could contribute to an explanation of why the in-session RF seems to be highest at the beginning of psychotherapy. At the initial stage of therapy, there is often an exploration of the patient's family background, which is similar to the content of the AAI and may facilitate easy application of the RFS markers. A focus on the here and now in the further course of therapy as well as a transfer to everyday life and the farewell at the end of a therapy, in contrast, might be more difficult to rate for RF and might not open up so much opportunity for the patient to even show RF. Therefore, it seems important to consider the entire course of therapy in future studies. Bernbach (2001), for example, reports an initial improvement in RF scores followed by a decrease at the end of therapy. Likewise, Vermote et al. (2010) find the trend of a cubic change. The approach to assigning the overall score for the therapy session may also have contributed to the lack of sensitivity to change. A consideration could be that especially the variant with the highest individual rating as the total score could be very vulnerable to bias by the rater. This leads to the question whether an improvement in RF is characterized by someone mentalizing more frequently, by someone being able to verbalize more complex reflections, or by an increase in quantity and quality.
Apart from possible psychometric weaknesses of the measures, other factors could cause or contribute to the failure to find a consistent change in RF. The characteristics of the included studies already indicate potential methodological limitations of research on mentalizing in psychotherapy. Only seven studies used an RCT design and examined all study arms concerning mentalization. Thus, there are few studies investigating mentalization under controlled conditions. Furthermore, over 60% of the studies had a sample size of n ≤ 30, which limits the generalizability of the study results. It also cannot be ruled out that patients' in-session RF or SSRF simply does not improve during therapy, and therefore, no change can be measured. Although there is an assumption that mentalizing might be a transtherapeutic mechanism of change (Allen et al., 2008), the construct has a psychoanalytic background and might be particularly relevant in psychodynamically oriented therapies. This is supported by findings from Barber et al. (2020), Levy et al. (2006) and Rudden et al. (2006) in which the psychodynamically oriented study arm is superior to the behaviorally oriented arm with regard to an improvement in mentalizing. In this context, it is surprising that there are few studies on mentalization-based therapy (MBT; Bateman & Fonagy, 2004). MBT was initially developed to treat borderline personality disorder and aims to strengthen the patients' mentalization ability, thereby leading to a reduction in symptoms.
Therefore, it could be assumed that MBT is particularly well suited to investigate mentalizing in the psychotherapy process.
This scoping review has some limitations. One limitation is excluding self-assessment instruments, which are a much more time-efficient way of collecting RF data. Among these are the Mentalization Questionnaire (MZQ; Hausberg et al., 2012), the Reflective Functioning Questionnaire (Fonagy et al., 2016) and the Mentalization Scale (Dimitrijević et al., 2018), as well as the newly developed Certainty About Mental States Questionnaire (Müller et al., 2021) and Multidimensional Mentalizing Questionnaire (Gori et al., 2021). However, it is questionable whether individuals with low mentalizing capacities have the ability to accurately assess themselves in terms of this capacity (Fonagy et al., 2016). To our knowledge, of these self-report measures, only the MZQ has been compared to the RFS-AAI. The MZQ is intended primarily as a screening method for clinical samples. It could be shown that individuals who obtain a low RF score based on the AAI or BRFI also have a lower score on the MZQ than individuals who obtain an average or above-average RF score (t-test, df = 135, p < .01; Andreas et al., 2021).
The meta-analysis results can only be interpreted with caution due to the large heterogeneity of the included studies with regard to diagnosis, type and duration of therapy, and measurement time points. In addition, some instruments have not yet been implemented often enough to make a valid statement regarding their sensitivity to change.
The broad inclusion criteria are simultaneously a weakness and strength of this review. The disadvantage is that quite heterogeneous studies are compared, and many details could not be addressed. The advantage is that many publications could be included, and thus, a thorough overview of the research field could be given.

Conclusion
Strengths of the RFS-AAI are an in-depth analysis, an interpretable total score and its well-studied convergent validity. In line with theoretical considerations, associations with psychopathology (Fonagy et al., 1996; Kuipers & Bekker, 2012; Kuipers et al., 2016), attachment (Bouchard et al., 2008; Klasen et al., 2019; Nazzaro et al., 2017), infant attachment (Ensink et al., 2016; Steele et al., 1996) and level of structural functioning (Daudert, 2001) could be shown in many cases. A major limitation of using the RFS-AAI in research is the time required for its application (Choi-Kain & Gunderson, 2008). For the generation of one RF value, a workload of up to 15 hours can be expected, consisting of the time needed for the interview, the transcription and the coding with the RFS (Taubner & Sevecke, 2015). This has a particularly unfavorable impact on research designs that require repeated measures, which are needed, for example, when investigating whether RF improves with psychotherapy or whether RF is a mechanism of change. Another limitation is the recommended time interval of six months between assessments with the AAI to prevent repetition effects.
Shorter interviews such as the BRFI have an obvious economic advantage. This allows for larger samples and more measurement time points. In addition, shorter intervals between assessments can be selected, which is necessary for research on short-term therapies, for example. SSRF interviews are even shorter and have the added advantage of a different and perhaps more therapy-relevant topic, the patient's symptoms. The use of SSRF becomes particularly exciting under the assumptions that symptom-specific RF may be more impaired than general RF (Rudden et al., 2008) and, accordingly, more likely to improve.
The advantage of in-session RF compared to interview-based measures and questionnaires (self and clinician-report) is the possibility of analyzing the therapy process on a micro level. This allows, for example, the study of the direct influence of specific therapeutic interventions on the patient's RF (e.g., Georg et al., 2019; Karlsson & Kermott, 2006; Möller et al., 2017). On the other hand, in-session RF does not seem to be the most appropriate tool for pre-post designs. Studies consistently find high fluctuations within and between sessions of the same patient (Hörz-Sagstetter et al., 2015; Josephs et al., 2004; Kornhas et al., 2020; Zeeck et al., 2022). This could lead to bias in the results based on the selected sessions. This circumstance often seems to be accounted for by using several sessions for one measurement time point.
Another criticism of the RFS in general is the total score, which does not account for the different dimensions of the mentalizing construct (Choi-Kain & Gunderson, 2008; Solbakken et al., 2011). The scale appears to be slightly unbalanced with an overly sensitive distinction for good mentalizing whereas the low range could be more differentiated. This is particularly relevant in clinical samples, where low scores are predominant. At this point, the MIS and the MMS may help to better describe the individual deficits in mentalizing and to plan appropriate therapeutic interventions.
Mentalizing continues to be a popular concept, and new measures are constantly being developed. However, the various instruments seem to be applied only very sporadically in some cases. There is a lack of standardized approaches, comprehensive psychometric studies, and replications. The best-validated instrument is the Reflective Functioning Scale, whose main weakness is its time-consuming application. Thus, an alternative for large-scale studies is needed. The more economical Computerized Reflective Functioning and the clinician-report measures Mentalization Imbalances Scale and Modes of Mentalization Scale could become such an alternative. However, further validation is required for all three measures.
Psychotherapy Research 429

Figure 3. Effects of sensitivity to change sorted by the used measure and effect size.

Table I. The different approaches to generate the total score for in-session RF.

Table II. Frequency distribution and measurement time points of the measures.
a Measurement time points for two studies were not included because they used only selected episodes from the therapy sessions instead of the entire session.

Table III. Factor structure and internal consistency.