Latent Markov Factor Analysis for Exploring Measurement Model Changes in Time-Intensive Longitudinal Studies

When time-intensive longitudinal data are used to study daily-life dynamics of psychological constructs (e.g., well-being) within persons over time (e.g., by means of experience sampling methodology), the measurement model (MM)—indicating which constructs are measured by which items—can be affected by time- or situation-specific artifacts (e.g., response styles and altered item interpretation). If not captured, these changes might lead to invalid inferences about the constructs. Existing methodology can only test for a priori hypotheses on MM changes, which are often absent or incomplete. Therefore, we present the exploratory method “latent Markov factor analysis” (LMFA), wherein a latent Markov chain captures MM changes by clustering observations per subject into a few states. Specifically, each state gathers validly comparable observations, and state-specific factor analyses reveal what the MMs look like. LMFA performs well in recovering parameters under a wide range of simulated conditions, and its empirical value is illustrated with an example.


INTRODUCTION
Time-intensive longitudinal data for studying daily-life dynamics of psychological constructs (such as well-being and positive affect) within persons allow to delve into time-or situation-specific effects (e.g., stress) on the (e.g., emotional) experiences of a large number of subjects (Larson & Csikszentmihalyi, 2014). The go-to research design to collect such data is experience sampling methodology (ESM; Scollon, Kim-Prieto, & Diener, 2003). Participants repeatedly answer questionnaires at randomized or event-based time-points via smartphone apps, for example, eight times a day over a few weeks.
While the technology for collecting ESM data is readily available, the methodology to validly analyze these data are lagging behind. This article provides an upgrade of the methodology by presenting a novel method for tracking and diagnosing changes in measurement models (MMs) over time.
The MM is the model underlying a participant's answers and indicates which unobservable or latent variables (i.e., psychological constructs) are measured by which items. Traditionally, it is evaluated by factor analysis (FA; Lawley & Maxwell, 1962), where the factors correspond-ideallyto the hypothesized constructs. Factor loadings express the degree to which each of the items measures a factor and thus how strongly an item relates to an underlying factor. In order to meaningfully compare constructs over time, the MM needs to be invariant across measurement occasions (Adolf, Schuurman, Borkenau, Borsboom, & Dolan, 2014). However, measurement invariance (MI) does not always hold over time because the MM likely changes over the course of an ESM study. First, in ESM, the measurement quality is undermined by time-or situation-specific artifacts such as response styles (RSs; Moors, 2003;Paulhus, 1991). Indeed, participants fill in their questionnaires repeatedly in various, possibly distracting, situations (e.g., during work) or lose motivation to repeatedly answer questions, which may drive the tendency to, for example, use the extreme response categories only (extreme RS; Moors, 2003;Morren, Gelissen, & Vermunt, 2011). Second, substantive changes may occur over time in what questionnaire items are measuring. For example, depending on the context or mental state, an item may become more important for the measured construct (i.e., loading increases) or (also) an indicator of another construct (i.e., loads strongly on another factor; reprioritization or reconceptualization; Oort, Visser, & Sprangers, 2005). Moreover, the nature of the measured constructs might change entirely; for example, when positive affect and negative affect factors are replaced by high and low arousal factors (Feldman, 1995). In any case, when ignoring changes in the MM, changes in the scores will be interpreted as changes in the psychological constructs, although they are (partly) caused by RSs or changed item interpretation.
To safeguard validity of their time-intensive longitudinal studies, substantive researchers need an efficient approach to evaluate which MMs are underlying the data and for which time-points they apply, so that they can gain insight into which artifacts and substantive changes are at play and when. Researchers can take these insights into account when analyzing the data, when setting up future projects or to derive new substantive findings from the MM changes. To meet this need, we present latent Markov factor analysis (LMFA), 1 which combines two building blocks to model MM changes within subjects over time: (1) latent Markov modeling (LMM; Bartolucci, Farcomeni, & Pennoni, 2014;Collins & Lanza, 2010) clusters time-points into states according to the MMs and (2) FA (Lawley & Maxwell, 1962) evaluates which MM applies for each state. Note that LMFA can be applied for single cases, when enough observations are available for that one subject.
Within the states of LMFA, exploratory factor analysis (EFA) rather than confirmatory factor analysis (CFA) is used. In CFA, users have to specify which items are measuring which factors based on a priori hypotheses. This implies that certain item-factor relations are assumed to be absent, and the corresponding factor loadings are set to zero. Thus, for a large part, CFA already imposes a certain MM and thus limits the changes in the MM that can be found. In contrast, EFA estimates all factor loadings and thus explores all kinds of (unknown) MM changes, including changes in cross-loadings (i.e., items loading on more than one factor) or even in the nature and number of factors (e.g., an additional RS factor). However, if desired, CFA can be used within the states.
An existing method to evaluate whether MI holds over time is longitudinal structural equation modeling (LSEM; Little, Preacher, Selig, & Card, 2007). However, this method merely tests whether MI across time-points holds for all individuals simultaneously, without directly providing insight in for which measurement occasions invariance is violated and what the alternative MMs look like. In contrast to LMFA, LSEM provides no clues for understanding or dealing with the noninvariance. Also, it applies CFA, and thus already assumes a certain factor structure, and is thus too restrictive to detect many MM differences. A few methods exist that combine FA with LMM and thus could potentially be useful for identifying violations of MI over time 2 (Asparouhov, Hamaker, & Muthen, 2017;Song, Xia, & Zhu, 2017;Xia, Tang, & Gou, 2016). However, these methods also apply CFA, making them too restrictive to detect all kinds of MM differences. In contrast, factor-analyzed hidden Markov modeling (FAHMM; Rosti & Gales, 2002) is similar to LMFA because it combines EFA with LMM but was developed merely for accommodating LMM estimation when conditional independence is violated among many variables, using the state-specific FA to reduce the number of parameters of the state-specific covariance matrices rather than being the point of interest (Kang & Thakor, 2012;Rosti & Gales, 2002). Also, FAHMM cannot analyze multiple subjects simultaneously. Thus, LMFA may be conceived as a multisubject extension of FAHMM, tailored to tackle measurement noninvariance in time-intensive longitudinal data.
The remainder of this article is organized as follows: Section 2 describes the multisubject longitudinal data structure, an empirical example, and the LMFA model specifications and estimation. Section 3 presents a simulation study, evaluating the goodness of recovery of states and state-specific MMs under several conditions as well as model selection. Section 4 illustrates LMFA with an application. Section 5 concludes with some points of discussion and directions for future research.

Data structure and motivating example
Like in ESM, we assume repeated measures data where observations are nested in subjects. For each measurement occasion, 1 Latent Markov factor analysis (LMFA) builds upon mixture simultaneous factor analysis (MSFA; De Roover et al., 2017), which captures differences in the factor model between groups. Whereas MSFA typically models the data of subjects nested within groups, LMFA specifically deals with observations nested within subjects, and it allows subjects to switch between different measurement models (MMs) over time. data on multiple continuous variables are available. The observed scores are indicated by y ijt , where i ¼ 1; . . . ; I refers to subjects, j ¼ 1; . . . ; J to items, and t ¼ 1; . . . ; T to timepoints, where the latter may differ across subjects (i.e., T i Þ but we mostly omit the index i for simplicity of notation. The J × 1 vector y it ¼ ðy i1t ; y i2t ; . . . , y iJt Þ' contains the multivariate responses for subject i at time-point t and the T × J datamatrix Y i ¼ y i1 ; y i2 ; . . . ; y iT ð Þ 0 contains data for subject i for all T time-points. To clarify the data structure and illustrate the problem of measurement noninvariance, consider the ESM data of the "No Fun No Glory" study described in more detail by Van Roekel et al. (2017). In brief, the data contained repeated emotion measures of 69 young adults with persistent anhedonia, which is the diminished pleasure in response to previously enjoyable experiences and one of the core symptoms of depression (American Psychiatric Association, 2013;Treadway & Zald, 2011). Over a course of about 3 months, every evening, the participants rated on a Visual Analogue Scale, ranging from 0 ("Not at all") to 100 ("Very much"), how much they had felt each of 18 emotions (listed in Table 3, which is further described in Section 4) in the past 6 hr. 3 The number of repeated measures ranged from 86 to 132 (M = 106.86, SD = 8.21) and resulted in 7,373 total observations of which 557 were missing. 4 After the first month, the participants randomly received (a) no intervention (n = 22), (b) a personalized lifestyle advice (PLA; n = 23), or (c) a PLA and tandem skydive (PLA & SkyD; n = 24) to potentially reduce anhedonia. After the second month, all participants chose one of the interventions, regardless of their first one (no: n = 3; PLA: n = 17; PLA & SkyD: n = 49). In their original study, Van Roekel et al. (2017) investigated whether the interventions decreased anhedonia, thereby assuming the two underlying factors positive affect (PA) and negative affect (NA). However, if the MM changes over the course of participation (e.g., due to the interventions), conclusions about changes in PA and NA may be invalid. In Section 4, LMFA is used to trace potential MM changes in these data.

Latent Markov factor analysis
In this section, we introduce LMM (Section "Latent Markov modeling") before describing LMFA in more detail (Section "Latent Markov factor analysis").

Latent Markov modeling
The LMM (also a hidden Markov or latent transition model; Bartolucci et al., 2014;Collins & Lanza, 2010) captures unobserved heterogeneity or changes over time by means of latent states. In contrast to standard latent class models (Hagenaars & McCutcheon, 2002;Lazarsfeld & Henry, 1968), which identify subgroups or so-called latent classes within a population (e.g., high or low risk for depression), an LMM allows respondents to transition between latent states over time and thus to switch between subgroups (e.g., from a high-risk subgroup to a low-risk subgroup). Thus, the states may be conceived as dynamic latent classes. Specifically, the LMM is a probabilistic model where the probability of being in a certain state at time-point t depends only on the state of the previous time-point t À 1 (first-order Markov assumption). Furthermore, the responses at timepoint t depend only on the state at time-point t (local independence assumption; Bartolucci, 2006;Vermunt, Langeheine, & Böckenholt, 1999). The joint probability of observations and states for subject i is then zfflfflfflffl ffl}|fflfflfflffl ffl{ response probabilities (1) where s it are K × 1 binary variables indicating whether an observation belongs to a state or not and S i = s i1 ; s i2 ; . . . ; s iT ð Þis the subject-specific state membership matrix. In the following, k ¼ 1; . . . ; K refers to the states, and if s itk ¼ 1, subject i is in state k at time-point t. Equation (1) includes three types of parameters: (a) The initial state probabilities indicate the probabilities to start in a certain state, p s i1k ¼ 1 ð Þ ; and thus how the subjects are distributed across the states at t = 1. They are often denoted as π k , with P K k¼1 π k ¼ 1; and are gathered in a K × 1 vector π. (b) The transition probabilities indicate the probabilities of being in a certain state at time-point t conditional on the state at t À 1, pðs itk js itÀ1;l Þ, where l ¼ 1; . . . ; K. These may be denoted as a lk , with P K k¼1 a lk ¼ 1, and are collected in a K × K transition probability matrix A. The transition probabilities are often assumed to be homogeneous (i.e., invariant) across time (and subjects). The resulting sequence of states is called a latent Markov chain (LMC). (c) The response probabilities indicate the probability of a certain item response given the state at time-point t, p y it js it ð Þ, which correspond to a the multivariate normal density for continuous responses.

Latent Markov factor analysis
In LMFA, an LMM is used to capture the changes in MMs over time, and FA (Lawley & Maxwell, 1962) is applied per state to model the state-specific MMs. The latter is given by where Λ k is a state-specific J × F k loading matrix, f it is a subject-specific F k × 1 vector of factor scores at timepoint t (where F k is the state-specific number of factors), ν k is a state-specific J × 1 intercept vector, and e it is a subjectspecific J × 1 vector of residuals at time-point t. The distributional assumptions are as follows: f it , MVN 0; Ψ k ð Þ and factor scores are thus centered around zero and e it ,MVN 0; D k ð Þ, where D k contains the unique variances d kj on the diagonal and zeros on the off-diagonal. To partially identify the model, factor variances in Ψ k are restricted to one, and the remaining rotational freedom is dealt with by means of criteria to optimize the simple structure or between-state agreement of the factor loadings, such as Varimax (Kaiser, 1958), oblimin (Clarkson & Jennrich, 1988), or generalized procrustes (Kiers, 1997).
From Equation (2), it is clear that the states may differ in terms of their intercepts ν k , loadings Λ k , unique variances D k , and/or factor covariances Ψ k . This implies that LMFA allows to explore all levels of measurement noninvariance at once. This is (a) configural invariance (invariant number of factors and pattern of zero loadings), (b) weak factorial invariance (invariant nonzero factor loadings), (c) strong factorial invariance (invariant item intercepts), and (d) strict factorial invariance (invariant unique variances). Conveniently, in any case, the strictest level of invariance applies within each state (for more details, see Little et al., 2007;Meredith, 1993;Meredith & Teresi, 2006;Schaie, Maitland, Willis, & Intrieri, 1998). Figure 1 illustrates how LMFA captures the different levels of noninvariance over time based on an example of what might happen in the empirical data by comparing the State 1 MM, respectively, to the State 2 and State 3 MMs, with dashed lines representing parameter changes.
The depicted loadings can be thought of as standardized rotated loadings higher than, for example, 0.4 in absolute value (Stevens, 1992). We start by comparing the State 1 MM to the State 2 MM. Here, configural invariance is violated because a third factor ("high arousal" ["HA"]) appears, implying that the State 1 items measuring either PA or NA with loadings λ 141 , λ 151 , and λ 162 measure another construct (i.e., HA) in State 2 (now with loadings λ 242 , λ 252 , and λ 262 ). This also changes the meaning of the other factors into "low arousal PA" ("LA-PA") and "low arousal NA" ("LA-NA"). Next, we compare the State 1 MM with the State 3 MM. First, weak factorial invariance is violated here because λ 111 differs from λ 311 , and thus, the items measure PA and NA differently. Second, strong factorial invariance is violated because ν 12 differs from ν 22 . Note that, when weak invariance appears to hold, properly assessing strong invariance would require reestimating the model with invariant factor loadings across the states and nonzero state-specific factor means. Finally, strict factorial invariance is violated because e 11 differs from e 31 . Usually, strong factorial invariance is said to be sufficient for comparing latent constructs over time, that is, differences in factor means then correspond to actual changes in the latent variables.
It is important to note that the subjects do not have to go through all the states nor do they have to go through the states in the same order. Relatedly, LMFA does not assume FIGURE 1 Graphical illustration of a subject-specific LMC from an LMFA model, where the latent states per measurement occasion k t (t = 1, …, T) indicate the measurement model underlying a respondent's observed item scores. Note that to give a clear example, only the standardized factor loadings with an absolute value larger than 0.4 are depicted. Also note that the factor scores (e.g., on PA and NA) are not depicted in the figure since they are not part of the MM but that they may or may not change within a state over time but average zero. 560 VOGELSMEIER ET AL. homogeneous transition probabilities across subjects but allows for subject-specific A i matrices, implying that some transition probabilities may be zero for a certain subject if that subject does not go through a particular state. This is because subjects likely differ in how stable they respond to questionnaires (e.g., some people might switch more between contexts than others or may be more sensitive to contextual influence or distractions). The transition process A i is assumed to be time homogeneous for each subject, although this is an assumption that might be relaxed in the future.
To conclude, in LMFA, the states indicate for which time-points the data are validly comparable (strict MI applies within each state), and by comparing the statespecific MM parameters, one may even evaluate which level of invariance holds for which (pairs of) states and which specific MM parameters are noninvariant.

MODEL ESTIMATION
To estimate the LMFA model we aim to find the model parameters θ (i.e., the initial state probabilities π, the transition probabilities A i , the intercepts ν k , and the factor-analyzed covariance matrices Σ k ¼ Λ k Λ k þ D k ) that maximize the loglikelihood function logL. The logL is derived from Equation (1) by summing over all possible state sequences, taking the logarithm, and considering all the subjects at once: Note that the model captures the dependencies only between observations that can be explained by the states but not the autocorrelations of factors within the states. Because the logL is complicated by the latent states, nonlinear optimization algorithms are necessary to find the maximum likelihood (ML) solution (e.g., De Roover, Vermunt, Timmerman, & Ceulemans, 2017;Myung, 2003). LMFA can be estimated by means of Latent Gold (LG) syntax 5 (Vermunt & Magidson, 2016;Appendix B). Specifically, the ML estimation is performed by an expectation maximization (EM; Dempster, Laird, & Rubin, 1977) procedure described in Appendix A. Note that this procedure assumes the number of states K and factors within the states F k to be known. The most appropriate K and F k is determined by comparing competing models in terms of their fit-complexity balance. To this end, the Bayesian information criterion (BIC) can be applied, which proved to be effective for both FA (Lopes & West, 2004) and LMM (Bartolucci, Farcomeni, & Pennoni, 2015). Moreover, it may happen that the estimation converges to a local instead of a global maximum. To decrease the probability of finding a local maximum, LG applies a multistart procedure, in which the initial values are automatically chosen based on the loadings and residual variances obtained from principal component analysis (PCA; Jolliffe, 1986) on the entire data-matrix. For each state, randomness is added to get K different sets of initial parameter values (for more details, see De Roover et al., 2017).

SIMULATION STUDY Problem
To evaluate how well LMFA performs in recovering states and state-specific factor models, we manipulated seven factors that affect state separation and thus potentially the recovery: (a) number of factors, (b) number of states, (c) between-state difference (consisting of differences in factor loadings and intercepts), (d) unique variance, (e) frequency of transitions, (f) number of subjects, and (g) number of observations per subject and state. For the number of factors (a), we expect the performance to be lower for more factors due to the higher model complexity and the lower level of factor overdetermination (given a fixed number of variables; MacCallum, Widaman, Preacher, & Hong, 2001;MacCallum, Widaman, Zhang, & Hong, 1999;Preacher & MacCallum, 2002). With respect to the number of states (b), a higher number of states also increases the model complexity and thus, probably, decreases the performance. However, in case of a Markov model, the increase in model complexity with additional states is suppressed by the level of dependency of the states at consecutive time-points. Thus, with respect to (e), we anticipate LMFA to perform worse in case of more frequent state transitions, and thus lower probabilities of staying in a state, because this implies a lower dependence on the state of the previous timepoint (Carvalho & Lopes, 2007). With respect to (c), we expect a decrease in performance for more similar factor loadings (De Roover et al., 2017) and/or intercepts across states. Regarding (d), LMFA is expected to perform better with a lower unique variance and thus a higher common variance because this increases the factor overdetermination (Briggs & MacCallum, 2003;Ximenez, 2009;Ximénez, 2006). Factors (f) and (g) pertain to the within-state sample size (i.e., the amount of information) per state in terms of number of subjects and observations per subject and state. We expect a higher performance with increasing information (de Winter, Dodou, & Wieringa, 2009;Steinley & Brusco, 2011). Note that we also tested whether lag-one autocorrelations of factors harm the performance of LMFA, which was not the case (Appendix C). In addition, for selected conditions, we evaluated the BIC in terms of the frequency of correct model selection.
5 A user-friendly graphical interface in LG including a tutorial will be developed in the future.

561
Design and procedure We crossed seven factors in a complete factorial design with the following conditions 6 : a. number of factors per state F k at two levels: 2*, 4*; b. number of states K at three levels: 2*, 3, 4*; c. between-state differences at eight levels: d. unique variance e at two levels: 0.2 and 0.4*; e. frequency of state transitions at three levels: highly frequent, frequent, infrequent*; f. number of subjects N at three levels: 2, 5*, 10; g. number of observations per subject and state T ik at three levels: 50, 100*, 200.
The number of variables J was fixed to 20. The number of factors per state F k was either 2 or 4 (a) and was the same across the states. The two, three, or four states (b) differed in factor loadings and intercepts. The degree of the between-state loading difference (c) was medium, low (i.e., highly similar loadings), or nonexistent (i.e., identical loadings across states). Between the state-specific intercepts, there was no difference, a medium difference, or a high difference. The combination of no loading difference and no intercept difference was omitted because this implies no difference in MMs and thus only one state. Note that the degree of the between-state differences was the same for each pair of states.
Regarding the factor loadings Λ k of the generating model, for all conditions, a binary simple structure matrix was used as a common "base" (see Table 1). The loading matrices were representative for the ones commonly found in psychological research (cf., the PA and NA structure assumed by the original researchers of the "no fun no glory study"). In these matrices, all variables loaded on one factor only, and the variables were equally divided over the factors. In case of two factors, this implied that each factor had 10 nonzero loadings, whereas in case of four factors, each factor consisted of five nonzero loadings. For the "no loading difference" conditions, the simple structure base matrix was used as Λ k in all the states, implying no change in loadings across the states. For the low and medium loading difference conditions, the base matrix was altered differently for each state to create the state-specific loading matrices. Thus, no state will have a factor loading structure equal to the base matrix in Table 1. For each state, regardless of the number of factors, we applied the alteration procedure described below.
Whether F k = 2 or F k = 4, the manipulations were only applied to the first two factors. Thus, for F k = 4, the third and fourth factors are identical across states. For the medium loading difference conditions, the state-specific loading matrices were created by shifting one loading from the first factor to the second one and one loading from the second factor to the first one. This implies that the overdetermination of the factors is unaffected. For the low loading difference condition, the state-specific loading matrices were created by adding crossloadings of ffiffiffiffi :5 p for two variables, that is, one for Factor 1 and one for Factor 2, and lowering the primary loading accordingly to ffiffiffiffi :5 p . This manipulation preserves both the rowwise and columnwise sum of squares (i.e., the variables' common variance and the variance accounted for by each factor). Variables affected by the loading shifts and added cross-loadings differed across states (see Table 1). 7 To quantify the similarity of the state-specific loadings per condition, a congruence coefficient 8 φ (Tucker, 1951) was Factor 2 Note. For the medium loading difference, λ 1 ¼ 0 and λ 2 ¼ 1; for the low loading differences, λ 1 ¼ ffiffiffiffi :5 p and λ 2 ¼ ffiffiffiffi :5 p . Entries of Variables 5-10 and 15-20 equal those of Variables 4 and 14, respectively. The fourfactor matrices were created by applying the same λ 1 and λ 2 values to other variables because of fewer loadings per factor. medium loading difference and no intercept difference, low loading difference and no intercept difference, medium loading difference and low intercept difference*, low loading difference and low intercept difference*, no loading difference and low intercept difference, medium loading difference and medium intercept difference*, low loading difference and medium intercept difference*, no loading difference and medium intercept difference; computed per factor for each pair of the loading matrices. A φ of one indicates proportionally identical factors (as in the no loading difference conditions). The grand mean φ mean across all state pairs and factors amounted to 0.80 for the medium loading difference conditions and 0.94 for the low loading difference conditions, regardless of the numbers of states and factors. Finally, the matrices were rowwise rescaled such that the sum of squares of each row equaled 1 À e, where e was either 0.40 or 0.20 (g). Intercept differences were induced as follows. For all variables in all states, the intercept was initially determined to be 5 and kept as such for the no intercept difference conditions. Two of the intercepts (different ones across the states) were increased from 5 to 5.5 for the low intercept difference conditions and from 5 to 7 for the medium intercept difference conditions.
Regarding the frequency of state transitions (e), we manipulated three levels that we considered to be realistic for ESM data. Note that we allowed for between-subject differences in the transition probabilities by randomly sampling each set of subject-specific probabilities from a uniform distribution within a specified range of probabilities. Specifically, the probabilities to stay in a state and to switch to another state were, respectively, sampled from U(0.73, 0.77) and U(0.01, [0.27/(K−1)]) in the highly frequent condition, from U(0.83, 0.87) and U(0.01, [0.17/(K−1)]) in the frequent condition, and from U(0.93, 0.97) and U(0.01, [0.07/(K−1)]) in the infrequent condition. Then, for each resulting matrix, we rescaled the off-diagonal elements of each row to sum to 1 minus the diagonal element of that row, thus maintaining the probabilities to stay in a state and hence also the frequency of switching. As a result, out of the total number of possible transitions (i.e., across subjects (i.e., Þ) and across all data-matrices), a switch to another state occurred for 25% of the possible occasions in the highly frequent condition, for 15% in the frequent condition, and for 5% in the infrequent condition.
Depending on the condition, data-matrices with the above-described characteristics were simulated for 2, 5, or 10 subjects (f). Note that limiting our study to such low subject numbers not only confines the computation time but also challenges the method. We expect performance to improve with additional subjects because this accumulates the amount of data within the states. Furthermore, the number of observations per subject and state, T ik , was 50, 100, or 200 (g) for i = 1, …, I and k = 1, …, K. Thus, the total number of observations T i per subject depended on (b) and (g). Similarly, the within-state sample size per state (over subjects) depended on (f) and (g).
For each subject, an LMC was generated indicating in which state subject i was at each time-point t. The initial state was randomly sampled from a Bernoulli distribution (for k = 2) or multinomial distribution (for k > 2) with equal initial state probabilities. 9 The remaining LMC was generated by sampling a random sequence of states based on the subject-specific transition probability matrix (i.e., depending on (e)). Note that whenever a state was not represented in a sampled LMC-because the small sample sizes occasionally led to a data set wherein a certain state was not represented-we rejected it and sampled another one, such that parameter estimation was possible for all states.
Given this LMC, a subject-specific datamatrix was generated according to Equation (2) assuming orthogonal factors. First, we sampled a factor score vector f it~M VN 0; I ð Þ of length F and a residual vector e itM VN 0; D k ð Þ of length J for each of the T i observations, where the diagonal elements of D k are equal to 0.20 or 0.40 (g). Subsequently, each vector of observations y it was created with the loading matrix Λ k and vector of intercepts ν k pertaining to the state that subject i was in at time-point t, according to the subject-specific LMC. Finally, the subject-specific data-matrices Y i were conca- Twenty data sets Y were generated for each cell of the design. In total, 3 (number of states) × 2 (number of factors) × 8 (between-state difference) × 3 (transition frequency between states) × 3 (number of subjects) × 3 (number of observations per subject and state) × 2 (unique variance) × 20 (replicates) = 51,840 simulated data matrices were generated. The data were generated in the open-source program R (R Core Team, 2002) and communicated to LG (Vermunt & Magidson, 2016) for analysis. LG syntaxes (for details and an example, see Appendix B) were used to analyze the data with the correct number of states and factors per state. The average time to estimate a model was 85 seconds on an i5 processor with 8-GB RAM. Model selection was evaluated for a subset of the conditions (indicated by "*") and for five replications per condition, that is, for 80 data sets. The data sets were analyzed with LMFA models with the number of states equal to K − 1, K, and K + 1 states, and the number of factors within the states equal to F k − 1, F k , and F k + 1 and allowed to differ between the states, resulting in 19 models when K = 2 and 46 when K = 4. 8 Tucker's (1951) congruence coefficient between column vectors x and y is defined as: φ xy ¼ x 0 y ffiffiffiffi x 0 x p ffiffiffiffi y 0 y p . 9 We had no intention to evaluate the recovery of the initial state probabilities because more than a few subjects are required to validly estimate the distribution of initial starting states (Vermunt & Magidson, 2016). By sampling the initial state from a Bernoulli/multinomial distribution, some randomness in the initial states was obtained.

Sensitivity to local maxima
The estimation procedure, described in Appendix A, may result in a locally maximal solution, that is, the best solution may have a log L value that is smaller than the one of the global ML solutions. The multistart procedure (described in Section "Model estimation") increases the chance to find aglobal ML solution, and in the simulation study-where the global maximum is unknown due to violations of FA assumptions, sampling fluctuations and residuals-we can compare the best solution of the multistart procedure to an approximation (or proxy) of the global ML solution, which we obtain by providing the model estimation with the true parameter values as starting values. A solution is then a local maximum for sure when its logL value is smaller than the one from the proxy. To exclude mere calculation precision differences, we only considered negative differences with an absolute value higher than 0.001 as a local maximum. Accordingly, a local maximum was found for 947 out of 51,840 simulated data sets (1.83%), which mainly occurred when K = 4.

Goodness of state recovery
To investigate the recovery of the state sequence, the Adjusted Rand Index (ARI; Hubert & Arabie, 1985) was computed. The ARI quantifies the overlap between two partitions and is insensitive to permutations of the state labels. It ranges from 0 when the overlap is at chance level and 1 when partitions are identical. In general, the recovery of states was moderate to good (Steinley, 2004) with a mean ARI-value of 0.78 (SD = 0.28).
Except for the number of states, all manipulated factors had a large influence on the ARI (Table 2). In line with our expectations, the recovery improved with a lower number of factors (b), a greater between-state difference (c), a lower frequency of change (d), a higher number of subjects (e), a higher number of observations per subject and state (f), and lower unique variances (g). Figure 2 shows these effects, yet averaged across the number of factors and states for conciseness. A higher total within-state sample size was especially important for the state recovery in the high unique variance condition when combined with the low and no loading-difference conditions. In contrast, for a low unique variance and a medium loading difference between the states, the state recovery already stabilized at 400 observations. A lower frequency of transitions also further improved the state recovery, but it is most striking that even the most difficult conditions and lowest within-state sample size led to a perfect recovery when there was a medium difference in intercepts between the states.

Goodness of loading recovery
To examine the goodness of state-specific loading recovery (GOSL), we computed Tucker congruence coefficients φ between the true loading matrices and the estimated loading matrices and averaged across factors and states: To deal with the rotational freedom of the factors per state, we rotated the factors prior to calculating the congruence coefficient. 10 Specifically, Procrustes rotation was used to rotate the estimated toward the true loading matrices. To account for the permutational freedom of the state labels, the state permutation that maximizes the GOSL was retained. The manipulated conditions hardly affected the loading recovery. The overall mean GOSL was 0.98 (SD = 0.05), indicating an excellent recovery. There was a positive correlation between the ARI value and the GOSL (r s ¼ 0:45; p<0:001). Note that the loading recovery was often good despite a bad state recovery because quite similar (to even identical) loading matrices are mixed up.

Goodness of transition matrix recovery
To examine the transition matrix recovery, we calculated the mean absolute difference (MAD) between the true and estimated matrices (applying the best state permutation obtained from the loading recovery): The transition matrix recovery was good with an average MAD trans of 0.08 (SD = 0.10). Overall, the effects of the manipulated conditions were rather small (see Table 2).

Goodness of intercept recovery
To evaluate the recovery of the state-specific intercepts, we calculated the MAD between the true intercepts and the estimated ones.
The intercept recovery was moderate with an average MAD int of 0.12 (SD = 0.02). A higher between-state difference of loadings and intercepts (c), more subjects (e), more observations per subject and state (f), and a lower unique variance (g) improved the intercept recovery.
10 Note that rotation is not yet included in LG, which is why we take the estimated loadings and rotate them in the open-source program R (R Core Team, 2002).

Goodness of unique variance recovery
To examine the recovery of the state-specific unique variances, we calculated the MAD between the true and estimated unique variances.
The unique variance recovery was good with an average MAD uniq of 0.04 (SD = 0.06) and no notable differences across the manipulated conditions. More prominently, the MAD uniq is affected by Heywood cases (Van Driel, 1978), which pertain to "improper" factor solutions with at least one unique variance that is negative or zero. When a Heywood case occurs, LG fixed the unique variance(s) to a very small number. A Heywood case is considered to be diagnostic of underdetermined factors and/or insufficient sample size (McDonald & Krane, 1979;Rindskopf, 1984;Van Driel, 1978;Velicer & Fava, 1998). Heywood cases occurred for 5,877 of the estimated data matrices (12.19%), where 89% of the Heywood cases indeed occurred in the conditions with the smallest number of observations per subject and state and/or the smallest number of subjects.

Model selection
For 74 out of the 80 data sets (93%), the correct model was selected, when considering the converged models only, and for 78 (98%) the correct model was among the three best models. Five of the six incorrect selections occurred for the data sets with four states and four factors and low loading differences as well as low intercept differences. Specifically, one state too few was selected which is explained by the low state separability in these conditions. 11 We conclude that the BIC performs very well with regard to selecting the most appropriate model complexity for LMFA. Note. For the between-state difference condition, the combination of no loading difference and no intercept difference was not included because this would imply that the MM does not differ across states.

Conclusions and recommendations
To sum up, LMFA is promising for detecting MM changes over time and for exploring what the MM differences look like and for which subjects and which time-points the MMs are comparable. However, the performance of the new method in recovering the correct state sequence and the correct state-specific MMs largely depends on model characteristics (i.e., the number of factors, the MM differences between the states, the unique variances, and the frequency of state transitions), and the within-state sample size. First, larger MM differences between states benefit the recovery of the states. Especially, intercept differences increased the separability of the states, to the extent that the states were recovered perfectly even under difficult conditions. Besides intercept differences, less factors, less frequent transitions between the states, and lower unique variances improved the recovery. Finally, all else equal, the within-state sample size greatly enhanced the state recovery. In the following, we list recommendations for empirical practice: • When the above-mentioned model characteristics are unknown (or assumed to be unfavorable), aim for 2,000-4,000 observations in total (subjects × observations) to obtain reliable results for 2-4 states.
• When favorable model characteristics are assumedthat is, when between-state differences are expected to be pronounced (e.g., changes in intercept are expected), transitions to be infrequent (e.g., measurement occasions are closely spaced) and unique variances to be low (e.g., using reliable measurement instruments)-800-1,000 observations in total (subjects × observations) are probably enough to obtain reliable results for 2-4 states. • The number of states that can be reliably captured is bound by the total sample size, and when the sample size does not allow for the "true" number of states to be estimated, the obtained results will only convey part of the MM differences present in the data.
States that correspond to a few observations only will not be detected. • Including more subjects in the study might be more feasible than obtaining more measurements from each participants, but be aware that in practice, subjects do not necessarily switch between the same set of MMs. LMFA allows for this heterogeneity in MMs, since not all subjects need to go through all states. Nevertheless, it is important to keep this potential heterogeneity in mind since it would imply that additional subjects increase the Note that the x-axis values 500 and 1,000 occur twice due to two possible combinations (5 × 100 = 10 × 5 and 5 × 200 = 10 × 100, respectively). The dots largely overlap, and thus state recovery was not visibly influenced by the combination, and thus, it did not matter whether the within-state sample size was increased by additional subjects or additional observations per subject.
number of MMs and thus the number of states. In that case, the number of observations per subject is essential for the sample size per state. • The BIC is a suitable criterion to decide on the best number of states and factors. However, when differences between the states are subtle, researchers are advised to consider the three best models and choose one based on interpretability and stability.

APPLICATION
In order to apply LMFA to the empirical data sets introduced in Section "Data structure and motivating example," we first selected the number of states and factors by comparing the BIC among LMFA models with 1-3 states and 1-3 factors per state. Models with four states or factors in a state failed to converge suggesting sample size limitations or model misspecification. The model [3 3 2] (i.e., three states with 3, 3, and 2 factors) was selected as it had the lowest BIC among the converged models and was the most interpretable. To shed light on the MM differences between the three states, we first looked at the state-specific intercepts (Figure 3). The intercepts are higher for PA items than for NA items in all the states. However, the difference between the PA and NA item scores is most visible in State 3 (hereinafter "pleasure state"), intermediate in State 2 (hereinafter "neutral state") and least pronounced in State 1 (hereinafter "displeasure state"). Second, we investigated the standardized oblimin rotated loadings (Table 3). As a notable similarity, we see that the positive items are collected into (i.e., load strongly on) a PA factor in all the states, although the strength of the loadings slightly differs between the states. A striking difference is that the pleasure state has an NA factor, whereas both the neutral and displeasure states have a "distress" factor with high loadings of "upset," "anxious," and "nervous"-although they slightly differ in that "calm" has an additional strong negative loading in the displeasure state, whereas "gloomy" and "sluggish" load on the distress factor in the neutral state only. The neutral and displeasure states are further characterized by a third factor. In the neutral state, the third factor pertains to "serenity" with strong loadings of "calm" and "relaxed." In the displeasure state, it is a bipolar "drive" factor indicating that being "determined" (strong negative loading) is inversely related to feeling "sluggish," "bored," and "listless" (strong positive loadings). This additional drive factor in the displeasure state concurs with theoretical models of anhedonia (Berridge, Robinson, & Aldridge, 2009;Treadway & Zald, 2011), which divide anhedonia in three dimensions: consummatory anhedonia (no longer enjoying pleasurable activities), anticipatory anhedonia (no longer looking forward to pleasurable activities), and motivational anhedonia (no longer experiencing motivation to pursue pleasurable activities). The drive factor confirms that motivation is distinct from general PAwhen individuals are anhedonic. Finally, the state-specific unique variances are listed in Table 3. In general, these are highest in the displeasure state, indicating more emotion-specific variability than in the other states. This concurs with research showing that emotional complexity is associated with higher levels of depression (e.g., Grühn, Lumley, Diehl, & Labouvie-Vief, 2013). In sum, LMFA allowed for us to find substantively meaningful changes in the MM, both in the number and nature of the underlying factors. As an important similarity between the states, it is found that PA is captured in each of the three states. 12 To investigate what potentially triggers the latent states, we explored between-state differences in evening measures on physical discomfort (such as headache) and the occurrence and importance of positive and negative events. We focus on descriptive statistics only since hypothesis testing for MM differences is beyond the scope of this article. A question was, for example, "Think about the most unpleasant event you experienced since the last assessment: how unpleasant was this experience?" The scales ranged from 0 ("Not at all") to 100 ("Very much"). Interestingly, when participants were in the displeasure state, they had experienced more unpleasant Moreover, we inspected how the states related to the interventions (Table 4). Before the intervention (Phase 1), participants were more often in the displeasure or neutral state than in the pleasure state. After the first intervention (Phase 2), the participants in the two intervention groups (i.e., PLA and PLA & SkyD) were more often in the pleasure or neutral state than in the displeasure state. After receiving a second intervention (Phase 3), the distribution across the displeasure and pleasure state stayed about the same or the occurrence of the pleasure state increased. Participants who did not receive an intervention after the first month were distributed equally across the pleasure and displeasure states and were mostly in the neutral state during Phase 2. Notably, in Phase 3, the state membership for these participants-that is, after receiving their first (self-chosen) intervention-changed in that the pleasure state was now also more frequent than the displeasure state when participants chose PLA & SkyD while it was the opposite for those who chose PLA, perhaps because the more depressed and anhedonic participants were the ones refraining from a SkyD. Looking at the examples of individual transition plots including the individual transition probabilities (Figure 4), it is apparent that participants switched quite often between states, in each phase of the study. This is coherent with previous findings that individuals with anhedonia and depression are often found to experience strong fluctuations in emotional experiences (Heininga, Van Roekel, Ahles, Oldehinkel, & Mezulis, 2017;van Roekel et al., 2015). Some participants switched more often between the states than others, which may pertain not only to between-subject differences in general stability and experienced events but also to differences in how one reacts to the interventions.
Summarized, the interventions appear to have increased the pleasure state membership and reduced the displeasure state membership, while leaving membership of the neutral Note. To enhance interpretation, factor loadings were standardized by dividing the loadings by the state-specific standard deviations of the item. state largely unaffected. It is noteworthy that participants receiving no intervention after the first month also slightly moved toward higher pleasure state membership and a lower displeasure state membership at that point in time. It appears that daily reflections on ones emotions also relieve anhedonia to a certain degree, which was already found in an intervention study using ESM in depressed patients (Kramer et al., 2014). Although these findings are merely exploratory and need to be validated in future research, we demonstrated that LMFA offers valuable insights to applied researchers.

DISCUSSION
In this article, we introduced latent Markov factor analysis (LMFA) for modeling measurement model (MM) changes that are expected to be prevalent in time-intensive longitudinal data such as experience sampling data. In this way, LMFA safeguards conclusions about changes in the measured constructs. MM changes may pertain to (potentially interesting) substantive changes or may signal the onset of response styles (RSs). Between-state differences in intercepts and loadings might suggest an extreme RS, whereas differences in intercepts only rather suggest an agreeing RS (Cheung & Rensvold, 2000). When one suspects a RS in a specific state, RS detection and correction (e.g., adding an agreeing RS factor; Billiet & McClendon, 2000;Watson, 1992) can be applied to that specific part of the data, rather than to the entire data set. Moreover, the subject-specific transition probabilities of LMFA capture, for example, to what extent each subject is likely to end up and to stay in an extreme RS state. Even when RSs are hard to distinguish, the fact that LMFA pinpoints MM changes-and thus the reliable and comparable parts of the data-is valuable in itself.
In the future, it would be interesting to go beyond the purely exploratory approach applied in this article. On the one hand, hypothesis testing to determine which parameters significantly differ between the states might be preferred over visually comparing the state-specific MMs. To this end, LG already provides the researchers with Wald test statistics when the rotational freedom of the state-specific factors is resolved by a minimal number of restricted loadings (e.g., Geminiani, Ceulemans, & De Roover, 2018). On the other hand, including explanatory variables (i.e., timeconstant or time-varying covariates such as personality traits or social contexts) in the model would allow to evaluate whether they significantly predict the state memberships and the transition probabilities.
Moreover, LMFA assumes normally distributed, continuous variables. However, categorical Likert-type scale ratings are frequently used in psychology. Although these data can often be treated as continuous in case of at least five response categories (Dolan, 1994;Olsson, 1979), the ratings are often    LATENT MARKOV FACTOR ANALYSIS skewed, thus violating the assumption of normality. Additionally, even continuous data, such as our application data, might be skewed. Therefore, the robustness of the method to such violations should be examined and, if necessary, possible extensions to deal with nonnormality should be considered.
In addition, longitudinal data are often collected in varying time intervals, for example, when testing the long-term influence of interventions on affect by collecting data in waves. In that case, the transition probabilities can no longer be considered time homogeneous, and continuous time modeling is necessary (Crayen, Eid, Lischetzke, & Vermunt, 2017). Therefore, in future research, we will develop a continuoustime extension of LMFA.
Moreover, a limitation of the method is the assumption that factor scores are centered around 0 and have a variance of 1. When factor scores evolve over time but the MM stays the same, changes in factor scores would currently be detected as intercept changes and thus possibly lead to different states according to model selection. In future work, we will investigate necessary LMFA extensions to explicitly model changes in factor means within the states, for example, depending on time or another covariate.
Next to that, we might consider an extension of LMFA using exploratory dynamic FA (EDFA; Browne, 2001;Zhang, 2006) within the states, which models the auto-and cross-lagged correlations of the factors at consecutive time-points but comes with important challenges. First, accurately estimating autocorrelations would require more measurement occasions per subject per state (Ram, Brose, & Molenaar, 2012), which may be undesirable. Second, in EDFA, factor rotation is more intricate since the auto-and cross-lagged relations between factors need to be rotated toward specified target matrices (Browne, 2001;Zhang, 2006), again necessitating the a priori hypotheses about (changes in) the MM we want to avoid. The LMM in LMFA already partly captures autocorrelations by the states, and uncaptured auto-and cross-lagged correlations will not necessarily introduce bias in the estimates of the state-specific MMs (Baltagi, 2011).
Finally, LMFA is a complex model with many assumptions. Therefore, misspecifications can occur, and tools to locate local misfit are essential. Local fit measures have been developed for related methods (e.g., bivariate residuals measures for multilevel data; Nagelkerke, Oberski, & Vermunt, 2016), but they need to be tailored and extensively evaluated for LMFA.

FUNDING
Finally, we can calculate the joint probability of two successive states by applying Bayes' theorem: In the M-step, we maximize E logL c ð Þwith respect to b θ. To maximize, we set the derivatives with respect to the parameters to zero, making use of Lagrange multipliers whenever a constraint, such as P K k¼1 π k ¼ 1, needs to be satisfied. The resulting updates are as follows: The factor models for the state-specific covariance matrices Σ new are estimated by another maximization algorithm within each M-step. Specifically, each observation is weighted by the corresponding γðs itk Þ-value, resulting in K weighted data sets Y k . Fisher scoring (Lee & Jennrich, 1979;Vermunt & Magidson, 2016) is used to perform factor analysis on these weighted data.

Convergence
The convergence can be evaluated either with respect to the log L or with respect to the parameter estimates. LG applies the latter approach and assesses convergence by computing the following quantity: which is the sum of the absolute values of the relative parameter changes where r ¼ 1; . . . ; R refers to the parameters. In this article, the stopping criterion is δ < 1 Â 10 À8 . The estimation also stops if the change in the log L becomes smaller than 1 Â 10 À10 prior to reaching the stopping criterion above.

Problem of the Additional Simulation Study
To test whether lag-one autocorrelations of factors not captured by the within-state EFA analyses harm the performance of LMFA, we 574 VOGELSMEIER ET AL. manipulated factors (a)-(g) as in the main simulation study, leaving out the levels with an already inferior performance, and added an eighth factor (h) specifying the autocorrelation.

Design and Procedure
We crossed eight factors in a complete factorial design 13 : a. number of factors per state F k at two levels: 2, 4; b. number of states K at three levels: 2, 4*; c. between state difference at two levels: medium loading difference and low intercept difference, medium loading difference and medium intercept difference*; d. unique variance e: fixed at 0.2, 0.4; e. frequency of transitions between the states at two levels: frequent, infrequent*; f. number of subjects N at three levels: 2, 10*; g. number of observations per subject per state T ik at three levels: 50, 100, 200; h. autocorrelation ϕ at three levels: 0, 0.3, 0.7* The data were generated by means of the orthogonal dynamic factor model which implies that, at time-point t, the factors are uncorrelated with one another but a factor's scores at time-point t are dependent on its scores at time-point t − 1. Specifically, they are autocorrelated by the coefficient ϕ as follows: where ε it is a subject-specific F k × 1 vector of noise at time-point t which is assumed to be multivariate normally distributed with zero mean and the identity matrix as covariance matrix (~MVN 0; I ð ÞÞ. Thus, to generate the subject-specific data sets Y i , first, the e it and ε it vectors were sampled for each observation. Subsequently, we created the autocorrelated factor score vectors f it by applying a recursive filter (Hamilton, 1994). This filter sets the first noise element as the first factor score and computes the remaining factor scores as in Equation (C1). The resulting factor scores were multiplied by ffiffiffiffiffiffiffiffiffiffiffiffiffi 1 À ϕ 2 q to retain an expected variance of 1 (De Roover, Timmerman, Van Diest, Onghena, & Ceulemans, 2014). Finally, the data-matrices Y i were again merged into one data set Y . Note that, for the strength of the autocorrelation (h), we used the values suggested by Cabrieto, Tuerlinckx, Kuppens, Grassmann, and Ceulemans (2017).
To check how the manipulation played out, we calculated the average autocorrelation across the data sets for each of the three conditions: they amounted to 0.05, 0.29, and 0.68. For each cell of the factorial design, 20 data sets Y were generated as described above. In total 2 (number of states) × 2 (number of factors) × 2 (between-state difference) × 2 (transition frequency between states) × 2 (number of subjects) × 3 (number of observations per subject and state) × 2 (unique variance) × 3 (autocorrelation) × 20 (replicates) = 5760 simulated data sets were generated. As in Simulation Study 1, the data were generated in R and analyzed in LG with the same settings and the correct number of states and factors per state.

Results
In general, the recovery was largely unaffected by the autocorrelation conditions (h). Specifically, the recovery of the transition matrices and unique variances was not affected − MAD trans = 0.05 and MAD uniq = 0.03 for -whereas the recovery of the states, loadings, and intercepts was only slightly affected-ARI ¼ :91 for ϕ ¼ 0 and ϕ ¼ :3 and :90 for ϕ ¼ :7; GOSL ¼ :99 for ϕ ¼ 0; ϕ ¼ :3 and :98 for ¼ :7; MAD int ¼ :07 for ϕ ¼ 0; :08 for ϕ ¼ :3 and :11 for ϕ ¼ :7. Note that the mild decrease in intercept recovery with an increased autocorrelation is merely a consequence of the higher variance of the estimated intercepts, since they capture part of the autocorrelation and thus vary more around the population values.