Evaluating the quality of survey and administrative data with generalized multitrait-multimethod models

Administrative register data are increasingly important in statistics, but, like other types of data, may contain measurement errors. To prevent such errors from invalidating analyses of scientific interest, it is therefore essential to estimate the extent of measurement errors in administrative data. Currently, however, most approaches to evaluate such errors involve either prohibitively expensive audits or comparison with a survey that is assumed perfect. We introduce the "generalized multitrait-multimethod" (GMTMM) model, which can be seen as a general framework for evaluating the quality of administrative and survey data simultaneously. This framework allows both survey and register to contain random and systematic measurement errors. Moreover, it accommodates common features of administrative data such as discreteness, nonlinearity, and nonnormality, improving on similar existing models. The use of the GMTMM model is demonstrated by application to linked survey-register data from the German Federal Employment Agency on income from and duration of employment, and a simulation study evaluates the estimates obtained.

KEY WORDS: Measurement error, Latent variable models, Official statistics, Register data, Reliability


Introduction
Register data and administrative records play an increasingly important role in statistics (Wallgren and Wallgren 2007) and policy (see, e.g., the Commission on Evidence-based Policymaking, https://cep.gov/), and several authors recommend and predict the increased use of "big data" (Entwisle and Elias 2013; Podesta 2014), including administrative register data (Japec et al. 2015). Uses to date include studies of how agricultural households affect land changes (Rindfuss et al. 2004), voter turnout (Ansolabehere and Hersh 2012), and how people's numerical ability relates to mortgage default (Gerardi, Goette, and Meier 2013). However, there is evidence that register data may contain considerable measurement errors (Groen 2012). For example, Bakker (2012, p. 15) estimated that 24% of the variance in Dutch official hourly wage records was random measurement error, and Ladouceur et al. (2007, p. 275) suggested that 20% to 30% of osteoarthritis cases are not registered in Quebec hospital administrative records, causing bias in prevalence estimates. The measurement error present in administrative records can severely bias and invalidate research results (Carroll et al. 2006; Saris and Gallhofer 2007; Vermunt 2010). It is therefore essential to evaluate the extent of measurement error in register data. (We use the terms "register data" and "administrative data" synonymously to avoid repetition.) The difficulty in studying error in register and administrative data, however, is that there is often no "gold standard" measure. Some authors have suggested linking administrative registers to a survey, assuming the survey contains no measurement error (e.g., Yucel and Zaslavsky 2005). But measurement error in survey data is widespread (Hansen, Hurwitz, and Bershad 1961; Hansen, Hurwitz, and Pritzker 1964; Fellegi 1964; Andrews 1984; Alwin 2007; Saris and Gallhofer 2007; Biemer 2011), and is in fact often measured by taking administrative records as the "gold standard" (e.g., Kapteyn and Ypma 2007; Kreuter, Müller, and Trappmann 2010; Sakshaug, Yan, and Tourangeau 2010; Kim and Tamborini 2014). Thus, we often have two data sources, both measured with error, and we are interested in estimating the error in both.
Very few studies have attempted to estimate measurement error in both survey and administrative data simultaneously. Nordberg, Rendtel, and Basic (2004) discussed a longitudinal latent Markov model of measurement error in income, but again assumed the administrative register to be perfect in cross-sectional data; Pavlopoulos and Vermunt (2015) applied a similar latent Markov model to unemployment data; and Bakker (2012) and Scholtus, Bakker, and Van Delden (2015) estimated measurement error using linear factor analysis. However, the models used in these studies have several drawbacks when applied to administrative register data. First, true values of the variables of interest are often censored, zero-inflated, gamma, count, or nominal, and thus models that assume normally distributed true values are not appropriate. For example, income is usually zero-inflated and occupation is nominal. Second, the measurement error process in registers is likely to lead to nonnormal and nonlinear errors, yet many models used to study measurement error assume linear and homoscedastic errors. For example, top-coding of income causes nonlinear method effects (Gottschalk and Huynh 2010), and it is often thought that low earners over-report while high earners under-report, yielding "mean-reverting" random errors (e.g., Kim and Tamborini 2014). Third, the measurement quality of administrative data often differs over observations, yielding a mixture of measurement models. For example, the records may be obtained from a mixture of sources (Wallgren and Wallgren 2007), such as both employer statements and employee self-reports, or the variable may be more ambiguously defined for some cases than for others: the income of day laborers is an example. Earlier approaches have not accounted for such heterogeneity. Currently, then, there is no generally applicable method to evaluate the extent of measurement error in register and survey data.
Our contributions to the literature are threefold. First, we present a framework for simultaneously estimating measurement error in register and survey data that addresses the shortcomings of earlier methods. Second, we evaluate the finite sample performance of this model, as well as its robustness to misspecification of key assumptions. Third, we apply this framework to an important official register from the German Federal Employment Agency. Section 2 introduces the modeling framework used to estimate the extent of measurement error in survey and register data simultaneously. Section 3 evaluates robustness of the model to misspecification, while Section 4 evaluates its finite sample performance. Section 5 applies the model to linked survey-register data on income from employment from the German Federal Employment Agency.

Measurement Error Estimation From Multiple Error-Prone Sources
Measurement error in surveys has been extensively studied, and is often thought to stem from response, coding, processing, and interviewer errors in the data collection process (see Groves and Lyberg 2010; Biemer et al. 2017). Differences across respondents in the size of these errors will emerge as random noise in the observed variables. Moreover, because different survey variables are usually reported by the same respondent, distinct variables tend to share common errors, a phenomenon known as "method effect" in the literature (Andrews 1984). Because surveys contain both random and correlated errors, Andrews (1984) adapted the "multitrait-multimethod" (MTMM) design (Campbell and Fiske 1959) to survey measurement error estimation. The MTMM design can be described as a within-person experimental design crossing "traits" of interest with measurement "methods." To estimate survey measurement error, Andrews and subsequent authors identified "traits" with survey questions, and "methods" with variations of these questions such as response scales (Saris and Gallhofer 2007). This approach has led to a large literature on MTMM modeling using confirmatory factor analysis (CFA) or structural equation modeling (SEM) to estimate the degree of random and systematic measurement error in survey data (e.g., Alwin 1973; Andrews 1984; Saris and Andrews 1991; Saris and Gallhofer 2007). Extensions for ordinal categorical data using the "ordinal factor analysis" model (Muthén 1983) have also been applied (Oberski, Saris, and Hagenaars 2008). Recently, Oberski, Hagenaars, and Saris (2015) introduced a latent class factor (Vermunt and Magidson 2004) MTMM model.

Register data errors have been studied extensively in the statistical data editing literature (De Waal, Pannekoek, and Scholtus 2011). In this field, the primary goal has been to impute values suspected to be erroneous based on contextual information (covariates). Approaches based on Fellegi and Holt (1976) impose tables of edits while making as few changes as possible and leaving the joint distribution intact (Winkler 1999). Multiple imputation and Bayesian approaches also aim to impute corrected values, but do so based on a model specifying priors on unlikely combinations, and can quantify uncertainty due to edits (Little and Smith 1987; Ghosh-Dastidar and Schafer 2003). Recently, Boeschoten, Oberski, and de Waal (2016) demonstrated how edit restrictions can be incorporated into a latent class model, merging the latent variable and model-based editing approaches.
In contrast with the goal of correcting records, the goal of estimating the extent of errors in registers has gained interest only recently. Registers are usually created through data entry and therefore also contain response, coding, and processing errors. However, in addition to these errors, administrative registers have been observed to contain errors that occur during the normal course of administration (Groen 2012). Among these register-specific errors are time lag, definition error, legally motivated ceiling effects, identification error, and harmonization error (Zhang 2012). Where registers are obtained from the same source, method effects may also occur (Bakker 2012). Moreover, the resulting relationship between true value and observed register value is often nonlinear, nonnormal, and differing over different administrative units.
The current methods for the estimation of the extent of measurement error in surveys and registers have important drawbacks, which we address in this study. For survey error estimation, the identification of "methods" with question design features implies that other sources of error aside from these specific design features are uncorrelated. For register error estimation, existing MTMM models lack the nonlinearity, nonnormality, and error process heterogeneity needed to realistically model measurement error in administrative registers. Furthermore, the statistical data editing methods, while useful for imputation, do not estimate the extent of measurement error present in a register. In the following section, we address these issues by presenting a novel generalization of the MTMM model.

The Generalized Multitrait-Multimethod Model
Our technique for simultaneously estimating measurement error in survey and administrative data builds on the multitrait-multimethod approach. Instead of identifying "methods" with survey question design, however, we consider the survey and register as "methods." Given a set of variables of interest ("traits") for which observed measurements exist in both the administrative data and a sample survey, our goal is to estimate the degree of measurement error in variables observed in both sources.
Let $y_{tm}$ denote an observed random variable measuring the $t$th trait using the $m$th method. In the application described here, $m$ will denote either the administrative or the survey measurement.
We use generalized latent variable models to formulate a measurement model for MTMM data from an administrative register and a survey that can account for nonclassical error processes, nonnormal distributions, and categorical data. Generalized latent variable models are built up from (i) linear predictors; (ii) Generalized Linear Model (GLM) links and exponential family distributions; and (iii) conditional independence relations. To account for possibly heterogenous error processes, we additionally include (iv) a set of mixture components that allow the linear predictors to vary over components.
The conditional independence relations we use result from the MTMM design and are common to all MTMM models, whereas the choice of links and distributions is flexible: for this reason we call our approach a "generalized multitrait-multimethod" (GMTMM) model. The flexibility in links allows us to model nonlinearities and heteroscedasticities in the error process, while the choice of distributions for the latent variables allows for nonnormality of the true values. Finally, the optional finite mixture components allow error processes to differ over units. For example, measures obtained from different administrative databases are likely to have different errors (see Litson et al. 2016, for an example of a continuous-data MTMM model with such mixture components). A mixture component in which no relationship exists between true score and observed score may also be useful in the presence of linkage error (Larsen and Rubin 2001; Lahiri and Larsen 2005; Kapteyn and Ypma 2007).
The main ideas behind the GMTMM model are:
• Observed survey and administrative register values are assumed to originate from a common underlying true value ("trait");
• The relationship between true and observed value is modeled as a GLM regression, allowing for considerable flexibility in nonlinearity and the distribution of measurement errors;
• Differential error processes across different types of units can be modeled as a function of unknown mixture components.

We now describe the GMTMM in terms of (i) the linear GLM predictors, (ii) the links and distributions, (iii) the model's conditional independencies, and (iv) the mixture option for heterogenous models.
(i) Linear predictors. For continuous observed data, linear predictors for the observed variables $y_{tm}$ are

$$\nu_{tm} = \tau_{tm} + \lambda_{tm}\,\eta_t + \gamma_{tm}\,\xi_m,$$

where, for identification purposes, the first loading of each trait factor $\eta_t$ and method factor $\xi_m$ is set to unity, $\lambda_{t1} = \gamma_{1m} = 1$. For categorical observed data, linear predictors for category $y_{tm} = k$ are

$$\nu_{ktm} = \tau_{ktm} + \lambda_{ktm}\,\eta_t + \gamma_{ktm}\,\xi_m,$$

where the first category can be chosen as a reference by setting $\tau_{1tm} = \lambda_{1tm} = \gamma_{1tm} = 0$ (e.g., Vermunt and Magidson 2013).

At times, "paradata" may be available that were captured during the process of survey and register data collection (Kreuter 2013). Examples for surveys include response times, behavior codes, and vocal pitch; for registers, paradata have not been widely studied, but might include the age of the record or the quality control budget of the department that produced it, if this differs across records. Where such data are informative about the measurement error process, they can be included as covariates in the linear predictor and allowed to interact with the latent "trait" variable. Denoting the paradata covariate as $z$, the linear predictor then becomes

$$\nu_{tm} = \tau_{tm} + \lambda_{tm}\,\eta_t + \gamma_{tm}\,\xi_m + \delta_{tm}\,z + \beta_{tm}\,z\,\eta_t,$$

allowing for both a shift ($\delta_{tm}$) and a different measurement relationship ($\beta_{tm}$) across values of the paradata covariate $z$. To simplify the discussion below, we will omit such covariates from the likelihood.
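As a minimal illustration of the continuous-data linear predictor, with and without a paradata covariate, the following Python sketch may help. The function name, argument names, and numeric values are hypothetical and chosen only for this example; they are not taken from the application.

```python
def linear_predictor(tau, lam, gamma, eta_t, xi_m, z=None, delta=0.0, beta=0.0):
    """Linear predictor for one observed variable y_tm.

    tau, lam, gamma : intercept, trait loading, and method loading
    eta_t, xi_m     : values of the latent trait and method factors
    z, delta, beta  : optional paradata covariate, its shift, and its
                      interaction with the trait (hypothetical names)
    """
    nu = tau + lam * eta_t + gamma * xi_m
    if z is not None:
        # Paradata shift the predictor and change the trait slope.
        nu += delta * z + beta * z * eta_t
    return nu


# Without paradata: intercept + trait part + method part.
print(linear_predictor(0.5, 1.0, 1.0, eta_t=2.0, xi_m=-0.3))

# With a paradata covariate z = 1 that weakens the trait relationship.
print(linear_predictor(0.5, 1.0, 1.0, eta_t=2.0, xi_m=-0.3,
                       z=1.0, delta=0.2, beta=-0.25))
```

In this sketch a negative interaction coefficient means that records flagged by the paradata covariate track the true value less closely.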
(ii) Links and distributions. Each of the observed and latent variables is assigned a distributional "family," and a link function $g(\cdot)$ connecting the linear predictor to the expectation of the response $y_{tm}$ is chosen, depending on whether the observed variable is continuous or categorical. Different observed variables may be assigned different link functions and distributions. We denote the choice of the conditional distribution of the observed responses given the latent variables as $f_y := p(y_{tm} \mid \eta_t, \xi_m)$ with parameter vector $\theta_y$. Similarly, the multivariate distribution of the latent "true score" variables is denoted $f_\eta$ with parameters $\theta_\eta$, and the distribution of the latent "method" variables $f_\xi$ with parameters $\theta_\xi$. Depending on whether the variables to which they refer are continuous or categorical, $f_y$, $f_\xi$, and $f_\eta$ may be probability density or probability mass functions.
A possible extension, which we do not consider here, is to condition the true score distribution $f_\eta$ on covariates that define edit restrictions (see Boeschoten, Oberski, and de Waal 2016, for an application of this idea to latent class models). For example, if $\eta$ represents "married" (1) versus "not married" (0), $f_\eta$ could be chosen as a logistic regression on a binary covariate "age < 16" (1) versus "age ≥ 16" (0), possibly with a fixed, strongly negative regression coefficient. This would then impose the edit restriction that married persons must be age 16 or above. Estimating the coefficient from data would impose a "soft edit" (De Waal, Pannekoek, and Scholtus 2012).
(iii) Conditional independencies. The specification of the homogenous generalized latent variable model is completed with assumptions of conditional independence that are necessary for identification of the model parameters from observables. These assumptions mirror those of the linear MTMM model.

Assumption 1. The observed variable $y_{tm}$ is conditionally independent of all other observed variables given its trait factor $\eta_t$ and method factor $\xi_m$.

Assumption 1 implies that the joint conditional distribution of observed given latent variables can be factored into the univariate conditional distributions, that is,

$$p(y \mid \eta, \xi) = \prod_{t} \prod_{m} p(y_{tm} \mid \eta_t, \xi_m).$$

Assumption 2. The latent method factors $\xi$ are mutually independent and independent of the trait variables $\eta$.

Assumption 2 implies that the latent variable joint distribution can be factored into

$$p(\eta, \xi) = f_\eta(\eta) \prod_{m} f_\xi(\xi_m).$$

Note that there may still be dependencies among the latent trait variables in the vector $\eta$.

Homogenous GMTMM likelihood. When the error process is thought to be homogenous, the marginal likelihood $p(y \mid \theta)$ is

$$p(y \mid \theta) = \int\!\!\int f_\eta(\eta) \prod_{m} f_\xi(\xi_m) \prod_{t} \prod_{m} f_y(y_{tm} \mid \eta_t, \xi_m)\, d\eta\, d\xi, \tag{7}$$

where Assumptions 1 and 2 are used and the integral is defined as a sum for discrete latent variable distributions.
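For discrete latent variables, the homogenous marginal likelihood reduces to a finite sum over latent configurations. The following Python sketch illustrates this for a fully binary model with three traits and two methods. It rests on several simplifying assumptions that are not part of the text: effect-coded latent categories (−1, +1), a logit link, a single intercept and a single pair of loadings shared by all observed variables, one common trait-trait association, and uniform method factors.

```python
import itertools
import math

T, M = 3, 2                  # three traits, two methods
SCORES = (-1.0, 1.0)         # assumed effect-coded latent categories


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def trait_probs(phi):
    """Joint pmf f_eta of the trait vector under a log-linear model with a
    common pairwise association phi (an assumption of this sketch)."""
    weights = {}
    for eta in itertools.product(SCORES, repeat=T):
        mu = sum(phi * eta[a] * eta[b]
                 for a in range(T) for b in range(a + 1, T))
        weights[eta] = math.exp(mu)
    total = sum(weights.values())
    return {eta: w / total for eta, w in weights.items()}


def marginal_likelihood(y, tau, lam, gamma, phi):
    """p(y | theta), summing over all latent trait and method values.

    y[t][m] is the binary observation of trait t by method m.
    """
    f_eta = trait_probs(phi)
    f_xi = 1.0 / len(SCORES)     # independent, uniform method factors
    total = 0.0
    for eta, p_eta in f_eta.items():
        for xi in itertools.product(SCORES, repeat=M):
            cond = 1.0           # Assumption 1: product over all y_tm
            for t in range(T):
                for m in range(M):
                    p1 = expit(tau + lam * eta[t] + gamma * xi[m])
                    cond *= p1 if y[t][m] == 1 else 1.0 - p1
            # Assumption 2: latent distribution factors as f_eta * prod f_xi
            total += p_eta * (f_xi ** M) * cond
    return total


y_obs = [[1, 1], [0, 0], [1, 0]]
print(marginal_likelihood(y_obs, tau=0.0, lam=2.0, gamma=0.5, phi=0.3))
```

Because the sum runs over every latent configuration, the probabilities of all 64 response patterns add up to one, which gives a quick sanity check on any implementation of this kind.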
(iv) Heterogenous error processes. For heterogenous error processes, in which a mixture of error processes is thought to be present, define $p(y \mid S, \theta_s)$ as the component-specific marginal likelihood, with component-specific parameters $\theta_s$. Typically, it is the measurement parameters that are thought to differ over components, that is, the linear predictors are given an additional subscript, $\nu_{tm,s}$.
An example of heterogenous error results from linkage error: similar to the regression model suggested by Lahiri and Larsen (2005), in mislinked records the register would be unrelated to the true value of the survey respondent, which can be modeled by specifying a two-component mixture with $\lambda_{tm,2} = 0$. Another example occurs when administrative delays occur for some units but not others, so that $\tau_{tm,s}$, $\gamma_{tm,s}$, and $\lambda_{tm,s}$ differ. The number of mixture components may be selected using standard methods such as comparison of the Bayesian information criterion (BIC) or Akaike information criterion (AIC; McLachlan and Peel 2004).
To model such differences, we introduce an unobserved discrete variable $S$ with categories equal to the number of components, so that the marginal likelihood of the observed data becomes

$$p(y \mid \theta) = \sum_{s} p(S = s)\, p(y \mid S = s, \theta_s). \tag{8}$$

Since the mixture proportions $p(S)$ are typically unknown, this implies an additional $|S| - 1$ free parameters in $\theta$ to be estimated, because the proportions sum to one.
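The mixture likelihood can be sketched in a few lines of Python. The example below evaluates it on the log scale for numerical stability and applies it to a toy linkage-error setting. For brevity the component likelihoods are evaluated conditional on a fixed trait value rather than marginalized over it, and all function names and numbers are hypothetical.

```python
import math


def normal_logpdf(y, mean, sd):
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (y - mean) ** 2 / (2 * sd ** 2)


def mixture_log_likelihood(log_liks, proportions):
    """log p(y) = log sum_s p(S = s) * p(y | S = s), via log-sum-exp."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("mixture proportions must sum to one")
    terms = [math.log(p) + ll
             for p, ll in zip(proportions, log_liks) if p > 0]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Toy linkage-error mixture for one register value y and trait value eta:
# component 1 (correct link) has trait loading 1, component 2 (mislink)
# has trait loading 0, so the register value is unrelated to the trait.
y, eta, tau, sd = 3.1, 3.0, 0.0, 0.5
log_liks = [
    normal_logpdf(y, tau + 1.0 * eta, sd),   # correctly linked
    normal_logpdf(y, tau + 0.0 * eta, sd),   # mislinked
]
print(mixture_log_likelihood(log_liks, proportions=[0.95, 0.05]))
```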

Identification and Estimation of GMTMM Models
Consistent estimates of the parameters θ can be obtained from observations on three traits from linked survey-register data when these parameters are identifiable. The appendix shows that all parameters of the homogenous GMTMM model are locally identifiable almost everywhere in the parameter space (see Allman, Matias, and Rhodes 2009) under mild assumptions, given linked survey and register measures of three variables ("traits"). For heterogenous error processes, identifiability remains an open problem analytically. Even when local identifiability does occur almost everywhere, in practical applications the information matrix can be observed to approach singularity ("empirical underidentification"; Kenny and Kashy 1992). Specifically, this occurs when the maximum of the likelihood lies close to a point that violates one of the assumptions outlined in the appendix. Considering this issue and pending analytical results for heterogenous GMTMM models, we suggest (1) verifying empirical identification on the data at the converged solution by examining the rank of the information matrix numerically, and (2) following Forcina (2008), verifying invertibility of the information matrix numerically at a large number of random parameter values.
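The second check can be sketched numerically as follows. For categorical data the expected information matrix has the same rank as the Jacobian of the response-pattern probabilities with respect to the parameters, so the sketch computes that Jacobian by finite differences at random parameter values and counts how often it has full column rank. A simple two-class latent class model stands in for a GMTMM model here; the model, the parameter ranges, and the tolerances are assumptions of this illustration.

```python
import itertools
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def pattern_probs(theta, n_items):
    """Toy two-class latent class model (a stand-in for a GMTMM model).
    theta = [class logit, item logits in class 0, item logits in class 1]."""
    w = expit(theta[0])
    probs = []
    for y in itertools.product((0, 1), repeat=n_items):
        p = 0.0
        for c, pc in enumerate((1.0 - w, w)):
            lik = pc
            for j, yj in enumerate(y):
                pj = expit(theta[1 + c * n_items + j])
                lik *= pj if yj else 1.0 - pj
            p += lik
        probs.append(p)
    return probs


def jacobian(theta, n_items, h=1e-6):
    """Central finite-difference Jacobian (rows: patterns, cols: params)."""
    cols = []
    for i in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[i] += h
        dn[i] -= h
        pu, pd = pattern_probs(up, n_items), pattern_probs(dn, n_items)
        cols.append([(a - b) / (2 * h) for a, b in zip(pu, pd)])
    return [list(row) for row in zip(*cols)]


def numerical_rank(matrix, tol=1e-8):
    """Rank by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    rank, n_rows, n_cols = 0, len(a), len(a[0])
    for col in range(n_cols):
        pivot = max(range(rank, n_rows),
                    key=lambda r: abs(a[r][col]), default=None)
        if pivot is None or abs(a[pivot][col]) < tol:
            continue
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, n_rows):
            f = a[r][col] / a[rank][col]
            for c in range(col, n_cols):
                a[r][c] -= f * a[rank][c]
        rank += 1
    return rank


def share_full_rank(n_items, n_draws=50, seed=1):
    """Share of random parameter draws with a full-column-rank Jacobian."""
    rng = random.Random(seed)
    n_par = 1 + 2 * n_items
    hits = 0
    for _ in range(n_draws):
        theta = [rng.uniform(-2, 2) for _ in range(n_par)]
        hits += numerical_rank(jacobian(theta, n_items)) == n_par
    return hits / n_draws


print(share_full_rank(3))   # three indicators: 7 parameters, 8 patterns
print(share_full_rank(2))   # two indicators: 5 parameters, only 4 patterns
```

With two indicators the Jacobian has fewer rows than parameters, so it can never reach full column rank; this is the kind of structural underidentification the check is meant to flag.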
Standard estimation procedures for generalized latent variable models can be used to estimate the GMTMM model (e.g., Skrondal and Rabe-Hesketh 2004, chap. 6). The most general is to use standard optimization algorithms to maximize the marginal likelihood from Equation (7) or (8). For certain models, such as latent class MTMM models, direct maximization of the marginal likelihood may become unstable. In such cases, an expectation-maximization (EM) algorithm (McLachlan and Krishnan 2007) or Markov chain Monte Carlo (MCMC) sampling of latent variables and parameters can be used by considering the latent variables ξ, η, and S to be missing data.
Certain special cases of GMTMM models, including the examples given above, can be estimated using standard software for latent variable modeling such as Latent Gold (Vermunt and Magidson 2013) or GLLAMM (Rabe-Hesketh, Skrondal, and Pickles 2004), which implement this estimation strategy. Moreover, specialized efficient estimation procedures already exist for certain special cases of the GMTMM model. For example, the linear factor analysis MTMM model can be formulated as a covariance structure model with a closed-form marginal likelihood (Bollen 1989). The ordinal factor analysis (cumulative probit) model can be similarly dealt with by first computing polychoric correlation coefficients (Muthén 1983). Such models can be fit using standard software for structural equation modeling. Other possible combinations of choices may require specialized software, or can be implemented in general-purpose software such as Stan (Carpenter et al. 2017). An example of a GMTMM model that requires such additional effort is provided in the online supplement with accompanying Stan code.
This section introduced a generalized multitrait-multimethod model that can be used to estimate measurement error when at least two separate measures of at least three different phenomena are available. The GMTMM model can deal with nonnormality of true values, nonlinearity and heteroscedasticity of errors, and the existence of unknown groups that exhibit differential measurement error. It is therefore applicable to estimating measurement error in administrative register data and surveys simultaneously. It is also more generally applicable to situations where such error structures are thought to exist in multiple error-prone sources.

Asymptotic Robustness to Misspecification
The GMTMM model relies on assumptions of independence. When these assumptions are violated, an important question is the extent to which such violations affect the estimates. This section therefore studies the asymptotic sensitivity of GMTMM model estimates to misspecifications of the independence assumptions.
To study robustness, we examine the asymptotic bias of a misspecified GMTMM model. Two key assumptions are examined: (1) the assumption that traits and methods are marginally independent; and (2) the assumption that methods are mutually marginally independent. We study true models, $M$, say, that violate these assumptions to different degrees, and examine how much asymptotic bias occurs under misspecified models, $\tilde{M}$, that make both assumptions. Following Kuha and Moustaki (2015), this asymptotic bias can be obtained by maximizing the expectation of the misspecified likelihood, $p_{\tilde{M}}$, under the correct model $M$,

$$\theta^{*} = \arg\max_{\theta}\; E_{M}\!\left[\log p_{\tilde{M}}(y \mid \theta)\right]. \tag{9}$$

We accomplish this by studying a fully categorical three-trait, two-method GMTMM model, in which all traits, methods, and observed variables are binary variables. This specification is convenient for two reasons. First, as argued by Allman, Matias, and Rhodes (2009), properties of discrete latent variable models will generalize approximately to continuous-data models. Second, the binary formulation of the model makes it feasible to maximize Equation (9) by enumerating all possible response patterns of $y$ and their expectation under the true model. After calculating these for each condition, a misspecified GMTMM model is fit using expectation-maximization to the true-model expected proportions (see also Rotnitzky and Wypij 1994; Heagerty and Kurland 2001; Biemer 2011, who use a similar approach to study sensitivity to misspecification in other types of models).
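The principle of enumerating response patterns and maximizing the expected misspecified log-likelihood can be illustrated on a deliberately small toy problem. The Python sketch below does not implement the GMTMM study itself: it has no latent variables and uses a grid search in place of EM. The true model is a pair of dependent binary items, and the misspecified model wrongly treats them as independent with one shared success probability. All names and values are hypothetical.

```python
import itertools
import math


def true_pattern_probs(psi, shift=0.5):
    """True model M (toy): two binary items with log-linear dependence psi."""
    weights = {}
    for y in itertools.product((0, 1), repeat=2):
        s1, s2 = 2 * y[0] - 1, 2 * y[1] - 1        # effect coding
        weights[y] = math.exp(psi * s1 * s2 + shift * s1)
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


def expected_loglik(pi, true_probs):
    """E_M[log p(y | pi)] under a misspecified independence model with a
    single success probability pi for both items."""
    return sum(p * sum(math.log(pi if yj else 1.0 - pi) for yj in y)
               for y, p in true_probs.items())


def pseudo_true_value(true_probs, grid_size=999):
    """Grid maximizer of the expected misspecified log-likelihood."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return max(grid, key=lambda pi: expected_loglik(pi, true_probs))


probs = true_pattern_probs(psi=0.8)
print(pseudo_true_value(probs))
```

In this toy case the maximizer coincides, up to grid resolution, with the average of the two true marginal success probabilities. Comparing such a limiting value with the corresponding true parameter is what defines the asymptotic bias.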
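To study sensitivity of the parameter estimates to misspecification of the trait-method dependency $\psi^{(tm)}$ and the method-method dependency $\psi^{(mm)}$, we vary the following factors:
• The size of the log-linear trait-method dependencies, $\psi^{(tm)}$ (seven levels);
• The size of the log-linear method-method dependencies, $\psi^{(mm)}$ (seven levels);
• The size of the log-linear trait slope: $\lambda_{tm} \in \{1, 2, 4\}$;
• The size of the log-linear method slope: $\gamma_{tm} \in \{0.5, 1.0\}$.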
The parameter values of the true models $M$ were chosen to correspond to a very wide range of plausible situations. For example, the setting $\lambda_{tm} = 1$ corresponds to an approximate reliability (Pearson correlation between observed variable and trait) of 0.50, while the highest setting $\lambda_{tm} = 4$ corresponds to a reliability of about 0.96. The reliability therefore varies from terrible to excellent. Similarly, the method effect expressed as a Pearson correlation varies between zero and 0.2, which was indicated to be a commonly encountered situation in continuous data by Saris and Gallhofer (2007). For trait-method and method-method dependencies, less guidance is available, but it appears plausible that such dependencies, when present, would not be much stronger than the dependencies among the substantive latent variables. The chosen range (−1, 1) can maximally shift a probability by about 0.5, which appears to be a reasonably strong dependency.
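The correspondence between the log-linear trait slope and the approximate reliability can be checked with a short calculation. The Python sketch below assumes a binary trait that is effect coded (−1, +1) with equal class sizes, a zero intercept, and no method effect; these are assumptions made for illustration and may differ from the exact setup used to obtain the figures quoted above.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def trait_indicator_correlation(lam):
    """Pearson correlation between a balanced binary trait eta in {-1, +1}
    and a binary indicator with logit P(y = 1 | eta) = lam * eta."""
    p_hi, p_lo = expit(lam), expit(-lam)
    # Both variables are balanced under these assumptions, so the
    # correlation equals the difference in success probabilities.
    return p_hi - p_lo


for lam in (1, 2, 4):
    print(lam, round(trait_indicator_correlation(lam), 3))
```

Under these assumptions the correlations are about 0.46, 0.76, and 0.96 for slopes of 1, 2, and 4, in line with the approximate reliabilities of 0.50 and 0.96 quoted for the lowest and highest settings.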
Crossing all factors yields a 7 × 7 × 3 × 2 full factorial design with 294 conditions. For each condition, we generate the expected proportions under the true model, $p_{M}(y \mid \theta)$, and maximize the likelihood of the misspecified model $\tilde{M}$, which incorrectly assumes trait-method and method-method independence, yielding biased parameter values. The asymptotic bias is then defined as the difference between these values and the true values. The outcomes of interest are asymptotic bias in (1) the trait slopes $\lambda_{tm}$, (2) the method slopes $\gamma_{tm}$, and (3) the trait-trait dependencies $\phi_{tt'}$.
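The full factorial design can be enumerated directly. In the Python sketch below, the seven levels of each dependency factor are assumed to be equally spaced between −1 and 1; only the range and the number of levels are given in the text, so the spacing is a guess made for illustration.

```python
import itertools

# Assumed levels: only the range (-1, 1) and the count (seven) are known
# for the two dependency factors; equal spacing is a guess.
psi_levels = [round(-1 + i / 3, 3) for i in range(7)]
lambda_levels = (1, 2, 4)       # trait slopes, as stated in the text
gamma_levels = (0.5, 1.0)       # method slopes, as stated in the text

design = list(itertools.product(psi_levels, psi_levels,
                                lambda_levels, gamma_levels))
print(len(design))              # 294 conditions

# Conditions in which the fitted model is correctly specified, i.e. both
# dependencies are zero, should show no asymptotic bias.
correct = [c for c in design if c[0] == 0 and c[1] == 0]
print(len(correct))             # 6 conditions
```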
The full tables of results from all 294 conditions are available in the online appendix. An ANOVA of the bias in each of the three outcomes of interest is shown in Table 1. This table shows mean squares for the bias in the three outcomes of interest, using a model with main effects as well as second-order interactions between the misspecification and loading size factors. This summary demonstrates that the largest deviations in the asymptotic bias are accounted for by the trait-method dependency $\psi^{(tm)}$ and trait loading $\lambda_{tm}$. GMTMM estimates appear to be most sensitive to these factors and their interaction.

Table 1. ANOVA with main effects and second-order interactions, reporting the mean square for bias in each outcome. The trait-method dependency $\psi^{(tm)}$ and trait loading $\lambda_{tm}$ account for the largest mean square.
To illustrate the size of the asymptotic bias and demonstrate how misspecification relates to it, Figure 1 plots the true trait-method dependency $\psi^{(tm)}$ against the asymptotic bias for the three main outcomes. The columns of this figure correspond to the three outcomes of interest: respectively, asymptotic bias in the trait-trait dependency, the method slopes, and the trait slopes. The rows correspond to conditions with different values of the trait slope. Each plot in the figure shows a misspecification on the horizontal axis, and the incurred asymptotic bias on the vertical axis. Boxplots show the distribution of the bias, while the solid lines connect the median biases encountered.

Figure 1 demonstrates the effect of incorrectly assuming traits and methods to be independent. As expected, at the points corresponding to correctly specified models (intersections of dotted lines), no bias occurs. However, as misspecification increases in either direction, the parameter estimates of a misspecified model will incur some bias. The figure shows that the trait-trait dependency estimates (first column) are most strongly affected by this misspecification. This bias is attenuated as the measurement becomes better (rows), but is still considerable at the most extreme values of trait-method dependency. The method slopes (second column) are less strongly affected, and appear strongly biased only at the extremes. Finally, the trait slopes, while also affected by this misspecification, do not appear highly sensitive to it.

Figure 2 shows the effect of incorrectly assuming methods to be mutually independent: it plots the same results as Figure 1 as a function of the true method-method dependence, $\psi^{(mm)}$. The flatness of the median bias lines of this figure relative to those in Figure 1 shows that the estimates are rather robust to misspecification of the dependency structure among methods. As the measurement improves (lower rows), this robustness also increases.
The robustness study performed here demonstrates that GMTMM models may be most sensitive to the assumption of zero trait-method dependency. However, serious biases were only observed at relatively large trait-method dependencies. GMTMM parameter estimates appear to be relatively robust to the assumption of zero method-method dependency. Finally, the parameters of primary interest, the method and trait loadings, were less affected by either type of misspecification than the parameters specifying the joint distribution of true values ("traits").

Simulation
We demonstrate some key finite sample properties of the maximum likelihood estimates of GMTMM model parameters using a simulation study. Since there are many possible GMTMM models that fall within this framework, we choose a model and parameter values based on our application to a linked survey-register dataset obtained from the German Federal Employment Agency, and summarize bias and standard error accuracy under different conditions corresponding to sample sizes.
The response model chosen for the observed variables is a censored regression in which the unobserved trait and method variables are the regressors and the dependent variables are six observed indicators corresponding to the crossing of three traits and two methods. Thus, the observed variable $y_{tm}$ measuring trait $t$ with method $m$ is a censored version of a latent response $y^{*}_{tm}$, where $y^{*}_{tm}$ follows the linear factor model,

$$y^{*}_{tm} = \tau_{tm} + \lambda_{tm}\,\eta_t + \gamma_{tm}\,\xi_m + \epsilon_{tm}, \qquad \epsilon_{tm} \sim N(0, \sigma^{2}_{\epsilon,tm}). \tag{13}$$

The latent variables themselves are discrete interval-level variables with a multinomial distribution parameterized using the log-linear model

$$P(\eta_1 = \eta_{1,k_1}, \eta_2 = \eta_{2,k_2}, \eta_3 = \eta_{3,k_3}) = \frac{\exp(\mu_{k_1 k_2 k_3})}{\sum_{j_1, j_2, j_3} \exp(\mu_{j_1 j_2 j_3})},$$

where $\mu_{k_1 k_2 k_3} = \sum_{t=1}^{3} \alpha_{t k_t} + \phi_{12}\,\eta_{1,k_1}\eta_{2,k_2} + \phi_{13}\,\eta_{1,k_1}\eta_{3,k_3} + \phi_{23}\,\eta_{2,k_2}\eta_{3,k_3}$. This model, depicted in Figure 3, yields the following set of parameters, corresponding to the observed variable intercepts $\tau_{tm}$, trait loadings $\lambda_{tm}$, method loadings $\gamma_{tm}$, error variances $\sigma^{2}_{\epsilon,tm}$, as well as the latent variable log-linear intercepts $\alpha_{tk}$ and $\kappa_{tk}$ and latent log-linear associations $\phi_{tt'}$:

$$\theta = \{\tau_{tm}, \lambda_{tm}, \gamma_{tm}, \sigma^{2}_{\epsilon,tm}, \alpha_{tk}, \kappa_{tk}, \phi_{tt'}\}.$$

Furthermore, corresponding to the selected model from our application, we choose three categories for each latent trait and two for each latent method variable. To ensure parameter values are realistic, we set them to the maximum-likelihood estimates found in our application, and vary the sample size across conditions, $n \in \{200, 500, 1000, 2000\}$. The results of simulating data from this model and analyzing them using the GMTMM model are summarized in Table 2.
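A data-generating step of this kind can be sketched in Python as follows. Several details are assumptions made only for illustration: censoring is taken to be from below at zero, the latent category scores are arbitrary, all six indicators share one set of measurement parameters, and the latent variables are drawn independently and uniformly rather than from a log-linear joint distribution. None of the numeric values are the estimates from the application.

```python
import random


def simulate_case(rng, tau, lam, gamma, sigma,
                  trait_scores, method_scores, censor_at=0.0):
    """Simulate one linked case: y[t][m] for 3 traits x 2 methods.

    Latent variables are drawn independently and uniformly here for
    brevity; the model in the text uses a log-linear joint distribution.
    """
    eta = [rng.choice(trait_scores) for _ in range(3)]
    xi = [rng.choice(method_scores) for _ in range(2)]
    y = []
    for t in range(3):
        row = []
        for m in range(2):
            # Latent response from the linear factor model.
            y_star = tau + lam * eta[t] + gamma * xi[m] + rng.gauss(0.0, sigma)
            # Left-censoring at `censor_at` (an assumption of this sketch).
            row.append(max(censor_at, y_star))
        y.append(row)
    return y


rng = random.Random(2017)
data = [simulate_case(rng, tau=-0.5, lam=1.0, gamma=0.3, sigma=0.5,
                      trait_scores=(0.0, 0.5, 1.0), method_scores=(0.0, 1.0))
        for _ in range(500)]
share_censored = sum(v == 0.0 for case in data for row in case for v in row) / (500 * 6)
print(round(share_censored, 3))
```

The share of censored values depends entirely on the hypothetical parameter values chosen here; it simply shows how censoring produces the point mass at zero that is typical of income variables.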

Figure .
A generalized multitrait-multimethod (GMTMM) model for three "traits" using administrative data and a survey as measurement "methods. " The example traits signify personal income from full-time, part-time, and other kinds of employment over a certain period.  Table 2 summarizes the bias, defined as the difference between the true parameter value and the simulation average of the maximum likelihood estimate, as well as the ratio between the average simulation standard error and standard deviation over replications ("s.e./sd").
It can be seen in Table 2 that under all conditions, the bias is small for most parameters and the estimated standard errors accurately reflect the simulation standard deviation. Exceptions to this good performance are the latent variable intercepts (e.g., $\alpha_{21}$ and $\kappa_{11}$) in the condition with the smallest sample size (n = 200). Although the bias in this condition is smaller for the other latent intercept parameters, there is a clear pattern of overestimating the size of the largest class and underestimating that of the other classes. This bias disappears as the sample size grows larger. The other parameters do not appear to show any bias, even at the smallest sample size. Table 2 also shows the performance of information-based standard errors as an estimate of simulation standard deviation. The standard errors perform well when sample size is at least 500. In the smallest sample size condition, some of the standard errors tend to underestimate the simulation standard deviation, which will lead to undercoverage of confidence intervals.
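The two summaries reported in Table 2 can be computed from the replications as in the following sketch. The function name and the example numbers are hypothetical, and the sign convention (average estimate minus true value) is an assumption, since the text defines the bias only as a difference.

```python
import statistics


def summarize_replications(true_value, estimates, standard_errors):
    """Return (bias, s.e./sd ratio) for one parameter over replications."""
    # Bias: simulation average of the estimate minus the true value
    # (sign convention assumed here).
    bias = statistics.fmean(estimates) - true_value
    # Ratio of the average estimated standard error to the standard
    # deviation of the estimates over replications; values below 1 mean
    # the standard errors understate the sampling variability.
    ratio = statistics.fmean(standard_errors) / statistics.stdev(estimates)
    return bias, ratio


# Hypothetical replications for a single parameter with true value 1.0.
bias, ratio = summarize_replications(
    true_value=1.0,
    estimates=[1.0, 1.2, 0.8],
    standard_errors=[0.2, 0.2, 0.2],
)
```

A ratio close to 1 corresponds to the "accurate standard errors" described above, whereas a ratio clearly below 1 signals the undercoverage noted for n = 200.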
In summary, while the performance of the maximum-likelihood estimates is generally good, bias in some of the parameter estimates and in many of the standard errors occurred when the sample size was small (n = 200). Therefore, we recommend using the GMTMM model with samples of at least 500 linked cases.

Application to Administrative Data on Income
We applied the GMTMM model to a unique dataset provided by the research institute of the German Federal Employment Agency (Bundesagentur für Arbeit, BA). The BA's normal operations include job placement and payment of benefits, and for these purposes it maintains an extensive database of citizens' (un)employment histories dating back to 1975. This database covers German employees who are subject to social security contributions as well as recipients of entitlements, comprising about 86% of the overall German labor force. Excluded from the register are most civil servants, the self-employed, and others who have never been in contact with the Agency, such as the never-employed.
Both survey data and the BA's register data are routinely used for labor market and policy research, especially the data on income from employment. For consenting respondents, we gained IRB approval to link administrative record data from the Agency with a telephone survey conducted by the Institute for Employment Research (Institut für Arbeitsmarkt- und Berufsforschung, IAB). Restricted access to the anonymized linked survey-administrative data was provided at the Agency's offices (IAB Beschäftigtenhistorik (BEH) Version 09.01.00, Nürnberg 2012); the raw data cannot be made publicly available for legal reasons.

Table 3. Fit of GMTMM models for the measurement error in administrative and survey data on income. Rows correspond to models with different numbers of categories K for the latent true score ("trait") variable $\eta_t$.
Of particular interest are the BA's records on income from full-time, part-time, and "marginal" employment. "Marginal" employment, also known as a "Minijob," is a common form of low-income employment in Germany, yielding monthly income of up to 400 euros; at the time of data collection, an employee earning at or below this maximum was exempt from income taxes and social security contributions.
However, exactly because the income data were collected for the BA's administrative purposes, measurement error can become a serious issue for research in spite of reporting accuracy, because measurement errors in administrative data need not come from the reporting itself (Groen 2012). For example, although the employers will presumably fulfill their mandate to report accurately, when compiling historical records there may be mismatches and time lapses in an individual's record. Similarly, self-employment periods are absent from the records, again leading to a mismatch in "last part/full-time job," for instance. These issues will lead to random and correlated measurement error for research purposes.
To obtain the survey measurement, a stratified sample of 2400 respondents was asked to provide information on income from full-time, part-time, and marginal employment (see Eckman et al. 2014, for further description of the sample design). The survey had a response rate (AAPOR RR1) of 19.4%. In the following analyses, we accounted for the sample stratification using complex sampling adjustments. Of the respondents, 2284 (95%) provided informed consent to record linkage between the survey and the administrative registers. This linkage could be performed using unique person identifiers, so that it seems reasonable to assume no linkage errors were present. By linking the administrative data to the survey data, we thus obtained MTMM designs with three traits and two methods.
The register provides income data only at the level of employment spells. A spell typically corresponds to a full year if a respondent was employed by the same employer throughout that year. The survey, however, explicitly asks for the last monthly income from gainful employment, which is the standard reference period used in most German surveys. Assuming that salaries are paid evenly throughout the employment spell, the administrative data were converted to a monthly basis.
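The conversion to a monthly basis can be illustrated as follows. This is a minimal sketch under the even-payment assumption stated above; the function name, the inclusive treatment of the end date, and the use of an average month length of 365.25/12 days are assumptions made for illustration and are not taken from the source.

```python
from datetime import date

AVERAGE_DAYS_PER_MONTH = 365.25 / 12  # assumed month length


def monthly_income(spell_income, start, end):
    """Convert total income over an employment spell to a monthly amount.

    Assumes the salary is paid evenly over every day of the spell,
    with both the start and the end date counted as employed days.
    """
    days = (end - start).days + 1
    if days <= 0:
        raise ValueError("spell must end on or after its start date")
    return spell_income * AVERAGE_DAYS_PER_MONTH / days


# Hypothetical spell: 36,500 euros earned over the calendar year 2011.
example = monthly_income(36500.0, date(2011, 1, 1), date(2011, 12, 31))
```

For shorter spells the same function applies unchanged, which is why the even-payment assumption matters most for spells that cover only part of a year.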

Estimates of Reliability and Method Effects in Survey and Administrative Measures
To estimate the quality of the administrative register as well as the survey answers on income data, we adapt the model to recognize several aspects of the measurement process:

• Following the econometrics literature (Tobin 1958), censoring in income is accounted for;
• The relationship between true income and reported income is thought to be nonlinear (Kim and Tamborini 2014);
• Previous studies linking survey and register data (Scholtus, Bakker, and Van Delden 2015) suggested that there is a subgroup of respondents for whom the two measures correspond exactly, whereas for others they do not, possibly suggesting a heterogenous error process;
• There is a strong incentive to misreport one's income from a "Minijob" as being equal to or below 400 euros, since at the time of the survey this was the legal maximum income to qualify for tax exemption and social security exemption (see sec. 8 SGB [Social Security Code]).

Due to these factors, a linear Gaussian MTMM model will not suffice. Instead, we choose $f_y$ to be the standard censored regression equation, use the "nonparametric" latent class factor analysis formulation of $f_\xi$ and $f_\eta$ to allow for nonlinearity (Oberski, Hagenaars, and Saris 2015), and investigate whether an additional mixture component of S in which the response is unrelated to the true value fits the data more closely than a homogenous error structure. This model is no longer a standard structural equation model but can be estimated in Latent GOLD 5.0, a software package for latent class (factor) analysis (Vermunt and Magidson 2013). Program input can be found in the Appendix.
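To make the censored regression component concrete, the sketch below gives the log-likelihood contribution of a single observation in a Tobit-type model, conditional on the latent variables. It is an illustration only and not the Latent GOLD implementation: it assumes censoring from below at a known threshold `lower` and a normal error with standard deviation `sd`, and `mean` stands for the linear predictor formed from the intercept, trait, and method terms.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def normal_logcdf(x, mean, sd):
    """Log cumulative distribution function of a normal distribution."""
    z = (x - mean) / sd
    return math.log(0.5 * math.erfc(-z / math.sqrt(2.0)))


def censored_loglik(y, mean, sd, lower):
    """Log-likelihood of one observation censored from below at `lower`."""
    if y <= lower:
        # Censored case: only know that the latent response is <= lower.
        return normal_logcdf(lower, mean, sd)
    # Uncensored case: the latent response is observed directly.
    return normal_logpdf(y, mean, sd)
```

In the full model this contribution would be averaged over the categories of the latent trait, method, and error-structure variables, weighted by their estimated class sizes.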
The latent class factor analysis model does not impose a distribution on the latent trait and method factors, but instead approximates these distributions by discrete interval-level latent variables whose category sizes are estimated from the data (Vermunt and Magidson 2004). Moreover, the possibility of a heterogenous error structure suggests the presence of an additional discrete nominal latent variable S. Since the number of categories for the latent trait, method, and error structure variables is unknown, we compare the fit of models with differing numbers of categories for each of these. Since increasing the number of categories of the method factors and the error structure variables beyond two never improved the model, we only show these comparisons for models with differing numbers of categories K for the latent trait variables ($\eta_t$), with (|S| = 2) and without (|S| = 1) a heterogenous error structure. Table 3 shows the fit of these models in terms of log-likelihood (LL), BIC, and AIC, as well as the number of parameters these models have. The model with three latent categories and a heterogenous error process fits the data best in terms of BIC and AIC. This result suggests that there may indeed be differing error processes for different respondents. Since the model fit did not improve when increasing the number of latent categories from three to four, we selected the three-class heterogenous model. In other words, we approximate the distribution of true latent income with a discrete three-category latent variable for which the category sizes are estimated. We also allowed for some proportion of the observations to be unrelated to the true value, for example, because some fixed value (such as 400 euros) was always chosen in this group regardless of the true income. Table 4 shows the expected means of the administrative and survey measures of log-income for different categories of the latent trait and method variables.
The table illustrates how the observed measures are estimated by the model to relate to the respective latent variables. The relationships in Table 4 are marginalized over the two categories of the error process latent variable S. Thus, the table shows how the relationship holds for a respondent whose error process is not known in advance. The estimated proportions of units in each class of S are 0.95 and 0.05. In other words, about 5% (not shown in the table) are estimated to belong to the latent category in which a random value is given, that is, a value that is unrelated to the trait or method variables.
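The marginalization over S amounts to a weighted average of the class-specific expected values. The following sketch uses the estimated class sizes of 0.95 and 0.05 reported above; the function name and the two expected values in the example are hypothetical and are not taken from Table 4.

```python
def marginal_mean(mean_related, mean_unrelated, class_sizes=(0.95, 0.05)):
    """Expected observed value, marginalized over the error process S.

    mean_related:   expected value in the class where the response
                    depends on the trait and method variables
    mean_unrelated: expected value in the class where the response is
                    unrelated to the trait and method variables
    """
    p_related, p_unrelated = class_sizes
    return p_related * mean_related + p_unrelated * mean_unrelated


# Hypothetical log-income means for the two classes of S.
example_mean = marginal_mean(mean_related=8.0, mean_unrelated=6.0)
```

Because the second class carries only about 5% of the weight, the marginal expected values stay close to those of the class in which the response depends on the latent variables.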
The model is no longer linear, so that reliability and method effect coefficients, which represent (linear) correlations, are more difficult to interpret. However, it is possible to calculate the model-implied reliabilities $\mathrm{cor}(y_{tm}, \eta_t)$ and method effects $\mathrm{cor}(y_{tm}, \xi_m)$. These estimates, with confidence intervals based on bootstrapped standard errors, are shown in Figure 4. The figure shows that while the administrative data on income from full-time and marginal jobs are estimated to be superior to the survey measures, the survey measure has a stronger linear correlation with true income level from part-time work. A possible explanation for this difference is a change in mandatory reporting procedures regarding part-time employment in the year 2011. On the other hand, the survey measures do exhibit a strong method dependence, whereas again the administrative register measures were estimated to have no such method dependence.
In summary, we found for official administrative data obtained from the German Federal Employment Agency that the reliability of both survey and administrative data was far from perfect. Estimated relationships between these observed variables and other variables of scientific interest will therefore be biased. Moreover, for some of these measures, method effects were found. Such method effects, when ignored, will cause spurious relationships among the true income scores ("traits") of interest. When using administrative data, method dependence may be less of a concern. To prevent biases arising from measurement error in substantive analyses of income data, correction methods for known error processes may be needed (e.g., Saris and Gallhofer 2007; Vermunt 2010; Skrondal and Kuha 2012).

Discussion and Conclusion
We showed how the quality of survey and administrative data can be evaluated using generalized multitrait-multimethod (GMTMM) models. This approach is an improvement over existing methods, which assume that either the survey or the administrative data are perfect measures. A general framework for data quality evaluation was introduced. This framework is better suited than existing MTMM approaches to particularities of administrative data such as categorical measurement, nonlinearities, heterogenous error processes, and nonnormality. We demonstrated the use of GMTMM models by applying them to administrative and survey data on income from employment from the German Federal Employment Agency. A simulation study demonstrated good properties of the maximum-likelihood estimates for a GMTMM model with moderate sample sizes, and a robustness study indicated that parameter estimates are not highly sensitive to identifying assumptions.
A clear advantage of our approach is that it allows for the presence of measurement error in both the survey and the administrative register. Furthermore, using the administrative register as a second measure in the MTMM design has an additional advantage over classical MTMM designs using repeated survey measures. When repeated survey measures are used, survey respondents must answer questions on the same topic twice and may remember their answer, creating dependencies that are not modeled (Alwin 2011), although van Meurs (1995) provided some evidence that this might not occur in practice when sufficient time is allowed between the repetitions. The problem of memory bias does not occur, however, when the measurement methods are administrative and survey data collected separately. Therefore, in addition to allowing for the estimation of measurement error in administrative records, the MTMM design using linked survey-register data is an attractive method of estimating measurement error in survey variables.

Some limitations of our work remain. First, our model assumed that traits and methods are independent. While the robustness study indicates that the parameters of primary interest may not be highly sensitive to this assumption, it cannot rule out that very strong dependencies between traits and methods will produce bias. We note that it is possible to define a subclass of identifiable GMTMM models that do allow for dependencies among traits and methods, and between methods (linear MTMM models are known to lie outside this subclass, e.g., Kenny and Kashy 1992). However, this subclass will rely heavily on higher-order moments for identification, which in practice may lead to high-variance estimates. In future studies, it will be of interest to investigate the conditions under which such models can be applied.
Second, we did not discuss model fit evaluation. This issue is not specific to GMTMM modeling, so that the standard machinery available for global and local fit assessment in generalized latent variable models can be applied to GMTMM modeling (see, e.g., Skrondal and Rabe-Hesketh 2004; Oberski, Van Kollenburg, and Vermunt 2013). Third, little is known about the small sample properties of GMTMM model estimates. While simulation results by Scholtus, Bakker, and Van Delden (2015) on the linear MTMM model were positive, other types of GMTMM models were not evaluated. This remains a topic for future research. Finally, in our application on German data, unique identifiers were available that allowed for close linkage between the survey and register. In other applications, however, such identifiers may not be available for legal reasons or they may not exist. In such cases, linkage error will occur as well as measurement error. As indicated, the heterogenous error process may be employed to model such errors in a fashion similar to the observed multiple regression models of Lahiri and Larsen (2005). However, evaluating the performance of this solution and the interaction between linkage and measurement error remains a topic for future study.