Dealing with covariate measurement error in a clustered cross-sectional survey

Abstract Many surveys are complex cross-sectional studies that involve clustered data. Such surveys can be further complicated by measurement error. Ignoring the measurement error and the clustering may lead to incorrect inferences and conclusions. The purpose of this study was to demonstrate the application of regression calibration to correct for covariate measurement error in a clustered cross-sectional survey within a generalized estimating equations (GEE) framework. Methods that ignore both covariate measurement error and within-cluster correlation structure are compared to the proposed regression calibration-GEE method. The study found that clustering does not affect the association estimates adjusted for measurement error using regression calibration. However, the standard errors of the coefficient estimates are overestimated or underestimated by methods that ignore the within-cluster dependency, even when they adjust for measurement error. Specifically, for clusters of size 10 and under unstructured and exchangeable correlation structures, the standard error was about 10.3% higher and 13.6% lower, respectively, in the method that ignores the within-cluster dependency than in the proposed method. From the findings of this study, we conclude that it is important to adjust for covariate measurement error in clustered data while accounting for the within-cluster correlation.


Introduction
Many surveys are complex in design and cross-sectional in nature. These surveys make use of data collection tools that are prone to measurement error, for instance, self-reported questionnaires. Measurement error (ME) in exposures (or covariates) biases the association between the covariate and an outcome. The bias can be in any direction depending on the error structure (Agogo, 2017; Fosgate, 2006; Fuller, 2009; Hill & Kleinbaum, 2014; Stefanski, 1985). Study designs range from simple to complicated. In many cross-sectional surveys with complex study design features, data within clusters are usually correlated (Akter et al., 2018; Hanley et al., 2003; Liang & Zeger, 1993; Neuhaus et al., 1991; Santos et al., 2008). Analyzing such data with standard methods that ignore covariate ME and clustering may lead to invalid inferences and conclusions. Regression calibration (RC) is a popular technique for adjusting for ME in a continuous covariate. It replaces the error-prone covariate with the conditional expectation of the true covariate, given the measured covariate and a vector of error-free covariates (Agogo et al., 2014; Carroll et al., 2006; Carroll & Stefanski, 1990; Freedman et al., 2008; Gleser, 1990). In a clustered survey, the generalized estimating equations (GEE) approach is commonly used to account for the within-cluster dependencies while estimating the association parameter of interest (Hanley et al., 2003).
Currently, there is limited research focusing on correcting for covariate ME, while accounting for survey design simultaneously in cross-sectional surveys. In this work, we demonstrated how to apply RC in a GEE context to correct for covariate ME while accounting for within-cluster correlation. We re-emphasize the need to correct for ME in a covariate and simultaneously allow for correlation structure in clustered data.
The rest of this paper is organized as follows. Section 2 presents the methods and materials for this study: we review the RC method and the GEE approach, describe the simulation design and provide a real-data example. Simulation and real-data results are presented in section 3. Section 4 provides a discussion and concluding remarks.

Regression calibration method
Usually, in epidemiological studies, it is impossible to observe the true covariate of interest, $X$. Instead, we observe a mismeasured covariate, $Q$. Regression calibration was first proposed by Carroll and Stefanski (1990) and Gleser (1990) as a method for correcting ME in covariates. Regression calibration involves approximating the conditional expectation of the true covariate given the mismeasured covariate and a vector of error-free covariates (Freedman et al., 2008; Guolo, 2008; Küchenhoff & Carroll, 1997). The basic idea of RC is to replace $X$, which is unobservable, with an estimate $Q_{calib}$, a function of the error-prone covariate $Q$ and a vector of error-free covariates $Z$. Regression calibration is applicable under the assumptions that: (i) the measurement error in the observed covariate $Q$ is non-differential with respect to $X$ and the vector of error-free covariates, that is, the measured covariate contains no extra information about the outcome other than what is contained in the true covariate (Carroll et al., 2006); and (ii) the measurement error in the unbiased measurement, say $R$, of the true covariate $X$ is uncorrelated with the measurement error in the observed covariate $Q$ and with the true covariate $X$. Note that $R$ is a reference measurement from the calibration study.
Regression calibration is implemented in two main steps:

Step 1. Estimating the calibration function. This involves estimating the conditional expectation of $X$ given $Q$ and $Z$, denoted by

$$E[X \mid Q, Z] = Q_{calib}, \tag{1}$$

where $Q_{calib}$ is the calibrated version of $Q$. In the calibration function in equation (1), the unobservable true covariate $X$ is replaced with $R$, which can be obtained from validation, replication or instrumental data. Therefore, equation (1) can be re-expressed as

$$E[R \mid Q, Z] = Q_{calib}.$$

Step 2. Using $Q_{calib}$ instead of $Q$ in the standard analysis to obtain the parameter estimate that quantifies the association between the outcome and the covariate of interest, given the error-free covariates.

Zeger and Liang (1986) proposed the GEE to extend generalized linear models (GLMs) to the analysis of correlated observations. The GEE approach requires the specification of the first two moments (mean and variance) of responses from the same cluster and a working correlation, rather than the full specification of the joint distribution (Akter, Sarker, & Rahman, 2018). The GEE yields asymptotically unbiased regression coefficient estimates regardless of the specified correlation structure. The GEE estimates have a marginal, population-averaged interpretation.

The GEE approach
Assume that a population of size $N$ is divided into $K$ non-overlapping clusters of sizes $n_i$ ($i = 1, 2, \ldots, K$). Let $Y_{ij}$, $j = 1, \ldots, n_i$, be the $j$th response from the $i$th cluster and $X_{ij} = (x_{ij1}, \ldots, x_{ijp})$ be a vector of the corresponding $p$ covariates. Using the GLM framework, the marginal expectation $E(Y_i \mid X_i) = \mu_i$ is modeled as

$$\phi(\mu_i) = X_i \beta,$$

where $\beta$ is a vector of regression coefficients to be estimated, $X_i$ is a matrix whose first column is a vector of 1's corresponding to the intercept terms and $\phi(\cdot)$ is the appropriate link function. For a binary response variable, a logit link can be used such that the mean model can be expressed as

$$\log\left(\frac{\mu_{ij}}{1 - \mu_{ij}}\right) = X_{ij}\beta.$$

We denote the working covariance matrix by

$$V_i = A_i^{1/2}\, \rho_i(\alpha)\, A_i^{1/2},$$

where $A_i$ is the $n_i \times n_i$ diagonal matrix of the marginal variances of the responses in cluster $i$ and $\rho_i(\alpha)$ is the corresponding working correlation matrix, which depends on some vector of parameters $\alpha$ that is generally unknown.
Assuming that the structure of $\rho_i(\alpha)$ is known, the regression parameters $\beta$ can be estimated by solving the GEE,

$$\sum_{i=1}^{K} D_i^{T} V_i^{-1} (Y_i - \mu_i) = 0,$$

where $D_i = \partial \mu_i / \partial \beta$ and $V_i$ is the working covariance matrix of cluster $i$.

The four commonly used correlation structures are the exchangeable, independence, autoregressive (AR) and unstructured structures. In the exchangeable structure, it is assumed that any two observations within a cluster are equally correlated with correlation $\rho$ (fixed), but observations from different clusters are assumed to be uncorrelated. For the $i$th cluster with size $n_i$, the exchangeable (or compound symmetry) correlation matrix can be expressed as follows:

$$\rho_i(\alpha) = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}_{n_i \times n_i}.$$

Horton and Lipsitz (1999) proposed the exchangeable structure as the appropriate correlation structure for handling data from a complex clustered design, where observations from the same cluster are not ordered chronologically as they are in longitudinal data.

Under the independence (or scaled identity) correlation structure, it is assumed that there is no correlation between observations; hence, there is no need for GEE. The independence working correlation matrix for the $i$th cluster can be expressed as follows:

$$\rho_i(\alpha) = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}_{n_i \times n_i}.$$

In the AR correlation structure, which is more appropriate for observations made over time on the same unit, repeated observations that are close together in time are strongly correlated, and the correlation becomes weaker as the repeated observations get further apart in time.
The correlation between, say, the $a$th and $b$th observations in cluster $i$ is given by $[\rho_i(\alpha)]_{ab} = \rho^{|a-b|}$, where $0 \le \rho \le 1$, as shown in the AR(1) correlation matrix below:

$$\rho_i(\alpha) = \begin{pmatrix} 1 & \rho & \rho^{2} & \cdots & \rho^{n_i-1} \\ \rho & 1 & \rho & \cdots & \rho^{n_i-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n_i-1} & \rho^{n_i-2} & \rho^{n_i-3} & \cdots & 1 \end{pmatrix}_{n_i \times n_i}.$$

In the unstructured correlation structure, no constraints are imposed, and the correlation can differ between every pair of observations in a cluster. Though this correlation structure is flexible, fitting it becomes computationally costly, as the number of parameters to be estimated increases with the number of observations in a cluster.

Zeger and Liang (1986) proposed an iterative procedure for obtaining the GEE estimates $\hat\beta$ of $\beta$ under the exchangeable correlation structure. The first step involves choosing the initial estimate $\beta_0$ of $\beta$, obtained by fitting a GLM under the independence working correlation. In the second step, we set $\hat\beta = \beta_0$ and calculate the moment estimate $\hat\alpha$ of $\alpha$; for instance, for the exchangeable working correlation matrix $\rho(\alpha)$, $\hat\alpha$ is calculated by averaging the cross-products of Pearson residuals over pairs of observations within the same cluster.

GEE procedure
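In the third step, the working correlation matrix $\rho(\hat\alpha)$ obtained in the second step is used to update the current estimate $\beta^{t}$ using the Newton-Raphson method as

$$\beta^{t+1} = \beta^{t} + \left[\sum_{i=1}^{K} D_i^{T} V_i^{-1} D_i\right]^{-1} \left[\sum_{i=1}^{K} D_i^{T} V_i^{-1} (Y_i - \mu_i)\right],$$

where $D_i = \partial\mu_i/\partial\beta$, $V_i$ is the working covariance matrix built from $\rho(\hat\alpha)$, and all quantities on the right-hand side are evaluated at $\beta^{t}$. Steps two and three are repeated until convergence to obtain the estimate $\hat\beta$ of $\beta$.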
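The standard error (SE) of the GEE estimate is commonly calculated using the sandwich-based robust method, because the sandwich-based robust estimator is consistent and asymptotically unbiased even when the working correlation structure is mis-specified. The variance of $\hat\beta$, $\operatorname{var}(\hat\beta)$, is obtained by substituting the estimate of $\beta$ at each iteration and updating the following equation for the final estimate:

$$\operatorname{var}(\hat\beta) = M_0^{-1} M_1 M_0^{-1},$$

where $M_0 = \sum_{i=1}^{K} D_i^{T} V_i^{-1} D_i$, $M_1 = \sum_{i=1}^{K} D_i^{T} V_i^{-1} (Y_i - \hat\mu_i)(Y_i - \hat\mu_i)^{T} V_i^{-1} D_i$, $D_i = \partial\mu_i/\partial\beta$ and $V_i$ is the working covariance matrix of cluster $i$.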
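The three steps and the sandwich variance can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the study (which was carried out in R): the function names, the simulated data, the scale parameter fixed at 1 and the absence of a degrees-of-freedom correction in the moment estimate are all choices made here for brevity.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def gee_pieces(beta, alpha, y, X, groups):
    """Bread H, score U and meat M for a logit mean model with an
    exchangeable working correlation (scale parameter assumed equal to 1)."""
    p = X.shape[1]
    mu = expit(X @ beta)
    v = mu * (1.0 - mu)                       # binomial variance function
    H, U, M = np.zeros((p, p)), np.zeros(p), np.zeros((p, p))
    for idx in groups:
        n = len(idx)
        corr = (1.0 - alpha) * np.eye(n) + alpha * np.ones((n, n))
        s = np.sqrt(v[idx])
        V = s[:, None] * corr * s[None, :]    # A^{1/2} rho(alpha) A^{1/2}
        D = v[idx][:, None] * X[idx]          # d mu / d beta under the logit link
        VinvD = np.linalg.solve(V, D)
        Ui = VinvD.T @ (y[idx] - mu[idx])     # cluster contribution to the score
        H += D.T @ VinvD
        U += Ui
        M += np.outer(Ui, Ui)
    return H, U, M

def fit_gee_exchangeable(y, X, cluster, n_iter=50, tol=1e-8):
    groups = [np.flatnonzero(cluster == g) for g in np.unique(cluster)]
    beta = np.zeros(X.shape[1])
    # Step 1: starting value from an independence fit (alpha = 0).
    for _ in range(n_iter):
        H, U, _ = gee_pieces(beta, 0.0, y, X, groups)
        step = np.linalg.solve(H, U)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    alpha = 0.0
    for _ in range(n_iter):
        # Step 2: moment estimate of alpha from Pearson residual cross-products.
        mu = expit(X @ beta)
        r = (y - mu) / np.sqrt(mu * (1.0 - mu))
        num = sum(r[i].sum() ** 2 - (r[i] ** 2).sum() for i in groups)
        den = sum(len(i) * (len(i) - 1) for i in groups)
        alpha = num / den
        # Step 3: update beta given the current working correlation.
        H, U, _ = gee_pieces(beta, alpha, y, X, groups)
        step = np.linalg.solve(H, U)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Sandwich (robust) variance at the final estimates.
    H, _, M = gee_pieces(beta, alpha, y, X, groups)
    Hinv = np.linalg.inv(H)
    return beta, alpha, np.sqrt(np.diag(Hinv @ M @ Hinv))

# Hypothetical data: 200 clusters of size 5 with a shared cluster effect.
rng = np.random.default_rng(1)
K, n = 200, 5
cluster = np.repeat(np.arange(K), n)
x = rng.normal(size=K * n)
b = np.repeat(rng.normal(size=K), n)
y = rng.binomial(1, expit(-0.5 + 1.0 * x + b))
X = np.column_stack([np.ones(K * n), x])
beta_hat, alpha_hat, se_hat = fit_gee_exchangeable(y, X, cluster)
print(beta_hat, alpha_hat, se_hat)
```

Because the data are generated with a cluster-level random effect, the fitted slope is a population-averaged coefficient and is expected to be somewhat smaller in magnitude than the cluster-specific value of 1.0 used to simulate the data.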

Monte Carlo simulations
In this study, we first use Monte Carlo simulations to show the application of RC in GEE for analyzing clustered data when the covariate is subject to ME. The simulations were conducted in R software. This section provides details of the simulation design, a description of the methods used and how the methods are evaluated.

Simulation design
For simplicity and without loss of generality, we focus on the following binary logit model with two regressors, one of which is subject to additive ME:

$$\operatorname{logit}\{P(Y = 1 \mid X_1, X_2)\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2, \qquad Q = X_1 + U,$$

where $X_1 \sim N(5, 4)$, $X_2$ is a binary covariate (assumed to be error-free) and $Q$ is the mis-measured version of $X_1$. The additive error $U$ is assumed to follow a normal distribution with mean 0 and variance $\sigma^2_U = 25$. Note that the binary outcome $Y$ is generated based on $X_1$, $X_2$ and a pre-defined working correlation structure using the rbin function in the SimCorMultRes package (Touloumis, 2016).
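To give a sense of how severe this error is, the calibration slope implied by the design can be worked out directly. The calculation below is a sketch that assumes the second argument of $N(5, 4)$ is a variance, that $X_1$ and $U$ are independent, and that $X_2$ carries no information about $X_1$; none of these is stated explicitly in the text.

```latex
\[
E[X_1 \mid Q] = \mu_X + \lambda\,(Q - \mu_X),
\qquad
\lambda = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_U^2}
        = \frac{4}{4 + 25} = \frac{4}{29} \approx 0.138 .
\]
```

Under these assumptions, a naive analysis that uses $Q$ in place of $X_1$ would be expected to recover only roughly 14% of the true coefficient in a linear model, and approximately so in a logistic model with a modest effect size.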
The unbiased version, $R$, of $X_1$ is simulated such that it contains a small additive ME, $u$, that is, $R = X_1 + u$, where $u \sim N(0, 0.01)$. We generate a total of 100 clusters with cluster sizes $n_i \in \{5, 10, 30, 90, 200\}$, assuming the commonly used correlation structures described in section 2.2. For illustrative purposes, the following $n_i \times n_i$ working correlation matrices are used in the simulation of the clustered observations:
1. For exchangeable correlation structure, we use a working correlation matrix of the form:
3. For unstructured working correlation, we first generate a positive definite covariance matrix, and then convert it to a correlation matrix. This is implemented in the clusterGeneration package (Qiu et al., 2015).
4. For the independence correlation structure, we model the simulated data using GLM.
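The working correlation matrices used in such a simulation can be built as in the sketch below. The correlation values passed to the functions are hypothetical, since the values used in the study are not shown here, and the random positive definite matrix is a plain NumPy stand-in for the R package cited above rather than a reproduction of it.

```python
import numpy as np

def exchangeable(n, rho):
    """Equal correlation rho between every pair of observations."""
    return (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))

def ar1(n, rho):
    """Correlation rho**|a - b| between observations a and b."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def independence(n):
    return np.eye(n)

def random_unstructured(n, rng):
    """Generate a positive definite covariance matrix, then rescale it
    to a correlation matrix (unit diagonal)."""
    a = rng.normal(size=(n, n))
    cov = a @ a.T + n * np.eye(n)      # symmetric and positive definite
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)

rng = np.random.default_rng(0)
R_exch = exchangeable(5, 0.5)          # rho = 0.5 is an illustrative value
R_ar1 = ar1(5, 0.5)
R_un = random_unstructured(5, rng)
```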
Survey weights form a key feature of complex clustered surveys and are used to ensure that statistics calculated from the data are more representative of the population of interest. To incorporate this feature, the binary covariate $X_2$ is simulated such that it takes two possible values, Male and Female, with probabilities 0.6 and 0.4, respectively. To account for the simulation of the values of $X_2$ with unequal probabilities, we use the rake function in the survey package (Lumley & Lumley, 2007) to create weights for the simulated clustered data.

Calibration and methods description
The calibrated version of the observable mis-measured version of $X_1$, $Q_{calib}$, is the predicted value obtained in the linear regression of $R$ on $Q$ and the error-free covariate, $X_2$. Thus the calibrated exposure variable of interest is given by

$$Q_{calib} = \hat{E}[R \mid Q, X_2] = \hat\lambda_0 + \hat\lambda_1 Q + \hat\lambda_2 X_2,$$

where $\hat\lambda_0$, $\hat\lambda_1$ and $\hat\lambda_2$ are the estimated coefficients of that linear regression.

We compare the estimates of the association between the outcome and the covariate of interest obtained from the following methods:
M1 True GEE: This method relates the outcome ($Y$) to the true simulated covariate ($X_1$) and an error-free covariate ($X_2$), taking into consideration the within-cluster correlation structure.
M2 Naive GEE: In this method, we modeled the association between Y and (Q, X 2 ), taking into consideration the within-cluster correlation structure.
M3 Calibrated GEE: A method taking into consideration the correlation structure of observations within a cluster and relating Y and (Q calib , X 2 ).
M4 True GLM: In this method, we modeled the association between Y and (X 1 , X 2 ) without taking into consideration the within-cluster correlation.
M5 Naive GLM: A method that ignores both the covariate ME and within-cluster dependencies.
M6 Calibrated GLM: This method related Y and (Q calib , X 2 ) ignoring the within-cluster dependencies.
The methods are summarized in the flow-chart diagram shown in Figure 1.
We compared the results obtained by using the methods described in section 2.3.2, under correctly specified within-cluster correlation structure and different cluster sizes (n i ). We also compared the results from the different methods when the within-cluster correlation structure is mis-specified. The simulations were repeated 500 times. A random seed was used to ensure the reproducibility of the results. We provide the mean coefficient estimates and Monte Carlo standard errors in the supplemental data for this article.
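The calibration step and the contrast between the true, naive and calibrated fits can be sketched as follows. This is an illustration under simplifying assumptions, not the study's code: the observations are generated independently (no clustering, so the fits correspond to the GLM variants M4 to M6), the coefficient values in `beta` and the 0/1 coding of $X_2$ are hypothetical, and the logistic fit is a plain Newton iteration.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 20000
x1 = rng.normal(5.0, 2.0, n)          # true covariate, variance 4
x2 = rng.binomial(1, 0.4, n)          # error-free binary covariate (hypothetical coding)
q = x1 + rng.normal(0.0, 5.0, n)      # error-prone version, error variance 25
r = x1 + rng.normal(0.0, 0.1, n)      # reference measurement, error variance 0.01
beta = np.array([-1.5, 0.3, 0.5])     # hypothetical true coefficients
eta = beta[0] + beta[1] * x1 + beta[2] * x2
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

def fit_logit(design, response, n_iter=50, tol=1e-10):
    """Logistic regression by Newton's method (independent observations)."""
    coef = np.zeros(design.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(design @ coef)))
        w = mu * (1.0 - mu)
        step = np.linalg.solve(design.T @ (w[:, None] * design),
                               design.T @ (response - mu))
        coef = coef + step
        if np.max(np.abs(step)) < tol:
            break
    return coef

ones = np.ones(n)

# Step 1: calibration function, linear regression of R on Q and X2.
calib_design = np.column_stack([ones, q, x2])
lam = np.linalg.lstsq(calib_design, r, rcond=None)[0]
q_calib = calib_design @ lam

# Step 2: use the calibrated covariate in place of Q in the outcome model.
b_true = fit_logit(np.column_stack([ones, x1, x2]), y)
b_naive = fit_logit(np.column_stack([ones, q, x2]), y)
b_calib = fit_logit(np.column_stack([ones, q_calib, x2]), y)
print("true:", b_true[1], "naive:", b_naive[1], "calibrated:", b_calib[1])
```

In this setting the naive coefficient for the covariate of interest is strongly attenuated toward zero, while the calibrated coefficient lies much closer to the value used to generate the data. Regression calibration is an approximation in logistic models, so a small residual bias is expected.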

Application to real data
In this study, we illustrate the use of RC to correct for covariate ME in real clustered cross-sectional data. Specifically, we used a subset of data on cigarette smokers extracted from the South African National Health and Nutrition Examination Survey 2011-2012 (SANHANES-1). The survey applied a stratified cluster sampling approach (Human Sciences Research Council, 2017). Enumeration areas (EAs) were the primary sampling units, and the selection of EAs was stratified by province. Responses from the same EA are likely to be correlated in this survey, since they share the same cluster information. We focused on modeling the association between coughing status and smoking. In the study, smoking was quantified using the self-reported average number of cigarettes smoked per week. In addition to the average number of cigarettes smoked per week, some smokers reported the number of cigarettes smoked daily. The self-reported number of cigarettes smoked weekly is prone to ME, and therefore using it to model the association between coughing and smoking yields biased estimates of the association.
We first adjusted for ME in the average number of cigarettes smoked per week before modeling the association between coughing and smoking. In this study, the number of cigarettes smoked daily was used to calibrate the number smoked weekly in the following RC setting:

$$Q_{ij,calib} = E[R_{ij} \mid Q_{ij}, Z_{ij}],$$

where, for the $j$th response in the $i$th cluster, $R_{ij}$ is the number of cigarettes smoked daily, $Q_{ij}$ is the number of cigarettes smoked weekly, $Z_{ij}$ is an error-free covariate (in this case, gender) and $Q_{ij,calib}$ is the calibrated number of cigarettes smoked weekly. Taking into consideration the survey design features (i.e. clustering, stratification and sampling weights), we modeled the association between coughing status (1 = Yes, 0 = No) and the calibrated number of cigarettes as follows:

$$\phi\{E(Y_{ij})\} = \beta_0 + \beta_1 Q_{ij,calib} + \beta_2 Z_{ij},$$

where $\phi(\cdot)$ is a logit link function, $Y_{ij}$ is the coughing status of the $j$th individual from the $i$th cluster (EA), $\beta_0$ is the intercept term, $\beta_1$ is the coefficient for the calibrated number of cigarettes and $\beta_2$ is the coefficient for gender. We compared $\beta_1$ and its SE with those obtained when using a naive model under different correlation structure considerations.

Simulation results
Table 1 shows the relative bias, standard error (SE) and mean squared error (MSE) of the estimate of the association between the outcome and the covariate of interest obtained using the methods described in section 2.3.2, for different cluster sizes and correctly specified working correlation structures. We considered clusters with 5, 10, 30, 90 and 200 observations. This facilitates a comparison of how the models perform at different cluster sizes.
The relative bias of the regression coefficient estimates obtained using the calibrated GEE, and calibrated GLM under different cluster sizes, and correctly specified correlation structures was close to zero. As the clusters become bigger, the relative bias approaches zero (Table 1). Negative relative bias is obtained when naive methods are used.
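The three performance measures can be computed from the replicate estimates as in the sketch below. The definitions are the conventional ones and are assumed here, since the text does not spell them out; in particular, the SE is taken to be the empirical standard deviation of the estimates across replications, and the function name is hypothetical.

```python
import numpy as np

def summarize(estimates, truth):
    """Relative bias, Monte Carlo SE and MSE of replicate estimates."""
    est = np.asarray(estimates, dtype=float)
    return {
        "relative_bias": (est.mean() - truth) / truth,
        "mc_se": est.std(ddof=1),
        "mse": float(np.mean((est - truth) ** 2)),
    }

# Two illustrative replicate estimates of a coefficient whose true value is 1.0.
out = summarize([0.9, 1.1], truth=1.0)
print(out)
```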
The results further showed that when the exchangeable and AR(1) correlation structures are correctly specified in clusters with 5, 10 and 30 observations, the SE obtained when using the calibrated GEE method is larger than that obtained when using the calibrated GLM method. The SEs obtained in bigger clusters are essentially the same; for instance, for correctly specified AR(1) and $n_i = 90$, the SE obtained from both calibrated GEE and calibrated GLM is 0.014, and for $n_i = 200$, the SE is 0.009. A similar pattern is observed for the SEs obtained from naive methods. When the unstructured correlation structure is correctly specified, the SEs obtained under calibrated GEE are slightly lower than those obtained with calibrated GLM.
The MSEs obtained when using the calibrated methods are smaller and closer to zero than those from the naive methods.

Table 2 shows the relative bias, SE and MSE for the coefficient estimate of the association between the outcome and the covariate of interest obtained using different methods, with a correctly specified and a mis-specified within-cluster dependency structure. With the calibrated GEE method, mis-specifying the exchangeable correlation structure as AR(1) resulted in relatively higher bias. However, with the naive GEE, mis-specifying the correlation structure does not change the relative bias. A similar pattern is observed when the AR(1) dependency structure is mis-specified as exchangeable. With the calibrated GEE method, mis-specifying the unstructured dependency structure as either exchangeable or AR(1) results in higher relative biases and SEs. Similar SEs are obtained under mis-specification of the exchangeable and AR(1) correlation structures, whereas slightly higher SEs are obtained under mis-specification of the unstructured correlation structure. The MSEs remain unchanged under mis-specification of the dependency structures. For further details, see Table S2 in the supplemental data for this article.

Table 3 shows the results obtained from analyzing the real data as described in section 2.4. The results show that using the number of cigarettes smoked per week without adjusting for ME yielded lower odds of coughing than when the covariate was adjusted for ME. For instance, under the exchangeable correlation structure, the odds of coughing increase by 0.1% ($e^{0.001} - 1 \approx 0.001$) per unit increase in the number of cigarettes smoked per week under the naive model, and by 0.4% when the number of cigarettes is adjusted for ME. Notably, the coefficient estimates are approximately similar across the correlation structures considered, but the SEs differ.
The P-values obtained under the independence correlation structure are smaller than those obtained under either the exchangeable or AR(1) correlation structures.
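The percentage changes in the odds quoted for the real-data example follow from the usual conversion of a logit coefficient. In the worked arithmetic below, the coefficient 0.001 is the naive value reported in the text, while the value of about 0.004 for the calibrated model is inferred here from the reported 0.4% and is not a figure taken from Table 3.

```latex
\[
\%\ \text{change in odds per unit} = 100\,\bigl(e^{\hat\beta_1} - 1\bigr),
\qquad
100\,\bigl(e^{0.001} - 1\bigr) \approx 0.10\%,
\qquad
100\,\bigl(e^{0.004} - 1\bigr) \approx 0.40\% .
\]
```

For coefficients this small, $e^{\hat\beta_1} - 1 \approx \hat\beta_1$, so the percentage change is nearly 100 times the coefficient itself.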

Discussion and conclusion
In this study, we have shown the application of RC in GEE for analyzing data when the covariate is subject to ME. In the simulation study, we compared results from naive and calibrated models under correctly specified and mis-specified correlation structures. The relative bias of the regression coefficient estimates obtained using both the calibrated GEE and calibrated GLM models across different cluster sizes was close to zero, an indication that the coefficient estimates obtained after adjusting for covariate ME closely approximated the true coefficient. Furthermore, the results imply that RC is not sensitive to changes in cluster sizes and the within-cluster dependencies.

Table 2. Comparison of relative bias, SE and MSE for the estimate of the association between the outcome and covariate of interest obtained using different methods with correctly specified and mis-specified dependency structure ($n_i = 10$)
The negative relative bias obtained under the naive GLM is an indication that ignoring the covariate ME led to underestimation of the true coefficient. Our finding is in line with Stefanski et al. (1985), who noted that ME in covariates attenuates predicted probabilities in logistic regression. The underestimation effect was also observed in the method that considered the dependency structure but ignored the covariate ME. This is a clear indication that covariate ME in clustered data can lead to underestimation of the true association between the covariate and an outcome.
As expected, the SEs and the MSEs of the coefficient estimates were found to decrease with an increase in cluster sizes, due to the reduced uncertainty in estimating the true coefficient. Differences in the SEs of the coefficient estimates obtained from the GLM and GEE models can be attributed to the within-cluster correlations. The smaller MSEs obtained with the calibrated methods than with the naive methods imply that better estimates are obtained under the calibrated models.
The comparison of the relative bias, SEs and MSEs of the coefficient estimate of the association between an outcome and a covariate subject to ME under mis-specification of the within-cluster correlation structure has two implications: (i) mis-specifying the exchangeable working correlation structure as AR(1), and vice versa, can yield approximately similar results; (ii) mis-specifying the unstructured correlation structure as either exchangeable or AR(1) can result in either smaller or larger coefficient estimates and SEs. The AR(1) correlation structure is commonly used in longitudinal data and therefore, as proposed by Horton and Lipsitz (1999), and from the findings of our study, the exchangeable correlation structure may be the only stable option for handling clustered cross-sectional data.
As a motivating example, we showed the use of RC to correct for ME in cross-sectional data from SANHANES-1. The results re-affirmed that ignoring ME in a covariate can underestimate the association between the covariate and an outcome in complex surveys. Furthermore, the results showed that ignoring the correlation structure in clustered data can underestimate the SEs of the coefficient estimates (Hu et al., 1998; Ghisletta & Spini, 2004) and produce smaller P-values (Ying et al., 2017), irrespective of whether or not the ME in the covariate is corrected.
The study has the advantage that, apart from adjusting for within-cluster dependencies and covariate ME, it incorporates other survey design features such as stratification and sampling weights. Our study has a few limitations: (1) for simplicity and illustration purposes, we assumed that the covariate of interest is measured with classical additive error. However, in practice, the covariate can be measured with systematic error. In such a case, the systematic error components can be incorporated in the measurement error model in equation (11); (2) although a covariate can have a multiplicative measurement error structure (Heid et al., 2004), our study assumed an additive measurement error structure. A covariate measured with multiplicative error can be handled by first converting the multiplicative structure to an additive structure, through an appropriate transformation that linearizes the error structure.
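The transformation mentioned in limitation (2) can be illustrated with the logarithm. The display below is a sketch under assumptions that are not part of the study: the covariate and the error are strictly positive, and the error $U$ is independent of $X$ with $E[\log U] = 0$.

```latex
\[
Q = X \cdot U
\quad\Longrightarrow\quad
\log Q = \log X + \log U ,
\]
```

so that, on the log scale, $\log Q$ measures $\log X$ with classical additive error $\log U$, and the calibration function can be estimated for $\log X$ given $\log Q$ and the error-free covariates.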
From the findings of this study, we conclude that it is important to adjust for covariate ME in clustered data while accounting for within-cluster correlation.

Disclosure statement
No potential conflict of interest to declare.

Funding
This work was supported through the DELTAS Africa Initiative. The DELTAS Africa Initiative is an independent funding scheme of the African Academy of Sciences (AAS)'s Alliance for Accelerating Excellence in Science in Africa (AESA), and is supported by the New Partnership for Africa's Development Planning and Coordinating Agency (NEPAD Agency), with funding from the Wellcome Trust [grant 107754/Z/15/Z-DELTAS Africa Sub-Saharan Africa Consortium for Advanced Biostatistics (SSACAB) programme] and the UK government. The views expressed in this publication are those of the authors and not necessarily those of AAS, NEPAD Agency, Wellcome Trust, or the UK government.

Notes on contributors
Alexander K. Muoka is a PhD student in the School of Mathematics, Statistics and Computer Science at the University of KwaZulu-Natal, South Africa. He is an assistant lecturer in the Department of Mathematics, Statistics and Physical Sciences at Taita Taveta University, Kenya. He has research interests in covariate measurement error modeling, multivariate analysis, among others.
Henry G. Mwambi is a Professor of Statistics in the School of Mathematics, Statistics and Computer Science at the University of KwaZulu-Natal, South Africa. Henry has vast experience in modeling and analysis of biological and health outcome data including survival data, missing data, among others.
George O. Agogo is a biostatistician at the Centers for Disease Control and Prevention, Kenya. He has research interests in mixed modeling, covariate measurement error modeling, epidemiology, analysis of survival data, among others.
Oscar O. Ngesa is a Senior Lecturer in the Department of Mathematics, Statistics and Physical Sciences at the Taita Taveta University, Kenya. He has research interests in Spatial, Bayesian, food security and resilience analysis, among others.

PUBLIC INTEREST STATEMENT
Cross-sectional surveys are widely used to collect data from the population of interest. Features such as stratification and sampling weights form a critical part in designing surveys. Data collected from surveys are prone to measurement error. Measurement error in covariates/exposures is often ignored in statistical analyses, despite its adverse effects on the results. This study provides insights on how to model the association between an outcome and a covariate, while adjusting for measurement error in the covariate and addressing the within-cluster dependencies in clustered cross-sectional data. We hope that the findings of this study will positively impact how the public handles data from cross-sectional surveys. This will help present correct inferences from statistical analyses of survey data, advance science faster, and benefit society.

Data availability
SANHANES-1 data is made available to the researcher upon registration and agreeing to the terms and conditions of use in the Human Sciences Research Council (HSRC) website at http://curation.hsrc.ac.za/Dataset-565-datafiles.phtml.

Ethical statement
Ethics approval was granted by the HSRC Research Ethics Committee and was based on the Helsinki Declaration which has been adopted by the World Medical Association. Informed written consent or assent was obtained from each participant in the study. Participants were provided with written information on the study (including the background and objectives of the study) and their rights regarding participation and withdrawing at any time.