Empirical likelihood inference and goodness-of-fit test for logistic regression model under two-phase case-control sampling

ABSTRACT Due to cost-effectiveness and high efficiency, two-phase case-control sampling has been widely used in epidemiology studies. We develop a semi-parametric empirical likelihood approach to two-phase case-control data under the logistic regression model. We show that the maximum empirical likelihood estimator has an asymptotically normal distribution, and the empirical likelihood ratio follows an asymptotically central chi-square distribution. We find that the maximum empirical likelihood estimator is equal to Breslow and Holubkov (1997)'s maximum likelihood estimator. Even so, the limiting distribution of the likelihood ratio, likelihood-ratio-based interval, and test are all new. Furthermore, we construct new Kolmogorov–Smirnov type goodness-of-fit tests to test the validation of the underlying logistic regression model. Our simulation results and a real application show that the likelihood-ratio-based interval and test have certain merits over the Wald-type counterparts and that the proposed goodness-of-fit test is valid.


Introduction
Case-control studies have been conducted extensively in epidemiology and medical research, especially for rare diseases like cancer, as an easy and quick way of comparing treatments, or investigating the causes of disease. In a case-control study, the possible covariate information associated with diseases is collected for diseased and non-diseased individuals, and logistic regression models are usually used to model the relationship between the binary disease status and the covariate (Breslow & Day, 1980;Farewell, 1979;Prentice & Pyke, 1979). However, epidemiological and medical studies often require the collection of information on a large number of variables, including lifestyle, occupation, and environmental conditions. Certain variables are collected easily such as disease status and age in the social security system, but certain variables require a lot of cost, such as lifestyle. The difficulty can be overcome by employing a two-phase, two-stage or double sampling (Breslow et al., 2009;Neyman, 1938;Schaid et al., 2013;Thomas et al., 2013), where in the first phase, a relatively large number of individuals are sampled from a target population and information on variables that are easier to measure is collected. Together with a case-control sampling in the first phase, this leads to two-phase case-control sampling (Walker, 1982;White, 1982). At the first phase, cases and controls are sampled from a general population, Supplemental data for this article can be accessed here. https://doi.org/10. 1080/24754269.2021.1946373 and some stratifying information, such as a crude measure of exposure (e.g. disease or non-disease) is obtained. At the second phase, subsamples are selected within strata defined by disease status and by other stratifying information, and more refined information on exposure and other covariates is obtained for the subsample.
There have been many estimation methods dealing with logistic regression analysis of two-phase sampling data by making careful use of the information in the two sampling phases. Popular methods include the pseudo likelihood method (Breslow & Cain, 1988;Schill et al., 1993), the inverse probability weighted estimating method (Flanders & Greenland, 1991;Saegusa & Wellner, 2013), which takes the underlying sampling plan into account, and the maximum likelihood estimation method (Breslow & Holubkov, 1997;Lawless et al., 1999;Qin et al., 2015;Scott & Wild, 1997;Zhou et al., 2011). Among them, the methods developed by Lawless et al. (1999), Zhou et al. (2011), and Saegusa and Wellner (2013) are designed for general two-phase prospective data. As disclosed by Prentice and Pyke (1979), given usual case-control data, inferences for the odds ratio parameters based on prospective and retrospective likelihoods are equivalent. Therefore statistical methods designed for prospective data can usually be employed to make inference for retrospectively collected case-control data by taking them as if they were prospectively collected. Breslow and Holubkov (1997, p. 452) proved that this equivalence still holds for two-phase case-control data.
The existing works suggested to construct confidence intervals or conduct hypothesis testing for the odds ratio parameters based on the asymptotic normality of their proposed point estimators. This necessitates a consistent estimator for the accompanying asymptotic variance, which may have a very complicated form. In contrast, likelihood-ratio confidence intervals or tests are generally preferable as they are free of variance estimation. In addition, all the aforementioned methods for two-phase case-control data analysis are built on the logistic regression model assumption, whose misspecification may have a detrimental or invalid effect on the subsequent analysis. Therefore it is necessary to check whether the logistic model is valid or not for two-phase case-control data.
Logistic regression models are commonly used in analysing binary data which arise in studying the relationship between diseases and environment or genetic characteristics. See for example, Day (1980), Farewell (1979), and Prentice and Pyke (1979). Anderson (1979) noticed that the prospective logistic regression model is equivalent to the two sample exponential tilting model where one sample is corresponding to the covariate distribution for non-diseased group and the other one is the diseased group. With the standard case-control data, Qin and Zhang (1997) and Qin and Zhang (2005) proposed a Kolmogorov-Smirnov type statistic to test the validation of the logistic link function, where they implicitly used the empirical likelihood method (Owen, 1988) to estimate the underlying parameters. Qin (1998) found that the retrospective likelihood is equivalent to the empirical likelihood based on a density ratio model (Anderson, 1979(Anderson, , 1972. See Chen and Liu (2013), Cai et al. (2017), Diao et al. (2012), and Luo and Tsai (2012) for more about the density ratio model based on empirical likelihood.
Since the seminal work by Owen (1988), the empirical likelihood has become a popular nonparametric tool in statistical and economic literatures (Kitamura, 2006;Owen, 2001), as it has many nice properties paralleling to parametric likelihood. For example, empirical likelihood confidence regions are Bartlett correctable (DiCiccio et al., 1991;Liu & Chen, 2010), range preserving, transformation respecting, and do not require estimation of variance. It has the advantage of allowing the data to 'speak for itself' more and is more robust to slight mis-specification than parametric likelihood. Empirical likelihood has also found many applications in surveying sampling problems. Chen and Qin (1993) showed that empirical likelihood can effectively use auxiliary information. Chen et al. (2002) developed an elegant algorithm to implement the empirical likelihood method for finite-sample sampling problems. Zhao and Wu (2019) and Wu and Thompson (2020) provided a comprehensive review on the developments of empirical likelihood methods for complex surveys.
In this paper, we extend the empirical likelihood method for standard case-control data to two-phase case-control data. We find that the proposed maximum empirical likelihood point estimator is equal to Breslow and Holubkov (1997)'s maximum likelihood estimator under the logistic regression model. Unlike Breslow and Holubkov, (1997), we show that the empirical likelihood ratio statistic follows a limiting central chisquare distribution with a known degree of freedom. A remarkable advantage of the empirical likelihoodratio test over the existing asymptotic-normality-based tests is that it is free of variance estimation. This makes it more convenient to construct empirical likelihood confidence regions or empirical likelihood-ratio tests, to test, for example, whether a covariate has a significant effect on disease occurrence. Furthermore, we construct new Kolmogorov-Smirnov type goodness-offit tests to test the validation of the logistic regression model.
The rest of this paper is organised as follows. Section 2 introduces the data and model assumptions, presents empirical likelihood, goodness-of-fit tests based on the density ratio model, and investigates their large-sample properties. Simulation results and a real-data analysis are provided in Sections 3 and 4, respectively. Finally, Section 5 contains some discussions. For clarity, all technical proofs are relegated to the Supplementary Material.

Data and model
Let D denote the disease status with D = 1 for a disease (case) and 0 for a non-disease (control), and X and Z denote two covariate vectors, where Z takes only finitely many values: z 1 , z 2 , . . . , z J . Suppose the probability of disease occurrence in the population of interest follows a linear logistic regression model (1) where stands for transpose. Here and in what follows, we use pr to denote probability density function with respect to the counting measure for a discrete random variable, and the Lebesgue measure for an absolutely continuous random variable. We collect data from the population under study by two-phase case-control sampling. At the first phase, fixed number n 0 of controls and n 1 of cases (sample 1) are drawn, independent of X and Z. Let N ij denote the observed number of individuals with (D = i, Z = z j ) for i = 0, 1 and j = 1, 2, . . . , J. Then for each disease category, (N i1 , N i2 , . . . , N iJ ) is a random vector following a multinomial distribution with parameter n i and pr(z j | D = i) (j = 1, 2, . . . , J). At the second phase, within each Z-stratum of cases and controls in sample 1, a subsample is randomly selected, and exact or complete covariate measurements of the subsample are obtained (sample 2). Let M ij 's denote the (random or non-random) sample sizes of the 2 × J validation strata. We observe a random sample {X ijk : k = 1, 2, . . . , M ij } for each pair (i, j) with i = 0, 1 and j = 1, 2, . . . , J. Conditioning on {M ij } and {N ij }, all X ijk are assumed to be independent and X ij1 , X ij2 , . . . , X ijM ij are assumed to be identically distributed with probability density pr(x | D = i, Z = z j ).
Given {M ij } and {N ij }, the likelihood based on sample 2 is uninformative with respect to β. Therefore, the likelihood based on the two-phase case-control data is proportional to (2) which is exactly the likelihood in Equation (2) of Breslow and Holubkov (1997) for discrete covariate variables. This is the foundation of our empirical likelihood method. To proceed, we need to investigate pr(z j |D = i), and pr(x ijk |D = i, z j ) in Equation (2) and study the relationship between them.
Let π = pr(D = 1). By Bayes's formula, we have It follows that where we have used Equation (1). This implies that (3) and (4) gives In summary, we have expressed pr(z j |D = i) and pr(x ijk | D = i, z j ) in Equation (2) as functions of pr(x | D = 0, z j ) and some finite dimensional parameters.

Semi-parametric empirical likelihood
Putting the expressions in Equations (4) and (5) into Equation (2) and taking logarithm, we have a loglikelihood where N ·j = N 0j + N 1j . As Z takes only finite values, let q j = pr(z j |D = 0). In the principle of empirical likelihood, we model the conditional distributions of X given D = 0 and Z = z j by discrete distributions with support on the observations. Let p ijk = pr(x ijk | D = 0, Z = z j ). According to Equations (4) and (5), the feasible q j and p ijk satisfy (6) With these preparations, the proposed semi-parametric empirical log-likelihood is We denote N ·· = J j=1 N ·j = n and θ = (β , γ , η ) with η = (η 1 , . . . , η J ) . Inferences for (θ , α) are more conveniently made based on their profile likelihood, which is the maximum of˜ with respect to q j and p ijk under the constraints in Equation (6). By the Lagrange multiplier method, the maximum is achieved at where This immediately leads to the profile empirical loglikelihood function of (θ , α), does not depend on (θ , α). We propose to estimate (θ , α) by the maximum likelihood estimator (MLE), (θ,α) = arg max θ ,α (θ , α). The parameter α is merely a normalised parameter and generally of no importance. We may further profile α out and make inference for θ through (θ ) = max α (θ , α). Here, we abuse the notation (·), whose meaning is clear from the context. It is seen that the MLE of θ based on (θ ) is stillθ .
Lemma 2.1: Our maximum empirical likelihood estimator for (β, γ ) is identical to the maximum likelihood estimator of Breslow and Holubkov (1997) when their method takes the stratum variable into account.
Lemma 2.1 indicates that numerically our point estimator is equal to Breslow and Holubkov (1997)'s MLE. Even so, the density ratio model based on empirical likelihood framework for two-phase sampling data is new in the literature, and as far as we know, likelihoodratio-based inferences under this setting have not been investigated yet in the literature. Breslow and Holubkov (1997)'s statistical inferences (interval estimation and hypothesis testing) for the unknown parameters were based on the asymptotic normality of the MLE.

Asymptotics
In this section, we establish the limiting distributions of the MLEθ and the semi-parametric likelihood-ratio statistic. First of all, we introduce some conditions on the sampling plan and the population under study.
Condition (C1) guarantees that the sizes of the subsamples in all the strata are comparable, and they are also comparable with the case and control samples in the first phase. Otherwise, the strata with negligible sample sizes is simply discarded from our theoretical analysis.
Under Condition (C2), the moment generating functions of pr(x | D = 0, Z = z j ), namely exp(x β) × pr(x | D = 0, Z = z j ) dx, are well defined in a neighbourhood of the origin. Therefore in the stratum with D = 0 and Z = z j , the covariate X has all finiteorder moments. If model (1) is true, it follows from Equation (5) The finiteness of exp(x β)pr(x | D = 0, Z = z j ) dx in a neighbourhood of β 0 implies that the moment generating functions of pr(x | D = 1, Z = z j ) are also finite in a neighbourhood of the origin. Consequently, conditioning on D = 0 and Z = z j , the covariate X also has all finite-order moments.
Theorem 2.1: Suppose that model (1) and Conditions (C1) and (C2) all hold. As n → ∞, if V n converges to a nonsingular matrix V * , then (I) √ n(θ − θ 0 ) converges in distribution to a multivariate normal distribution with mean zero and variance-covariance matrix V * ; (II) the semi-parametric likelihood ratio 2{l(θ) − l(θ 0 )} has a limiting χ 2 K distribution, where K is the dimension of θ.
The proof of Theorem 2.1 also implies that the likelihood ratio statistic for any subvector of θ also follows an asymptotic chi-square distribution with a known degree of freedom. A key and interesting problem in case-control study is to check whether some or all of the covariates X and Z have significant effects on the disease occurrence D, or construct confidence intervals for their coefficients. Under the framework of the proposed semi-parametric empirical likelihood, we propose to construct confidence intervals and hypothesis tests based on the likelihood ratio function calibrated by its limiting chi-square distribution. Theorem 2.1 theoretically guarantees that for θ or any of its subvectors, our likelihood-ratio confidence intervals have asymptotically correct coverage probabilities, and our likelihood ratio tests have asymptotically correct type I errors.

Goodness-of-fit test
The previous nice properties of the proposed semiparametric likelihood method are achieved under the assumption of model (1). Possible misspecifications of this model pose a validation risk on the semiparametric likelihood-based inference. Therefore, it is necessary for checking the logistic regression model. We achieve this purpose by checking models (4) and (5) simultaneously, since roughly speaking the validation of the assumption of model (1) is equivalent to that of both models (4) and (5).
Similar to Qin and Zhang (1997), we construct Kolmogrov-Smirnov type tests for models (4) and (5). Let q j andp ijk be the fitted probability weights obtained by plugging the MLE (θ ,α) in Equation (7) andq ij = N ij /N i· . A measure of the departure from the assumption of model (4) Theorem 2.2: Under the same conditions as Theorem 2.1, 10j and 11j have a jointly normal limiting distribution with mean zero.
We consider testing the goodness-of-fit of model (5). We are using F ij (x) to denote the distribution function of X ijk given D = i and Z = z j , i = 0, 1 and j = 1, 2, . . . , J. With the fitted valuesp ijk , the empirical likelihood (EL) estimators of F ij (x) arê where the inequality x ijk ≤ x holds element-wise. Let F ij (x) denote the empirical distribution corresponding to the subsample {x ij1 , x ij2 , . . . , x ijM ij } for fixed i and j. A measure of the departure from the assumption of model (5)  An overall measure of departures from the assumption of models (4) and (5) where 10j , 11j , 20j , and 21j are all defined in Equations (8) and (9). The limiting distribution of is too complicated to be conveniently used in practice. We use a bootstrap procedure (Efron, 1979) to approximate it or the p-value. Before introducing the bootstrap procedure, we briefly review the original two-phase case-control data. The data in the first phase consists of n 0 controls (D = 0) and n 1 cases (D = 1). There are N ij data in the strata with D = i and Z = z j , for i = 0, 1 and j = 1, 2, . . . , J. Fix i = 0 or 1, the vector (N i1 , N i2 , . . . , N iJ ) follows a multinomial distribution with parameters n i and {pr(z j | D = i) : j = 1, 2, . . . , J}. For each pair (i, j), the data in the second phase are M ij independent and identically distributed observations X ij1 , X ij2 , . . . , X ijM ij from pr(x | D = i, Z = z j ). If M ij are non-random integers, they are pre-specified. Otherwise, they are often determined by sampling proportions, say r ij ∈ (0, 1), and then M ij = N ij r ij . We use q ij and F ij (x) to denote pr(Z = z j | D = i) and pr(x | D = i, Z = z j ), respectively. And we have shown that q 1j = q 0j exp(α 0 + γ 0 z j − η 0 ) under model (4), and Our bootstrap procedure to approximate the p-value of the goodness-of-fit test for model (1) based on is as follows.

t) under model (1). (c) Calculate the test statistic defined in Equation (10).
(2) Generate a bootstrap sample based on the original sample.
(a) Generate n 0 z-values from the discrete distribution defined by P(Z = z j ) =q 0j , and n 1 z-values from the discrete distribution defined by P(Z = z j ) =q 1j . We suppose that there are N * ij observations with D = i and Z = z j . Then the bootstrap sample in the first phase is S * 1 = {N * ij }. (b) At the strata with D = i and Z = z j , draw M ij (the non-random case) or M ij = N * ij r ij (the random case) observations fromF ij (x). The resulting observations constitute a bootstrap sample in the second phase, i.e.

Simulation study
By simulations, we compare our method with Breslow and Holubkov (1997)'s method from two aspects: interval estimation and hypothesis testing. We also investigate the finite-sample performance of the proposed goodness-of-fit test for model (1).

Interval estimation and hypothesis testing about θ
Our proposed interval estimators and hypothesis tests about θ are both based on the semi-parametric empirical likelihood-ratio function. In contrast, the counterparts of Breslow and Holubkov (1997) are both based on Wald's method and their maximum likelihood estimators, which are implemented with the R package missreg3 in R version 2.15.3. We use ELR and Wald to denote our and their methods, respectively. We generate two-phase case-control data from model (1), with x = (x 1 , x 2 , x 3 ) following the trivariate standard normal distribution, and z = [s], where s ∼ U(0, 2) and [s] denotes the nearest integer to s. The true parameter value of α * is set to −4 in all cases; and the rest model parameters (γ , β 1 , β 2 , β 3 ) are to be determined. We set n 0 = n 1 = 200, n 0 = n 1 = 100, and the number of simulations to 2000. We first compare the ELR and Wald methods from interval estimation. We construct confidence intervals for γ , (γ , β 1 ), (γ , β 1 , β 2 ), and (γ , β 1 , β 2 , β 3 ) at confidence levels 90%, 95%, and 99%. Two groups of parameter values are considered: (γ , β 1 , β 2 , β 3 ) = (0, 0, 1, 2) and (1, 0, 1, 2). The simulated coverage probabilities of the ELR and Wald intervals are tabulated in Tables 1 and 2. It is seen that both intervals have very accurate coverage probabilities. The ELR interval has slight under-coverage whereas the Wald interval has overcoverage when the total sample size is as small as n 0 = n 1 = 100. The under-coverage of the ELR interval is at most 2%; however, the over-coverage of the Wald interval can be as large as 4.5%, in the case of (γ , β 1 , β 2 ) at the 90% confidence level. Both of their coverage accuracies improve as the sample size increases to 200.
Next, we study the finite-sample performance of the ELR and Wald tests. We consider the hypothesis testing problems in Examples 3.1 and 3.2, respectively, which are concerned with the same four parameter combinations. The effect of the strata variable Z is set to zero in the null hypotheses of Example 3.1 and nonzero in those of Example 3.2, respectively. The simulated rejection rates of the ELR and Wald tests in all the cases are displayed in Figures 1 and 2. The rejection rates are simulated type I errors when k = 0 and are simulated powers when k > 0. It is seen that both tests have well controlled type I errors. The ELR test is uniformly more powerful than the Wald test in all cases except case (B4), where the two tests have very close powers. The power gain of the ELR test over the Wald test is as large as 10%, for example, in case (B1).
Overall, compared with the Wald method, the ELR interval has comparable accuracy while the ELR tests are often more powerful. It is worth mentioning that a clear advantage of the ELR test over the Wald test is that it does not need a variance estimation. The price is numerical optimisation, which is not an issue in the current era of high-speed computer.

Goodness-of-fit test for logistic regression model
In this subsection, we investigate the finite-sample performance of the proposed goodness-of-fit test for the logistic regression model (1); its null distribution is approximated by the bootstrap procedure in Section 2.4. No competitors are taken into account because to the best of our knowledge this problem has never been studied in the context of two-phase case-control data.
Similar to Example 3.3, model (1) is true in Example 3.4 when k = 0 or σ 2 2 = 1 and is violated otherwise. Table 3 presents the simulated rejection rates of the proposed goodness-of-fit test at the 5% significance level for Examples 3.3 and 3.4. Again the results corresponding to k = 0 are type I errors and others are powers. The proposed test has type I errors 6.1% and 5.1%, both of which are well controlled under the 5% significance level. As k increases from 1 to 4, the linear logistic model is violated more and more severely, and the proposed test has increasing powers to nearly 100%. This provides evidence for the consistency and validation of the proposed test for testing goodness-of-fit of model (1).

Real-data analysis
In this section, we illustrate the proposed semiparametric empirical likelihood method by analysing a simulated two-phase data set constructed by the actual National Wilms Tumor Study Group (NWTSG). See D' Angio et al. (1989), Breslow and Chatterjee (1999), and Green et al. (1998) for more description about the data. A problem of interest is to investigate whether there exists an association between treatment outcomes and tumour histology in 4028 children, who were diagnosed with the embryonal cancer of the kidney, known as Wilms tumour. The outcome variable of interest in this study is the relapse, which takes 1 (the patient's condition has deteriorated) and 0 (the patient's condition has improved) values. The covariates of interest include institution histology or IH (0 if favourable, 1 unfavourable), central histology or CH (0 if favourable, 1 unfavourable), stage ((1,0,0) if stage-II, (0,1,0) if stage-III, and (0,0,1) if stage-IV), and age (in months). Two types of histology measurements are available in the study. First, according to histology, the classification of tumours as favourable and unfavourable is based on the pathologist at the hospital where the children were admitted. As the study data came from many different hospitals, institutional histology may be prone to errors due to the subjective judgments of different pathologists. Thus, the NWTGS re-evaluated histology using a central pathologist recruited for the entire study which was called central histology, the second measurement for histology available in the study. The covariate IH has no prognostic value once the account has been taken into central histology. Even so, we take it as a stratum variable in two-phase case-control sampling. The histologic diagnosis results are tabulated in Table 4. We take the data in Table 4 as if they were prospective with a prevalence of 14.1%. In the first phase, we randomly choose N 1 cases from relapsed population and N 0 controls from non-relapsed population, where N 1 = 500 and N 0 = 500. Then we classified the N 1 cases and N 0 controls according to their IH condition. In the second phase, we randomly choose M ij = N ij /3 (i = 0, 1; j = 1, 2) observations in each subpopulation.
Before applying model (1) and our empirical likelihood method to analyse the data, we use the proposed goodness-of-fit test to check the validation of the linear logistic regression model. The resulting p-value of 0.71 provides no evidence against this model at 5% significance level.
Wilms tumours have no clear cause, but there are some potential factors that affect the risk. Besides stage, age may be an important prognostic factor for Wilms tumour. Figure 3 displays the histograms of age at all the four combinations of outcome and stratum variables. We see that the four condition distributions of age are quite different from each other, which intuitively Figure 3. Histograms of Age in different combinations of outcome (Row 1, relapses or cases; Row 2, non-relapses or controls) and the stratum variable (IH; Column 1, unfavourable; Column 2, favourable). implies that age is probably associated with the outcome. Formally the ELR (p-value of 0.0014) and Wald (p-value of 0.0020) tests both provide strong evidence for the association of age and outcome at the 5% significance level, although the evidence from the ELR test is stronger. We also test whether stage and IH are associated with the outcome. For stage, the ELR test gives a positive answer (p-value of 0.0498) while the Wald test gives a negative answer (p-value of 0.0534). Both tests give negative answers (their p-values are 0.2082 and 0.1871, respectively) for IH and give positive answers (their p-values are 0 and 0.001, respectively) for CH. The point and interval estimators of the coefficients of all the covariates are presented in Table 5. The coefficient of age, 0.1316, is positive, indicating that older people are more likely to relapse than younger people. The coefficient of CH is significantly nonzero while no evidence supports that of IH is nonzero. This result is consistent with that IH has no prognostic value once account has been taken into central histology. The interval estimators confirm the above hypothesis testing results: namely age and CH are important factors to relapse.

Discussion
This paper develops an empirical likelihood approach to two-phase case-control data. We require that the covariate Z takes finite different values because it is regarded as a stratification variable. Two issues about the covariate or strata are worth discussing. First, if Z is continuous, we need to transform it to a discrete variable U taking finite different values. Then we may proceed with U in place of Z, although there may be information loss in doing so. Alternatively, we may take U as a stratification variable, and take the raw variable Z as a subvector of X in model (1), so that the effect of Z on the disease can still be studied. Second, the performance of the proposed method depends on the approximation accuracy of its large-sample properties and may be undermined if some sample sizes of the 2J strata are too small. A large number of strata may make some strata have very small sizes. To avoid this problem, we recommend stratifying the data such that there are at least 5 observations in each strata and the number of strata in each disease status is small, say 5. Strata with too small sizes should be merged to be large strata so that the asymptotic normality of the proposed estimator and the limiting chisquare distribution of the proposed likelihood ratio have acceptable accuracy.