A discussion on “A selective review of statistical methods using calibration information from similar studies” by Qin, Liu and Li

We congratulate Qin, Liu and Li (QLL) on a thoughtful and much-needed review of many interesting methods for combining information from similar studies, and we appreciate the opportunity to contribute a discussion. QLL cover a variety of settings and methods. Building on their review, we briefly survey some additional relevant literature, focusing on methods that deal with population heterogeneity: in practice, different studies most likely sample from different populations, and whether information should be combined depends on how similar those populations are, among many other considerations. To keep the discussion focussed, we follow the setting in Section 5 of QLL, although most of these methods can be applied more broadly.
We adopt the notation in Section 5.1 of QLL with some variations. Let $(Y_i, X_i^T, Z_i^T)^T$, $i = 1, \ldots, n$, denote the individual data from a random sample of the internal study, where $Y$ is the response and $X$ and $Z$ are vectors of covariates. The model of interest is $f(Y \mid X, Z; \beta)$ for $f(Y \mid X, Z)$ with parameter $\beta$. The external study fitted a (possibly misspecified) model $f(Y \mid X; \theta)$ for $f^*(Y \mid X)$ with covariates $X$ alone and parameter $\theta$. Throughout this discussion we use a $*$-superscript to denote distributions, expectations and other quantities associated with the external study population. The external model fitting information is summarized by

\[ E^*\{h(Y, X; \theta^*)\} = 0, \tag{1} \]

where $h(Y, X; \theta)$ is the score function for $f(Y \mid X; \theta)$ and $\theta^*$ is the solution to the score equation based on the external study sample. Individual data from the external sample are not available. We assume the external sample size is very large, so that the uncertainty in $\theta^*$ is negligible compared to that in the internal study, i.e., Case I in QLL.

In Section 5, QLL give an excellent review of some methods and compare their asymptotic efficiency when the internal and external study populations are the same, based on both the empirical likelihood (EL) formulation (Owen, 1988; Qin & Lawless, 1994) and the constrained maximum likelihood (CML) formulation (Chatterjee et al., 2016; Qin, 2000). These two formulations are closely connected (Han & Lawless, 2016). For ease of discussion, we provide the CML formulation here, which is already covered in QLL. When $f(Y, X, Z) = f^*(Y, X, Z)$, (1) can be transformed into

\[ E\{\psi(X, Z; \beta)\} = 0, \tag{2} \]

where

\[ \psi(X, Z; \beta) = \int h(y, X; \theta^*)\, f(y \mid X, Z; \beta)\, dy. \tag{3} \]

The CML estimator $\hat\beta_{\mathrm{cml}}$ is defined through

\[ \max_{\beta,\, p_1, \ldots, p_n} \prod_{i=1}^n p_i\, f(Y_i \mid X_i, Z_i; \beta) \quad \text{subject to} \quad p_i \ge 0, \quad \sum_{i=1}^n p_i = 1, \quad \sum_{i=1}^n p_i\, \psi(X_i, Z_i; \beta) = 0. \tag{4} \]

When $f(Y \mid X, Z) = f^*(Y \mid X, Z)$ but $f(X, Z) \ne f^*(X, Z)$, (1) can instead be transformed into

\[ E^*\{\psi(X, Z; \beta)\} = 0, \tag{5} \]

where the expectation $E^*(\cdot)$ is under $f^*(X, Z)$, in contrast to the $E(\cdot)$ in (2), which is under $f(X, Z)$. In this case, although (2) no longer holds, there are ways to make use of the external study information summarized by (5). One way is to collect a small supplementary sample $(X_j^{*T}, Z_j^{*T})^T$, $j = 1, \ldots, n^*$, from the external study population (Chatterjee et al., 2016; Han & Lawless, 2019), and define the CML estimator through

\[ \max_{\beta} \prod_{i=1}^n f(Y_i \mid X_i, Z_i; \beta) \quad \text{subject to} \quad \frac{1}{n^*} \sum_{j=1}^{n^*} \psi(X_j^*, Z_j^*; \beta) = 0. \]

Another way is to assume a density ratio model $f^*(X, Z) = \exp(W^T \alpha) f(X, Z)$, where $W$ is a vector of functions of $(X, Z)$ and $\alpha$ is an unknown parameter (Sheng et al., 2021). This approach models the heterogeneity between $f^*(X, Z)$ and $f(X, Z)$ by $\exp(W^T \alpha)$, in the same spirit as the exponential tilting model that links the case distribution to the control distribution in Section 7.1 of QLL.
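As a concrete illustration of transformation (3), the following minimal Python sketch evaluates $\psi$ in closed form for a binary response. It assumes, purely for illustration, a logistic internal model with scalar covariates $x$ and $z$ and a reduced logistic external model in $x$ alone; the function names `expit`, `psi` and `mean_psi` are hypothetical and do not come from QLL or the cited papers. With binary $Y$, the integral in (3) is a two-term sum and reduces to the difference between the two fitted means multiplied by $(1, x)$.

```python
import math


def expit(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def psi(x, z, beta, theta_star):
    """Closed form of psi(x, z; beta) in (3) for binary Y.

    Assumed (illustrative) models:
      internal: logit P(Y=1 | x, z) = beta[0] + beta[1]*x + beta[2]*z
      external: logit P(Y=1 | x)    = theta_star[0] + theta_star[1]*x
    The external score is h(y, x; theta) = (y - expit(theta0 + theta1*x)) * (1, x),
    so summing h(y, x; theta*) f(y | x, z; beta) over y in {0, 1} gives
    (internal fitted mean - external fitted mean) * (1, x).
    """
    p_full = expit(beta[0] + beta[1] * x + beta[2] * z)
    p_reduced = expit(theta_star[0] + theta_star[1] * x)
    resid = p_full - p_reduced
    return (resid, resid * x)


def mean_psi(sample, beta, theta_star):
    """Average of psi over covariate pairs (x, z).

    Applied to internal covariates this is the sample analogue of E{psi} in (2);
    applied to a supplementary external sample it is the analogue of E*{psi} in (5).
    """
    n = len(sample)
    values = [psi(x, z, beta, theta_star) for x, z in sample]
    return tuple(sum(v[k] for v in values) / n for k in range(2))
```

For instance, `mean_psi([(0.5, 1.0), (-1.0, 0.3)], beta, theta_star)` evaluates the left-hand side of the constraint at a candidate `beta`; a constrained optimizer would then search over `beta` subject to this average being zero.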
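With the density ratio model, the CML estimator can be defined through

\[ \max_{\beta,\, \alpha,\, p_1, \ldots, p_n} \prod_{i=1}^n p_i\, f(Y_i \mid X_i, Z_i; \beta) \quad \text{subject to} \quad p_i \ge 0, \quad \sum_{i=1}^n p_i = 1, \quad \sum_{i=1}^n p_i \exp(W_i^T \alpha) = 1, \quad \sum_{i=1}^n p_i \exp(W_i^T \alpha)\, \psi(X_i, Z_i; \beta) = 0, \]

where $W_i$ denotes $W$ evaluated at $(X_i, Z_i)$. Here the last two constraints are based on the fact that $f^*(X, Z) = \exp(W^T \alpha) f(X, Z)$ is a density and on (5), respectively. Note that the dimension of $\alpha$ needs to be no larger than the dimension of $\psi$ plus one for $\alpha$ to be identifiable.

When $f(Y \mid X, Z) \ne f^*(Y \mid X, Z)$, applying transformation (3) to (1) leads to neither (2) nor (5) in general, regardless of whether $f(X, Z)$ is the same as $f^*(X, Z)$. In this case, the aforementioned CML estimators are biased. However, since the external study sample size is large, the reduction in variance from using the external summary information may still benefit the internal study parameter estimation in terms of mean squared error. Based on this bias-variance trade-off, Estes et al. (2018) proposed an empirical Bayes shrinkage estimator of the form

\[ \hat\beta_{\mathrm{eb}} = V_{\mathrm{mle}} (V_{\mathrm{mle}} + V_0)^{-1} \hat\beta_{\mathrm{cml}} + V_0 (V_{\mathrm{mle}} + V_0)^{-1} \hat\beta_{\mathrm{mle}}, \]

which is a weighted average of $\hat\beta_{\mathrm{mle}}$, the MLE using the internal study data alone, and $\hat\beta_{\mathrm{cml}}$ defined through (4).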
Here $V_{\mathrm{mle}}$ and $V_0$ are estimated variance matrices for $\hat\beta_{\mathrm{mle}}$ and for the prior normal distribution on $\beta$, respectively. This method shrinks the final estimate towards $\hat\beta_{\mathrm{mle}}$ in the presence of population heterogeneity and towards $\hat\beta_{\mathrm{cml}}$ otherwise. Gu et al. (2021) extended the idea in Estes et al. (2018) to the case of multiple external studies, and proposed an estimator that is a weighted average of the empirical Bayes estimators resulting from using each external study separately.
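The shrinkage behaviour described above can be sketched for a scalar parameter. The snippet below assumes the usual precision-weighted (posterior mean) form, in which the weight on the CML estimate is $V_{\mathrm{mle}} / (V_{\mathrm{mle}} + V_0)$; the function name `eb_shrinkage` and the way the prior variance `v0` is supplied are assumptions of this sketch, not a reproduction of the procedure in Estes et al. (2018).

```python
def eb_shrinkage(beta_mle, beta_cml, v_mle, v0):
    """Scalar precision-weighted average of the internal MLE and the CML estimate.

    v_mle: estimated variance of the internal-data MLE (must be positive).
    v0:    variance of the normal prior centred at the CML estimate (non-negative);
           how v0 is estimated is left to the caller (an assumption of this sketch).
    A large v0 (strong evidence of heterogeneity) pushes the result towards
    beta_mle; v0 = 0 returns beta_cml.
    """
    if v_mle <= 0 or v0 < 0:
        raise ValueError("v_mle must be positive and v0 non-negative")
    w_cml = v_mle / (v_mle + v0)  # weight placed on the CML estimate
    return w_cml * beta_cml + (1.0 - w_cml) * beta_mle
```

For example, `eb_shrinkage(1.0, 2.0, 1.0, 1.0)` returns `1.5`, the midpoint, because the two variances are equal. One simple plug-in choice, again only an assumption here, is to take `v0` as the squared difference between the two estimates, so that a larger discrepancy automatically moves the result towards the internal MLE.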
To deal with arbitrary population heterogeneity without introducing estimation bias when information is available from multiple external studies, Zhai and Han (2022) developed an estimation procedure that simultaneously selects the studies for which (2) holds and incorporates the corresponding information into the internal model fitting. Their method also applies under the current setting of only one external study. When (2) does not hold because of an unknown form of heterogeneity, some components of $E\{\psi(X, Z; \beta)\}$ may still be zero even if $E\{\psi(X, Z; \beta)\}$ is not a zero vector, and these components still contain useful information for improving the efficiency of the internal model estimation. This observation makes sense because, under certain specific forms of heterogeneity, the association between the same response and certain covariates may not differ much across populations.
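Some general examples of this observation are given in Zhai and Han (2022). Let $\gamma \equiv E\{\psi(X, Z; \beta)\}$; then the components of $\psi(X, Z; \beta)$ that correspond to the zero components of $\gamma$ should be selected to compute $\hat\beta_{\mathrm{cml}}$. To shrink the estimates of the zero components of $\gamma$ to exactly zero for information integration, under the current setting, the estimator in Zhai and Han (2022) is defined through

\[ \max_{\beta,\, \gamma,\, p_1, \ldots, p_n} \left\{ \sum_{i=1}^n \log f(Y_i \mid X_i, Z_i; \beta) + \sum_{i=1}^n \log p_i - \lambda_n \sum_{k} \frac{|\gamma_k|}{|\tilde\gamma_k|^{w}} \right\} \quad \text{subject to} \quad p_i \ge 0, \quad \sum_{i=1}^n p_i = 1, \quad \sum_{i=1}^n p_i \{\psi(X_i, Z_i; \beta) - \gamma\} = 0, \]

with an adaptive Lasso penalty (Zou, 2006) on $\gamma$ with tuning parameter $\lambda_n > 0$, where $\tilde\gamma_k$ is the $k$th component of $\tilde\gamma = n^{-1} \sum_{i=1}^n \psi(X_i, Z_i; \hat\beta_{\mathrm{mle}})$ and $w > 0$ is a user-specified number, typically taken to be 1 or 2. A similar idea, with a different penalty, has also been considered for integrating external summary information into survival analysis (Chen et al., 2021).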
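To see how adaptive Lasso weights of the form $1/|\tilde\gamma_k|^{w}$ can set small components exactly to zero while barely shrinking large ones, consider the following simplified sketch. It solves a separable surrogate problem, minimizing $\tfrac{1}{2}(\gamma_k - \tilde\gamma_k)^2 + \lambda |\gamma_k| / |\tilde\gamma_k|^{w}$ for each $k$, whose solution is a soft-thresholding rule. This surrogate is an assumption made for illustration only; it is not the constrained estimation procedure of Zhai and Han (2022), and the function names are hypothetical.

```python
import math


def adaptive_soft_threshold(gamma_tilde, lam, w=1.0):
    """Component-wise minimizer of 0.5*(g - gt)**2 + lam*|g| / |gt|**w.

    gamma_tilde: initial estimates (e.g. the sample mean of psi at the MLE).
    lam:         tuning parameter (> 0).
    w:           exponent of the adaptive weight, typically 1 or 2.
    Small |gt| gives a large threshold lam / |gt|**w, so the component is set
    to zero; large |gt| gives a small threshold and little shrinkage.
    """
    out = []
    for gt in gamma_tilde:
        if gt == 0.0:
            out.append(0.0)  # infinite weight: component stays at zero
            continue
        threshold = lam / abs(gt) ** w
        shrunk = max(abs(gt) - threshold, 0.0)
        out.append(math.copysign(shrunk, gt))
    return out


def selected_components(gamma_hat):
    """Indices whose estimate is exactly zero, i.e. the components of psi to use."""
    return [k for k, g in enumerate(gamma_hat) if g == 0.0]
```

In this surrogate, components whose initial estimate is close to zero are the ones retained as valid constraints, mirroring the selection idea described above.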
All of the aforementioned methods make use of the external summary information through transformation (3). Taylor et al. (2022) took a different approach when both $f(Y \mid X, Z; \beta)$ and $f(Y \mid X; \theta)$ are generalized linear models (GLMs), namely

\[ g\{E(Y \mid X, Z)\} = \beta_0 + X^T \beta_X + Z^T \beta_Z \tag{6} \]

and $l\{E(Y \mid X)\} = \theta_0 + X^T \theta_X$, with possibly different link functions $g(\cdot)$ and $l(\cdot)$ (note that the second GLM may be misspecified). Here the notation $E(\cdot)$ for expectation is generic and is used simply to present the form of the models. For ease of discussion, assume the covariates $Z$ have been orthogonalized to $X$, which can be done by taking $Z$ to be the vector of residuals from the least squares regression of each covariate in $Z$ on $X$ using the internal data. Taylor et al. (2022) showed that $\beta_X \approx c\,\theta_X$ for some unknown constant $c$ when both GLMs are fitted to the same infinitely large sample and when $\beta_X$, $\beta_Z$ and $\theta_X$ are all close to zero (in this sentence, with some abuse of notation, $\beta_X$, $\beta_Z$ and $\theta_X$ denote the values obtained after fitting both GLMs to the same infinitely large sample). See also Neuhaus and Jewell (1993). Based on this result, when $f(Y \mid X)$ and $f^*(Y \mid X)$ are similar in the sense that the relative effects of $X$ on $Y$ (but not necessarily the absolute magnitudes) are the same in the two populations, the $\theta_X^*$ produced by the external study can still be used to improve the internal estimation efficiency. With $Z$ orthogonalized to $X$, instead of fitting (6), Taylor et al. (2022) proposed to fit

\[ g\{E(Y \mid X, Z)\} = \beta_0 + \alpha\,(X^T \theta_X^*) + Z^T \beta_Z \]

with coefficients $(\beta_0, \alpha, \beta_Z^T)^T$ to the internal study data, which is equivalent to setting $\beta_X = \alpha\,\theta_X^*$.

In the presence of different study populations, a crucial question to ask before combining information is which population is of primary interest. Most of the methods reviewed in this discussion, if not all, explicitly or implicitly assume that the internal study population is of primary interest and that the external summary information is used for efficiency improvement without causing (too much) bias.
This is reasonable, for example, when the internal study has a clear target population and is based on a careful design with well-controlled sampling. In practice, with data usually collected through convenience sampling, the internal study sample may not be representative of the target population, or there may even be ambiguity about the target population itself. Therefore, caution is always needed when applying these methods to combine information in the presence of population heterogeneity, and further methodological development is needed.

Disclosure statement
No potential conflict of interest was reported by the author.