Rejoinder on “A selective review of statistical methods using calibration information from similar studies”


We thank Professor Jun Shao for organizing this interesting discussion. We also thank the six discussants for many insightful comments and suggestions. Assembling data from different sources has become a very popular topic. In our review paper, we mainly discussed integration methods for the case in which internal data and external data share a common distribution, even though the external data may lack information on some of the variables collected in the internal study. Indeed, the common distribution assumption is very strong in practical applications. Thanks to technological advances, data collection has become much easier, for example, through smartphones, satellite images, and so on. As such data are not obtained by well-designed probability sampling, they inevitably may not represent the general population, and a systematic bias is therefore likely. In the survey sampling literature, combining probability sampling data with non-probability sampling data has also become very popular (Chen et al., 2020). Without bias correction, most existing methods may produce biased results when the common distribution assumption is violated. One has to assess compatibility carefully before data integration.
Before responding to the common concern raised by the reviewers about heterogeneity among different studies, we first outline the possible distributional shifts in each data source. In the machine learning literature, the concepts of covariate shift, label shift, and transfer learning have been widely used (Quiñonero-Candela et al., 2009). We briefly describe these concepts in terms of joint and conditional densities.
Covariate shift: Let Y and X be, respectively, the outcome and a vector of covariates in statistical terminology, or a label variable and a vector of features in machine learning language. Suppose we have two data-sets: a training data-set {(x_i, y_i) : i = 1, ..., n_0} and a testing data-set {(x_i, y_i) : i = n_0 + 1, ..., n_0 + n_1}. Let p_k(x, y), p_k(y | x), p_k(x), q_k(x | y), and q_k(y) denote, respectively, the joint density function of (X, Y), the conditional density function of Y given X = x, the marginal density function of X, the conditional density function of X given Y = y, and the marginal density function of Y, where the subscripts k = 0 and k = 1 correspond to the training data and the testing data. The covariate-shift assumption is p_0(y | x) = p_1(y | x), under which the conditional density of Y given X remains unchanged from the training data to the testing data, but the marginal covariate distribution shifts. The most popular assumption on the shifted covariate distribution is p_1(x) = r(x) p_0(x), where r(x) is a known density ratio.
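As a minimal numerical sketch of this setup (the Gaussian densities, the shared outcome model, and the target mean below are our own illustrative assumptions, not taken from any paper under discussion), a testing-population mean can be estimated from training data by weighting each training observation with the known density ratio r(x):

```python
import numpy as np

rng = np.random.default_rng(0)

# Training covariates: X ~ N(0, 1); testing covariates: X ~ N(1, 1).
# The conditional law p(y | x) is shared (pure covariate shift):
# Y = 2X + N(0, 1) noise in both populations.
n0 = 100_000
x0 = rng.normal(0.0, 1.0, n0)
y0 = 2.0 * x0 + rng.normal(0.0, 1.0, n0)

def density_ratio(x):
    """r(x) = p1(x) / p0(x) for N(1, 1) over N(0, 1); assumed known."""
    return np.exp(x - 0.5)

# Self-normalized importance-weighted estimate of the testing-population
# outcome mean E_1(Y); its true value here is 2 * E_1(X) = 2.
w = density_ratio(x0)
est = float(np.sum(w * y0) / np.sum(w))
print(est)
```

The self-normalization (dividing by the sum of weights rather than by n_0) is a common variance-stabilizing choice when the ratio r(x) is only known up to a constant.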
Label shift: The popular label shift assumption in machine learning is q_0(x | y) = q_1(x | y). If the outcome Y is the status of a disease and X records symptoms, a problem of interest is to predict the disease status given the symptoms. In the machine learning literature, people may make the anti-causal assumption that the disease status causes the symptoms. Under the label shift assumption, the conditional density of X given Y does not change between studies; however, the marginal distribution of the disease status Y does change across studies.
Transfer learning: Let μ_k(x) = ∫ y p_k(y | x) dy be the conditional mean functions for k = 0, 1. Suppose a parametric model is assumed for μ_0(x) in the training data, say, μ_0(x) = μ_0(x; θ_0), where μ_0(x; θ) is known up to a finite-dimensional parameter θ. A popular assumption in transfer learning is μ_1(x) = g{μ_0(x; θ_1); η}, where g is a monotone function depending on an unknown parameter η. In the low-dimensional covariate case, one may assume θ_0 = θ_1. In the high-dimensional covariate case, on the other hand, one may assume that δ = θ_1 − θ_0 is 0 in most of its components. Then penalized likelihood methods can be applied to select the non-zero components.
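The high-dimensional idea can be sketched as follows (an illustrative lasso-on-the-difference toy under a linear model with simulated numbers, not the specific method of any paper cited here): holding the source estimate fixed, estimate the sparse difference δ = θ_1 − θ_0 by ℓ₁-penalized least squares.

```python
import numpy as np

rng = np.random.default_rng(1)

p, n1 = 10, 200
theta0_hat = np.ones(p)            # coefficient estimate from the source study
theta1 = theta0_hat.copy()
theta1[3] += 2.0                   # target coefficients differ in one component

X = rng.normal(size=(n1, p))
y = X @ theta1 + rng.normal(scale=0.5, size=n1)

# Lasso on delta = theta1 - theta0_hat via proximal gradient (ISTA):
# minimize ||r - X delta||^2 / (2 n1) + lam * ||delta||_1,
# where r is the target residual after the source fit.
r = y - X @ theta0_hat
lam = 0.1
step = n1 / np.linalg.norm(X, 2) ** 2   # 1/L for the smooth part
delta = np.zeros(p)
for _ in range(500):
    grad = -X.T @ (r - X @ delta) / n1
    z = delta - step * grad
    delta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold

theta1_hat = theta0_hat + delta
print(np.round(delta, 2))  # mostly zeros, with a large entry at index 3
```

The penalty shrinks all components of δ toward zero, so only the genuinely shifted coefficients are updated away from the source estimate.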

Response to Professor Lawless
We would like to thank Professor Lawless for his insightful comments. We fully agree with his view on testing compatibility before combining internal and external data.
Suppose the internal data follow (Y, X, Z) ∼ f(y | x, z) × g(x, z) and external summarized information derived from (Y_e, X_e) is available. Since Z_e is not observed, we may not be able to test whether the full joint distributions agree, even if the individual-level (Y_e, X_e) are available. The best we can do is to test whether the joint distributions of (X, Y) in the internal and external data are the same, provided both are available. If a small portion of complete external data (Y_e, X_e, Z_e) is also available, it certainly becomes possible to test the distributional agreement between the two sources.
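When individual-level (Y_e, X_e) are available, a crude version of such a compatibility check can be sketched as follows (a toy marginal two-sample Kolmogorov–Smirnov test with a permutation p-value; a joint or classifier-based two-sample test would be preferable in practice, and all simulated numbers are our own):

```python
import numpy as np

rng = np.random.default_rng(2)

def ks_stat(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max ECDF gap."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

def perm_pvalue(a, b, n_perm=500):
    """Permutation p-value for the KS statistic under exchangeability."""
    obs = ks_stat(a, b)
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        count += ks_stat(perm[:len(a)], perm[len(a):]) >= obs
    return (count + 1) / (n_perm + 1)

# Internal covariate X ~ N(0, 1); external covariate X_e ~ N(0.5, 1).
n = 400
x_int = rng.normal(0.0, 1.0, n)
x_ext = rng.normal(0.5, 1.0, n)
p = perm_pvalue(x_int, x_ext)
print(p)  # small p-value: the covariate distributions disagree
```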
The most popular approach in meta-analysis is to estimate the mean treatment effect over different but similar studies. The basic assumption is that the true mean is the same across studies; to reflect possible discrepancies among studies, an unexplained variation is allowed within each study. In general, raw data from each study are unavailable and only summarized information is given. Meta-analysis was later extended to the regression case, for example, Y_ij = β_j^⊤ X_ij + γ^⊤ Z_ij + ε_ij, where Y_ij is the i-th outcome in the j-th study, i = 1, 2, ..., n; j = 1, 2, ..., K. Again, to allow for variation in each study, one may assume β_j ∼ N(β, Σ). According to our understanding (which can easily be wrong), even if the covariate distributions of (X_ij, Z_ij) are quite different among studies, the regression coefficients β and γ can be estimated without any problem. On the other hand, if one is interested in a marginal parameter such as the mean μ_j = E(Y_ij), then a simple combination of the μ̂_j (the sample versions of μ_j) is meaningless since the μ_j vary across studies.
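The random-effects idea above, applied to study-level summaries, can be illustrated with the classical DerSimonian–Laird estimator (the study estimates and standard errors below are purely illustrative):

```python
import numpy as np

def dersimonian_laird(est, se):
    """Random-effects meta-analysis of study-level summaries.

    est, se: per-study effect estimates and standard errors.
    Returns (pooled estimate, its standard error, tau^2), where tau^2
    is the DerSimonian-Laird estimate of the between-study variance.
    """
    est, var = np.asarray(est, float), np.asarray(se, float) ** 2
    w = 1.0 / var                              # fixed-effect weights
    fe = np.sum(w * est) / np.sum(w)
    q = np.sum(w * (est - fe) ** 2)            # Cochran's Q statistic
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(est) - 1)) / c)
    w_re = 1.0 / (var + tau2)                  # random-effects weights
    pooled = np.sum(w_re * est) / np.sum(w_re)
    return pooled, float(np.sqrt(1.0 / np.sum(w_re))), tau2

# Toy summaries from K = 4 studies (our own illustrative numbers).
pooled, se_pooled, tau2 = dersimonian_laird(
    est=[0.30, 0.10, 0.45, 0.25], se=[0.10, 0.12, 0.15, 0.08])
print(pooled, se_pooled, tau2)
```

Note that only the per-study estimates and standard errors are needed, matching the usual situation in which raw data are unavailable.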
Professor Lawless has further pointed out the possibility of combining information for the formulation of predictive models. By discovering new covariates, one may achieve substantial gains in predictive performance. Nevertheless, general methodological work in this area is not yet well developed. A recent work by Efron (2020) indicated that, in general, the prediction problem is easier than attribute estimation. Moreover, in the discussion of Professor Efron's paper, Xie and Zheng (2020) showed that one may obtain correct coverage for the prediction of a future response value even if the underlying model is incorrect, as long as the independent and identically distributed structure remains true. However, a correct model will produce a prediction interval with the narrowest width.

Response to Professor Han
Building on the early work of Sheng et al. (2021), Professor Han has suggested a calibration method for the covariate shift problem. Moreover, when the dimension of the covariates is large, Chen et al. (2021) and Zhai and Han (2022) used a penalized likelihood method to regularize the underlying parameters. The use of summarized information in high-dimensional parameter problems is definitely welcome.
Professor Han has also discussed a different approach to combining information. Let Y be a response variable, and let X and Z be two covariates, where both X and Z are available in the internal data but only X and Y are available in the external data. In essence, they (Taylor et al., 2022) assume μ{E(Y | X, Z)} = β_0 + β_x^⊤ X + β_z^⊤ Z for the internal data and μ{E(Y | X)} = θ_0 + θ_x^⊤ X for the external data, where μ(·) is a known link function, and β_0, β_x, β_z, θ_0 and θ_x are unknown parameters. Under the assumption that X and Z are independent, or at least uncorrelated, and that the covariate effects are close to 0, they show that, approximately, β_x = α θ_x, where α is an unknown scale parameter. Based on the external information θ̂*_x, they fit the model μ{E(Y | X, Z)} = β_0 + α (θ̂*_x)^⊤ X + β_z^⊤ Z. The information gain in the newly formed model comes from the fact that it involves only a scalar parameter α instead of the vector parameter β_x in the original model. On the other hand, the compatibility of these two models is hard to guarantee, and more systematic work is needed.
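With an identity link, this dimension-reduction idea amounts to letting X enter only through the single index (θ̂*_x)^⊤ X and estimating one scalar α; a minimal sketch with our own simulated numbers (illustrating the idea only, not the actual estimator of Taylor et al., 2022):

```python
import numpy as np

rng = np.random.default_rng(3)

# Internal data: p-dimensional X and scalar Z; the external study
# supplies only the reduced-model coefficient estimate theta_x_star.
n, p = 500, 5
X = rng.normal(size=(n, p))
Z = rng.normal(size=n)
beta_x = np.array([0.4, 0.2, -0.3, 0.1, 0.5])
y = 1.0 + X @ beta_x + 0.8 * Z + rng.normal(scale=0.5, size=n)

# Assume the external estimate is proportional to beta_x (beta_x = alpha
# * theta_x_star with alpha = 0.5 in this simulation).
theta_x_star = 2.0 * beta_x

# Calibrated design: intercept, the external single index, and Z.
# Only 3 parameters are fitted instead of p + 2 in the full model.
index = X @ theta_x_star
D = np.column_stack([np.ones(n), index, Z])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
alpha = float(coef[1])
print(alpha)  # close to 0.5
```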

Response to Professors Zhou and Song
In addition to echoing the same message about heterogeneity among different studies and data batches, Professors Zhou and Song have laid out many challenging issues in fusing different data sources, including the setting where the number of data batches tends to infinity, with either one round of communication or unlimited rounds of communication as the number of batches increases. Indeed, once the data size gets very large, bias becomes critical since the variance becomes almost negligible. Moreover, Professors Zhou and Song have given many useful references from the machine learning literature on borrowing information while accounting for individualized heterogeneity. It is indispensable to develop new and optimal algorithms for large data and high-dimensional problems. The three future directions outlined by Professors Zhou and Song are important and welcome.
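The one-round-communication setting can be illustrated by the simplest divide-and-conquer scheme, one-shot averaging of local estimates (a toy sketch with simulated homogeneous batches; the references given by Professors Zhou and Song develop far more refined algorithms, in particular for heterogeneous batches):

```python
import numpy as np

rng = np.random.default_rng(4)

# K data batches, each fitting a local least-squares estimate once and
# sending only that estimate to the center (one round of communication).
K, n_k, p = 20, 200, 3
beta = np.array([1.0, -2.0, 0.5])   # common true coefficients

local_estimates = []
for _ in range(K):
    X = rng.normal(size=(n_k, p))
    y = X @ beta + rng.normal(size=n_k)
    b_k, *_ = np.linalg.lstsq(X, y, rcond=None)  # fitted locally
    local_estimates.append(b_k)

# The center simply averages the K local estimates.
beta_avg = np.mean(local_estimates, axis=0)
print(beta_avg)
```

For homogeneous batches this one-shot average is nearly as efficient as pooling all the raw data; the hard problems arise precisely when the batches are heterogeneous, as the discussants emphasize.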

Response to Professor Ning
Whether conventional statistical methods that borrow information from similar studies work depends on the result of testing whether the internal data and the summarized external information are compatible. Professor Ning has advocated a systematic way to do so, using a principle similar to transfer learning. To accomplish this, Chen et al. (2021) used the penalized likelihood method. More research in this direction is welcome. Of course, user-friendly software urgently needs to be developed.

Response to Professor Chen
We sincerely thank Professor Chen for raising three crucial technical problems concerning empirical likelihood for estimating equations. Qin & Lawless (1994) assumed that the estimating function g(x; θ) is smooth enough that the usual Taylor series approximation applies. With non-smooth estimating equations, the profile empirical likelihood function becomes a zigzag function and the Taylor series approximation fails, making it challenging to establish the limiting distribution of the maximum empirical likelihood estimator (MELE). With the help of advanced empirical process theory, sufficient smoothness of the expected estimating function E{g(X; θ)} is enough to guarantee the standard limiting distributions of the MELE and the empirical likelihood ratio (Molanes Lopez et al., 2009). As for the global consistency of the MELE, although it is of great importance to the theoretical completeness of empirical likelihood, this fundamental property has not yet been rigorously established in the literature. We appreciate that Professor Chen has outlined a proof of global consistency.
The last issue concerns the efficiency of the nonparametric maximum likelihood estimator (MLE). Many thanks to Professor Chen for pointing out our mistake in Section 3.4: the claim ∇_θ log{h(x, θ_0, 0)} = 0 does not hold. With a slight modification, the conclusion there remains true. As Professor Chen suggested, we replace f(x, θ) by the true density function of X, say f_0(x). Define an enlarged parametric density function h(x; θ, η), which clearly includes f_0(x) as a special case. In addition, we require that η = η(θ) satisfy ∫ g(x; θ) h{x; θ, η(θ)} dx = 0 and η(θ_0) = 0, where θ_0 is the true value of θ. In contrast to Back and Brown (1992), our construction of the enlarged density is not from the exponential family.
Define h̃(x; θ) = h{x; θ, η(θ)}. Given n observations X_1, ..., X_n from f_0(x), the log-likelihood function is ℓ(θ) = Σ_{i=1}^n log h̃(X_i; θ). Let θ̂ denote the MLE under the parametric model h̃(x; θ). Then, under certain regularity conditions, √n(θ̂ − θ_0) converges in distribution to N{0, I^{-1}(θ_0)}, where I(θ_0) = E[∇_θ log h̃(X; θ_0){∇_θ log h̃(X; θ_0)}^⊤] and E takes expectation with respect to f_0(x). By equality (3), this result, and tedious algebra, we find that I^{-1}(θ_0) = V, where V = (D^⊤ Ω^{-1} D)^{-1}, with D = E{∂g(X; θ_0)/∂θ^⊤} and Ω = E{g(X; θ_0) g^⊤(X; θ_0)}, is the asymptotic variance of the MELE of Qin & Lawless (1994). Consequently, under the parametric model h̃(x; θ), the MLE of θ has the asymptotic variance V. Thus, under only the general estimating equation model E{g(X; θ)} = 0, the best estimator of θ must have an asymptotic variance at least as large as V. Because the MELE of θ in Qin & Lawless (1994) has asymptotic variance V, we conclude that it achieves the semiparametric information lower bound, as claimed in Section 3.3 of our review paper.

Conclusion
A simple integration method may produce biased results in the presence of distributional shifts. When assembling information from different data sources, one has to understand the data-generating process and, accordingly, make judicious choices among different modelling methods. More importantly, characterizing the selection bias plays an extremely important role in data fusion.

Disclosure statement
No potential conflict of interest was reported by the author(s).