Kernel regression utilizing heterogeneous datasets

Data analysis in modern scientific research and practice has shifted from analysing a single dataset to coupling several datasets. We propose and study a kernel regression method that can handle the challenge of heterogeneous populations. It greatly extends the constrained kernel regression of [Dai, C.-S., & Shao, J. (2023). Kernel regression utilizing external information as constraints. Statistica Sinica, 33, in press], which requires a homogeneous population across datasets. The asymptotic normality of the proposed estimators is established under some conditions, and simulation results are presented to confirm our theory and to quantify the improvements from datasets with heterogeneous populations.


Introduction
With advanced technologies in data collection and storage, in modern statistical analyses we have not only a primary random sample from a population of interest, which results in a dataset referred to as the internal dataset, but also some independent external datasets from sources such as past investigations and publicly available datasets. In this paper, we consider nonparametric kernel regression (Bierens, 1987; Wand & Jones, 1994; Wasserman, 2006) between a univariate response Y and a covariate vector U from a sampled subject, using the internal dataset with help from independent external datasets. Specifically, we consider kernel estimation of

μ_1(u) = E(Y | U = u, D = 1),  (1)

the conditional expectation (regression function) of Y given U = u under the internal data population, where D = 1 indicates the internal population and u is a fixed point in 𝒰, the range of U. The indicator D can be either random or deterministic. The subscript 1 in μ_1(u) emphasizes that it is for the internal data population (D = 1), which may be different from μ(u) = E(Y | U = u), a mixture of quantities from the internal and external data populations.
When external datasets also have measurements of Y and U, we may simply combine the internal and external datasets if the populations for internal and external data are identical (homogeneous). However, heterogeneity typically exists among the populations of different datasets, especially when there are multiple external datasets collected in different ways and/or different time periods. In Section 2, we propose a method to handle heterogeneity among different populations and derive a kernel regression more efficient than the one using internal data alone. The result is also a crucial building block for the more complicated case in Section 3, where external datasets contain fewer measured covariates, as described next.
In applications, it often occurs that an external dataset has measured Y and X from each subject, where X is a part of the vector U, i.e., some components of U are not measured due to high measurement cost, the progress of technology, and/or scientific relevance. With some components of U unmeasured, the external dataset cannot be directly used to estimate μ_1(u) in (1), since conditioning on the entire U is involved. To solve this problem, Dai and Shao (2023) propose a two-step kernel regression using external information as a constraint to improve kernel regression based on internal data alone, following the idea of using constraints in Chatterjee et al. (2016) and H. Zhang et al. (2020). However, these three cited papers mainly assume that the internal and external datasets share the same population, which may be unrealistic. The challenge in dealing with heterogeneity among different populations is similar to the difficulty in handling nonignorable missing data if the unmeasured components of U are treated as missing data, although in missing data problems we usually want to estimate μ(u) = E(Y | U = u) = μ_1(u) in (1).
In Section 3, we develop a methodology to handle population heterogeneity for internal and external datasets, which extends the procedure in Dai and Shao (2023) to heterogeneous populations and greatly widens its application scope.
Under each scenario, we derive asymptotic normality in Section 4 for the proposed kernel estimators and obtain explicit expressions for the asymptotic variances, which is important for large-sample inference. Some simulation results are presented in Section 5 to compare the finite-sample performance of several estimators. Discussions of extensions and of handling high-dimensional covariates are given in Section 6. All technical details are in the Appendix.

Efficient kernel estimation by combining datasets
The internal dataset contains observations (Y_i, U_i), i = 1, ..., n, independent and identically distributed (iid) from P_1, the internal population of (Y, U), where Y is the response and U is a p-dimensional covariate vector associated with Y. We are interested in the estimation of the conditional expectation μ_1(u) in (1). The standard kernel regression estimator of μ_1(u) based on the internal dataset alone is

μ̂_1(u) = Σ_{i=1}^n Y_i κ_b(U_i − u) / Σ_{i=1}^n κ_b(U_i − u),  (2)

where κ_b(a) = b^{−p} κ(a/b), κ(·) is a given kernel function on 𝒰 (the range of U), and b > 0 is a bandwidth depending on n. We assume that U is standardized so that the same bandwidth b is used for every component of U in kernel regression. Because of the well-known curse of dimensionality for kernel-type methods, we focus on a low dimension p not varying with n. A discussion of handling a large-dimensional U is given in Section 6. We consider the case with one external dataset, independent of the internal dataset. Extension to multiple external datasets is straightforward and discussed in Section 6.
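For concreteness, the standard estimator in (2) can be sketched as follows. This is an illustrative Nadaraya-Watson implementation with a product Gaussian kernel (the paper's κ is generic), not the authors' code.

```python
import numpy as np

def nw_estimate(u, U, Y, b):
    """Nadaraya-Watson estimate of mu_1(u) = E(Y | U = u) from one dataset.

    u : (p,) evaluation point, U : (n, p) covariates, Y : (n,) responses,
    b : common bandwidth for all (standardized) components of U.
    """
    diff = (U - u) / b                           # scaled differences (U_i - u)/b
    w = np.exp(-0.5 * np.sum(diff**2, axis=1))   # product Gaussian kernel weights
    return np.sum(w * Y) / np.sum(w)             # weighted average of responses
```

With U standardized as assumed in the text, a single bandwidth b serves all p components.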
In this section we consider the situation where the external dataset contains iid observations (Y i , U i ), i = n + 1, . . ., N, from P 0 , the external population of (Y, U).

Combining data from homogeneous populations
If we assume that the two populations P_1 and P_0 are identical, then we can simply combine the two datasets to obtain the kernel estimator μ̂_1^{E1}(u) in (3), which is obviously more efficient than μ̂_1(u) in (2) since the sample size is increased from n to N > n. The estimator μ̂_1^{E1}(u) in (3), however, is not correct (i.e., it is biased) when the populations P_1 and P_0 differ.
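Under homogeneity, the combined estimator in (3) is simply the same kernel regression applied to the pooled sample. A minimal sketch with a Gaussian kernel on synthetic homogeneous data (all names and sizes here are illustrative):

```python
import numpy as np

def nw(u, U, Y, b):
    """Nadaraya-Watson estimate with a product Gaussian kernel."""
    w = np.exp(-0.5 * np.sum(((U - u) / b) ** 2, axis=1))
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(0)
# internal (n = 300) and external (N - n = 900) samples from the SAME population
U_int = rng.normal(size=(300, 2))
Y_int = U_int[:, 0] + rng.normal(scale=0.5, size=300)
U_ext = rng.normal(size=(900, 2))
Y_ext = U_ext[:, 0] + rng.normal(scale=0.5, size=900)

u0 = np.zeros(2)
est_internal = nw(u0, U_int, Y_int, b=0.4)                  # estimator (2)
est_pooled = nw(u0, np.vstack([U_int, U_ext]),              # pooled estimator (3)
                np.concatenate([Y_int, Y_ext]), b=0.4)
```

Pooling is valid only because P_1 = P_0 in this sketch; Section 2.2 treats the case where pooling is biased.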

Combining data from heterogeneous populations
We now derive a kernel estimator that uses both datasets and is asymptotically correct regardless of whether P_1 and P_0 are the same. Let f(y | u, D) be the conditional density of Y given U = u and D = 1 or 0 (for the internal or external population). The resulting kernel estimator is given in (5). Note that we use internal data: applying kernel estimation, we obtain the estimator, where the two kernels have dimensions p + 1 and p with respective bandwidths. The estimator in (5) is asymptotically valid under some regularity conditions on the kernels and bandwidths, summarized in Theorem 4.1 of Section 4.

Combining data from heterogeneous populations with additional information
If additional information exists, then the approach in Section 2.2 can be improved. Assume that the internal and external datasets are formed according to a random binary indicator D, where Y_i and U_i are observed internal data when D_i = 1, Y_i and U_i are observed external data when D_i = 0, and N is still the known total sample size for internal and external data. In this situation, the internal and external sample sizes are n = Σ_{i=1}^N D_i and N − n, respectively, both of which are random. In most applications, the assumption of a random D is not restrictive. From the identity in (7), a further improvement can be made if the following semi-parametric model (8) holds, where α(·) is an unspecified unknown function and γ is an unknown parameter. From (7)-(8), when γ = 0, the estimator in (3) is correct. Under (9) with γ ≠ 0, we just need to derive an estimator γ̂ of γ and apply kernel estimation to estimate E(e^{γY} | U = u, D = 1) as a function of u. Note that we do not need to estimate the unspecified function α(·) in (8), which is a nice feature of the semi-parametric model (8).
We now derive an estimator γ̂. Applying (7)-(8) to (4), we obtain an expression in which the second and third equalities follow from (8) and the last equation follows from the definition below. For every real number t, define the criterion in (10). Its estimator by kernel regression is given in (11), where κ̃ is a kernel and b̃ is a bandwidth. Then, we estimate γ by the minimizer of (11), motivated by the fact that the objective function for minimization in (11) approximates a population criterion which, for any t, is minimized at t = γ in view of (9).
In applications, we need to choose bandwidths given the sample sizes n and N − n. We can apply k-fold cross-validation as described in Györfi et al. (2002). Requirements on the rates of the bandwidths are described in the theorems in Section 4.
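The k-fold cross-validation just mentioned can be sketched as follows; the kernel, grid, and fold count are illustrative choices, not the paper's specification.

```python
import numpy as np

def nw(u, U, Y, b):
    w = np.exp(-0.5 * np.sum(((U - u) / b) ** 2, axis=1))
    s = w.sum()
    return (w @ Y) / s if s > 0 else Y.mean()    # fall back to the global mean

def cv_bandwidth(U, Y, grid, k=10, seed=0):
    """Select a bandwidth from `grid` by k-fold cross-validated squared error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Y))
    folds = np.array_split(idx, k)
    scores = []
    for b in grid:
        sse = 0.0
        for fold in folds:
            train = np.setdiff1d(idx, fold)      # hold out one fold at a time
            preds = np.array([nw(U[i], U[train], Y[train], b) for i in fold])
            sse += np.sum((preds - Y[fold]) ** 2)
        scores.append(sse / len(Y))
    return grid[int(np.argmin(scores))]
```

The selected bandwidth minimizes the out-of-fold squared prediction error, a data-driven surrogate for the MISE criterion used in Section 5.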

Constrained kernel regression with unmeasured covariates
We still consider the case with one external dataset, independent of the internal dataset.In this section, the external dataset contains iid observations (Y i , X i ), i = n + 1, . . ., N, from the external population P 0 , where X is a q-dimensional sub-vector of U with q < p.
Since the external dataset has only X, not the entire U, we cannot apply the method in Section 2 when q < p. Instead, we consider kernel regression using external information as a constraint. First, we consider the estimation of the n-dimensional vector μ_1 = (μ_1(U_1), ..., μ_1(U_n))′, where A′ denotes the transpose of a vector or matrix A throughout. Note that the standard kernel regression (2) estimates μ_1 as in (13); taking partial derivatives with respect to the μ_i's, we obtain the estimating equations. We improve μ̂_1 by the constrained minimization in (14), where g(x) = (1, x′)′, l in (14) is a bandwidth that may be different from b in (2) or (13), and ĥ_1^{Ej}(x) is the kernel estimator of h_1(x) = E(Y | X = x, D = 1) using the jth of the three methods described in Section 2, j = 1, 2, 3. Specifically, ĥ_1^{E1}(x) is given by (3), ĥ_1^{E2}(x) is given by (5), and ĥ_1^{E3}(x) is given by (12), with u and U replaced by x and X, respectively, and kernels and bandwidths suitably adjusted since the dimensions of U and X differ. Note that ĥ_1^{Ej} can be computed because both the internal and external datasets have measured X_i's.
It turns out that μ̂_1^{Cj} in (14) has an explicit form. The constraint (15) is an empirical analog of a theoretical constraint relating μ_1 and h_1, which is why μ̂_1^{Cj} in (14) is more accurate than the unconstrained μ̂_1 in (13).
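The explicit form referenced above is the familiar projection of an unconstrained estimate onto a set of linear constraints. As a generic illustration of that structure only (Euclidean projection; the paper's version involves kernel weights, so this is a sketch of the idea, not the estimator in (14)):

```python
import numpy as np

def constrain(mu_hat, A, c):
    """Project mu_hat onto {mu : A @ mu = c} in the Euclidean norm.

    Closed form: mu_C = mu_hat - A.T @ inv(A @ A.T) @ (A @ mu_hat - c),
    i.e., the unconstrained estimate minus its violation of the constraint.
    """
    lam = np.linalg.solve(A @ A.T, A @ mu_hat - c)   # Lagrange multipliers
    return mu_hat - A.T @ lam
```

Applying this with rows of A built from g(X_i) and c from ĥ_1^{Ej} mimics the structure of (14)-(15): the constrained solution agrees with the unconstrained one except for a correction in the directions spanned by the constraints.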
To obtain an improved estimator of the entire regression function μ_1(·) in (1), not just the function at u = U_i, i = 1, ..., n, we apply the standard kernel regression with the response vector (Y_1, ..., Y_n)′ replaced by μ̂_1^{Cj} in (14), which results in three estimators of μ_1(u), where b is the same bandwidth as in (2). The first estimator μ̂_1^{C1} is simple, but can be incorrect when the populations P_1 and P_0 are different. The asymptotic validity of μ̂_1^{C2} and μ̂_1^{C3} is established in the next section.

Asymptotic normality
We now establish the asymptotic normality of μ̂_1^{Ej}(u) and μ̂_1^{Cj}(u) for a fixed u, as the sample size of the internal dataset increases to infinity. All technical proofs are given in the Appendix.
The first result is about μ̂_1^{E2}(u) in (5). The result is also applicable to μ̂_1^{E1}(u) in (3) with the added condition that P_1 = P_0.
Theorem 4.1: Assume the following conditions.
(B1) The densities f_1(u) and f_0(u) for U, respectively under the internal and external populations, have continuous and bounded first- and second-order partial derivatives.
(Lemma 8.10 in Newey & McFadden, 1994), where o_p(1) denotes a term tending to 0 in probability. Result (18) implies that the estimation of the ratio f(Y | u, D = 1)/f(Y | u, D = 0) is asymptotically negligible. Note that both the squared bias B_a^2(u) and the variance V_a(u) in (17) are decreasing in the limit a = lim_{n→∞} (N − n)/n, a quantity reflecting how much external data we have. In the extreme case of a = 0, i.e., the size of the external dataset is negligible compared with the size of the internal dataset, result (17) reduces to the well-known asymptotic normality for the standard kernel estimator μ̂_1(u) in (2) (Bierens, 1987). In the other extreme case of a = ∞, on the other hand, B_a(u) = V_a(u) = 0 and, hence, μ̂_1^{E2}(u) has a convergence rate faster than 1/√(nb^p), the convergence rate of the standard kernel estimator μ̂_1(u). The next result is about μ̂_1^{C2}(u) in (16) as described in Section 3.
(C1) The range 𝒰 of U is a compact set in the p-dimensional Euclidean space and f_1(u) is bounded away from infinity and zero on 𝒰; f_1(u) and f_0(u) have continuous and bounded first- and second-order partial derivatives.

and the bandwidth b h for h
Then, for any fixed u ∈ 𝒰 and μ̂_1^{C2}(u) in (16), where

is bounded away from zero, and it is dth-order continuously differentiable with bounded partial derivatives on an open set containing the support of
, and 𝒰 is the range of U. Also, there exists an almost everywhere continuous 8-dimensional function ν(U) with

Then, as the total sample size of internal and external datasets
Conditions (D1)-(D5) are technical assumptions discussed in Lemmas 8.11 and 8.12 of Newey and McFadden (1994). As discussed by Newey and McFadden (1994), the condition that κ̃ has a bounded support can be relaxed, as it is imposed only for a simple proof.
Corollary 4.1: Suppose that (8) holds for the binary random D indicating internal and external data.

The performance of μ̂_1^{Cj} given by (16)

We first present simulation results to examine and compare the performance of the standard kernel estimator μ̂_1 in (2) without using external information and our proposed estimator (16) with three variations, μ̂_1^{C1}, μ̂_1^{C2}, and μ̂_1^{C3}, as described at the end of Section 3. We consider U = (X, Z)′ with univariate covariates X and Z, where Z is unmeasured in the external dataset (p = 2 and q = 1). The covariates are generated in two ways: (i) normal covariates: (X, Z)′ is bivariate normal with means 0, variances 1, and correlation 0.5; (ii) bounded covariates: constructed from W_1, W_2, and W_3, identically distributed as uniform on [−1, 1], and B uniform on [0, 1], with W_1, W_2, W_3, and B independent.
Conditioned on (X, Z)′, the response Y is normal with mean μ(X, Z) and variance 1, where μ(X, Z) follows one of four models. Note that all four models are nonlinear in (X, Z)′; (M1)-(M2) are additive models, while (M3)-(M4) are nonadditive.
A total of N = 1,200 data points are generated from the population of (Y, X, Z) as previously described. A data point is treated as internal or external according to a random binary D with a conditional probability depending on parameters γ and γ_0, where γ = 0 or 1/2, and γ_0 = 1 or −1.5. Under the setting γ_0 = 1 or −1.5, the unconditional P(D = 1) ≈ n/N is around 13% or 50%.
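The data-generating scheme above can be sketched as follows. The mean function and the selection probability below are illustrative stand-ins (the paper's exact models (M1)-(M4) and selection form are not reproduced here), so the marginal P(D = 1) need not match the 13%/50% figures.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1200

# (i) normal covariates: (X, Z) bivariate normal, means 0, variances 1, corr 0.5
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
XZ = rng.multivariate_normal(np.zeros(2), cov, size=N)
X, Z = XZ[:, 0], XZ[:, 1]

# a nonlinear additive mean in the spirit of (M1)-(M2) (illustrative only)
Y = np.sin(X) + Z**2 + rng.normal(size=N)

# random binary D splitting the N points into internal (D=1) / external (D=0);
# an assumed logistic selection in Y with parameters (gamma0, gamma)
gamma0, gamma = -1.5, 0.5
p_int = 1.0 / (1.0 + np.exp(-(gamma0 + gamma * Y)))
D = rng.binomial(1, p_int)
n = D.sum()                       # random internal sample size, as in the text
```

Because D depends on Y when gamma differs from 0, the internal and external populations are heterogeneous, which is the setting the proposed estimators target.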
The simulation studies the performance of kernel estimators in terms of mean integrated squared error (MISE). The following measure is calculated by simulation with S replications, where {U_{s,t} : t = 1, ..., T} are test data for each simulation replication s, the simulation is repeated independently for s = 1, ..., S, and μ̂_1^* is one of μ̂_1, μ̂_1^{C1}, μ̂_1^{C2}, and μ̂_1^{C3}, independent of the test data. We consider two ways of generating the test data U_{s,t}. The first is to use T = 121 equally spaced fixed grid points on [−1, 1] × [−1, 1]. The second is to take a random sample of size T = 121 without replacement from the covariates U of the internal dataset, for each fixed s = 1, ..., S and independently across s.
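The MISE measure in (21) is a double average over replications and test points; a direct sketch (array names are illustrative):

```python
import numpy as np

def simulated_mise(fitted, truth):
    """Simulated MISE: the average over S replications of the average squared
    error over T test points, matching the structure of (21).

    fitted, truth : (S, T) arrays of mu*_1(U_{s,t}) and mu_1(U_{s,t}).
    """
    fitted = np.asarray(fitted, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # with equal T per replication, the grand mean equals the mean of per-s means
    return float(np.mean((fitted - truth) ** 2))
```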
To show the benefit of using external information, we calculate the improvement in efficiency, defined with the minimum taken over μ̂_1^{C1}, μ̂_1^{C2}, and μ̂_1^{C3}. In all cases, we use the Gaussian kernel. The bandwidths b and l affect the performance of kernel methods, and we consider two types of bandwidths in the simulation. The first is 'the best bandwidth': for each method, we evaluate the MISE over a pool of bandwidths and display the one with the minimal MISE. This shows the best we can achieve in terms of bandwidth, but it cannot be used in applications. The second is to select the bandwidth from a pool of bandwidths via 10-fold cross-validation (Györfi et al., 2002), which produces a decent bandwidth that can be applied to real data.
The simulated MISE values based on S = 200 replications are shown in Tables 1-4. Consider first the results in Tables 1-2. Since γ = 0, all three estimators, μ̂_1^{C1}, μ̂_1^{C2}, and μ̂_1^{C3}, are correct and more efficient than the standard estimator μ̂_1 in (2) without using external information. The estimator μ̂_1^{C1} is the best, as it uses the correct information that the populations are homogeneous (γ = 0) and is simpler than μ̂_1^{C2} and μ̂_1^{C3}. Next, the results in Tables 3-4 for γ = 1/2 indicate that the estimators μ̂_1^{C2} and μ̂_1^{C3}, which use a correct constraint, are better than the estimator μ̂_1^{C1} using an incorrect constraint or the estimator μ̂_1 without using external information. Since μ̂_1^{C3} uses more information, it is in general better than μ̂_1^{C2}. Furthermore, with an incorrect constraint, μ̂_1^{C1} can be much worse than μ̂_1 without using external information. Note: Simulation standard deviations of γ̂ for all cases are between 0.004 and 0.005.

The performance of μ̂_1^{Ej} given by (3), (5), or (12)

Under the same simulation setting as in Section 5.1, but with the covariate Z measured in both the internal and external datasets, we compare the performance of the three estimators μ̂_1^{E1}, μ̂_1^{E2}, and μ̂_1^{E3}, given by (3), (5), and (12), respectively, with the standard kernel estimator μ̂_1 in (2) without using external information. The mean integrated squared error (MISE) and improvement (IMP) are calculated using formulas (21) and (22), respectively, with μ̂_1^* being one of μ̂_1, μ̂_1^{E1}, μ̂_1^{E2}, and μ̂_1^{E3}. Tables 5-9 present the simulation results. The relative performance of μ̂_1^{E1}, μ̂_1^{E2}, μ̂_1^{E3}, and μ̂_1 follows the same pattern as that of μ̂_1^{C1}, μ̂_1^{C2}, μ̂_1^{C3}, and μ̂_1 in Section 5.1. The only difference is that the use of more external data (a smaller n/N) results in a better performance of μ̂_1^{E2} or μ̂_1^{E3} (or μ̂_1^{E1} when it is correct). This is consistent with our theoretical result, Theorem 4.1 in Section 4, which shows that both the squared bias B_a^2(u) and the variance V_a(u) in (17) are decreasing in the limit a = lim_{n→∞} (N − n)/n. On the other hand, the simulation results in Section 5.1 and Theorem 4.2 in Section 4 do not clearly indicate that using more external data produces better estimators. The main reason is that, when Z is not observed in the external dataset, the estimator μ̂_1^{Cj} relies more on internal data to recover the loss of Z from the external dataset in a complicated way.

The performance of μ̂_1^{Cj} given by (16) with q = 2

We re-consider the simulation in Section 5.1 but with the dimension of X being q = 2, i.e., U = (X_1, X_2, Z)′. We only consider normally distributed covariates with means 0, variances 1, and the correlations in (X_1, Z), (X_2, Z), and (X_1, X_2) being 0.5, 0.5, and 0.25, respectively. Given U, the response variable follows a normal model analogous to those in Section 5.1. The results are shown in Table 9. Compared with the results in Tables 1-4 for the case of q = 1, the MISEs in this case are larger because there are more covariates (q = 2). But the relative performances of the estimators are the same as those shown in Tables 1-4.

Discussion
The curse of dimensionality is a well-known problem for nonparametric methods. Thus, the proposed method in Section 2 is intended for a low-dimensional covariate U, i.e., small p. If p is not small, then we should reduce the dimension of U prior to applying the constrained kernel regression, or any kernel method. For example, consider a single index model assumption (K.-C. Li, 1991), i.e., μ_1(U) in (1) is assumed to depend on U only through η′U as in (23), where η is an unknown p-dimensional vector. The well-known SIR technique (K.-C. Li, 1991) can be applied to obtain a consistent and asymptotically normal estimator η̂ of η in (23). Once η is replaced by η̂, the kernel method can be applied with U replaced by the one-dimensional 'covariate' η̂′U. We can also apply other dimension reduction techniques developed under assumptions weaker than (23) (Cook & Weisberg, 1991; B. Li & Wang, 2007; Ma & Zhu, 2012; Y. Shao et al., 2007; Xia et al., 2002). We now turn to the dimension of X in the external dataset. When the dimension of X is high, we may consider the following approach. Instead of using the constraint (15), we use the component-wise constraints in (24), estimating h_1^{(k)}(x^{(k)}) = E(Y | X^{(k)} = x^{(k)}, D = 1) using the methods described in Section 2. More constraints are involved in (24), but the estimation only involves the one-dimensional X^{(k)}, k = 1, ..., q.
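The single-index route can be illustrated end to end. Here an ordinary least-squares direction stands in for SIR (for a single-index model with elliptically distributed covariates, the least-squares direction is proportional to η by the Li-Duan result); everything below is an illustrative sketch, not the paper's procedure.

```python
import numpy as np

def nw_1d(u, index, Y, b):
    """One-dimensional kernel regression on the estimated index."""
    w = np.exp(-0.5 * ((index - u) / b) ** 2)
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(2)
n, p = 1000, 5
U = rng.normal(size=(n, p))
eta = np.array([1.0, 0.5, 0.0, 0.0, 0.0])
eta /= np.linalg.norm(eta)
t = U @ eta
Y = t**3 + 0.1 * rng.normal(size=n)          # single-index model: mu_1(U) = (eta'U)^3

# stand-in for a SIR estimate of eta: the OLS direction, consistent up to scale here
beta = np.linalg.lstsq(U, Y - Y.mean(), rcond=None)[0]
eta_hat = beta / np.linalg.norm(beta)

# kernel regression on the one-dimensional 'covariate' eta_hat' U
pred_at_0 = nw_1d(0.0, U @ eta_hat, Y, b=0.2)
```

Replacing the 5-dimensional kernel regression with a 1-dimensional one on η̂′U is exactly the dimension-reduction step described in the text.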
The kernel κ we adopted in (2) and (16) is a second-order kernel, so the convergence rate of μ̂_1(u) is n^{−2/(4+p)}. An mth-order kernel with m > 2 as defined by Bierens (1987) may be used to achieve the convergence rate n^{−m/(2m+p)}. Alternatively, we may apply other nonparametric smoothing techniques, such as the local polynomial (Fan et al., 1997), to achieve the convergence rate n^{−m/(2m+p)} with m ≥ 2.
Our results can be extended to scenarios where several external datasets are available. Since each external source may provide different covariate variables, we may need to apply the component-wise constraints (24) by estimating h_1^{(k)} via combining all the external sources that collect the covariate X^{(k)}. If the populations of the external datasets are different, then we may have to apply a combination of the methods described in Section 2.

Appendix

where σ_1^2(·) is given in condition (C2), and the second and third equalities follow from the changes of variables u_2 − u_1 = lν and u − u_1 = bw, respectively. From the continuity of f_1(·) and σ_1^2(·), Var{S_1(u_1, ·)} converges to V_r(u). Therefore, by the theory for asymptotic normality of V-statistics (e.g., Theorem 3.16 in J. Shao, 2003), this proves the claim. Therefore, under the assumed condition that f_1 is bounded away from zero, Lemma 3 in Dai and Shao (2023) implies T_12 = o_p(1). Note that, because f_1 is bounded away from zero, Lemma 3 in Dai and Shao (2023) implies max_{j=1,...,n} |W_j(u) − g(u)′ g(X_j)| = o_p(1). Thus, T_13 = o_p(1). Consequently, T_1 has the same asymptotic distribution as √n V, the claimed result. From Lemma 4 in Dai and Shao (2023) and (C4), T_2 = √c A_1(u){1 + o_p(1)}. Note that the second equality follows from (A4) and Lemmas 3-4 in Dai and Shao (2023), and the last equality follows from Lemma 2 in Dai and Shao (2023) and the continuity of A_1(·). Also, the first equality follows from Lemma 3 in Dai and Shao (2023) and the law of large numbers, the second equality follows from Lemma 4 in Dai and Shao (2023), and the last equality follows from the law of large numbers. Similarly, the second equality follows from Lemma 3 in Dai and Shao (2023). Under (B1)-(B5) with U and p replaced by X and q, and (C5), Lemma 8.10 in Newey and McFadden (1994) implies that max_i |ĥ_1(X_i) − h_1(X_i)| = O_p(√(log(n)) n^{−2/(q+4)}),  (A3) which is o_p(1/√(nb^p)) = o_p(n^{−2/(p+4)}) and, hence, T_5 = o_p(1). From Lemma 3 in Dai and Shao (2023) and the Central Limit Theorem, combining these results, we obtain that T_2 + ··· + T_6 = B_r(u) + o_p(1). This completes the proof.
Step 1: Since γ is the unique minimizer of L(t), from Theorem 2.1 in Newey and McFadden (1994)

The ratio f(Y | u, D = 1)/f(Y | u, D = 0) links the internal and external populations, so that we can overcome the difficulty in utilizing the external data under heterogeneous populations. If we can construct an estimator f̂(y | u, D) of f(y | u, D) for every y, u, and D = 0 or 1, then we can modify the estimator in (3) by replacing every Y_i with i > n by a constructed response. We just need to estimate P(D = 1 | U = u, Y) and P(D = 1 | U = u) for every u, constructed using, for example, the nonparametric estimators in Fan et al. (1998) for a binary response. For each estimator, both internal and external data on (Y, U) and the indicator D are used.
f(y, u | D = 1) and f(y, u | D = 0) are mth-order continuously differentiable with bounded partial derivatives, and f_1(u) and f_0(u) are mth-order continuously differentiable with bounded partial derivatives. The functions f(y, u | D = 0) and f_1(u) are bounded away from zero. The bandwidths b and b̃ satisfy n b̃^{p+1}/log(n) → ∞ and nb^p/log(n) → ∞.
has bounded first- and second-order partial derivatives; and E(|Y|^s | U = u, D = 1) is bounded with s > 2 + p/2.
(C3) All kernel functions are positive, bounded, and Lipschitz continuous with mean zero and finite sixth moments.
(C4) a = lim_{n→∞} (N − n)/n > 0 and the bandwidths b in (2) and l in (14) satisfy b → 0, l → 0, l/b → r ∈ (0, ∞), nb^p → ∞, and nb^{4+p} → c ∈ [0, ∞), as n → ∞.
(C5) The densities f_{X0}(x) and f_{X1}(x) for X, respectively under the internal and external populations, are bounded away from zero. There exists a constant s > 4 such that E(|Y|^s | D = 1) is bounded, and E{g(X)g(X)′ | D = 1} is assumed to be positive definite without loss of generality.
The next result is about γ̂ in (11).
Theorem 4.3: Suppose that (8) holds for the binary random D indicating internal and external data. Assume also the following conditions.
(D1) The kernel κ̃ in (10) is Lipschitz continuous, satisfies ∫ κ̃(u) du = 1, has a bounded support, and has order d > max{(p + 4)/2, p}.
(D2) The bandwidth b̃ in (10) satisfies N b̃^{2q}/(log N)^2 → ∞ and N b̃^{2d} → 0 as the total sample size of internal and external datasets N → ∞, where d is given in (D1).
(D3) γ in (8) is an interior point of a compact domain and it is the unique solution to h_1 = a a, with W_t = (D e^{tY}, (1 − D) Y e^{−tY}, D, (1 − D), D Y e^{tY}, (1 − D) Y^2 e^{−tY}, D Y^2 e^{tY}, (1 − D) Y^3 e^{−tY})′, and f_U the density of U. Furthermore, there is a function τ(Y, D) with E{τ(Y, D)} bounded.
The remaining settings are the same as in Section 5.1. In calculating the MISE (21), we use only random U_{s,t} with T = 121, not fixed grid points.
Conditioned on U_1, ..., U_n, T_13 has mean 0 and variance Var(T_13 | U_1, ..., U_n) = O_p(b^p) = o_p(1),
Note: Simulation standard deviations of γ for all cases are between 0.005 and 0.006.
Note: Simulation standard deviations of γ for all cases are between 0.005 and 0.006.
Note: Simulation standard deviations of γ for all cases are between 0.005 and 0.007.
Note: Simulation standard deviations of γ for all cases are between 0.004 and 0.006.
Note: Simulation standard deviations of γ for all cases are between 0.005 and 0.007.