Joint modeling of an outcome variable and integrated omics datasets using GLM-PO2PLS

In many studies of human diseases, multiple omics datasets are measured. Typically, these datasets are analyzed one by one in relation to the disease, so the relationships between the omics levels are overlooked. Modeling the joint part of multiple omics datasets and its association with the disease outcome provides insight into the complex molecular basis of the disease. Several dimension reduction methods that jointly model multiple omics datasets, as well as two-stage approaches that model the omics and the outcome in separate steps, are available; holistic one-stage models for both omics and outcome are lacking. In this article, we propose a novel one-stage method that jointly models an outcome variable with omics datasets. We establish model identifiability and develop EM algorithms to obtain maximum likelihood estimators of the parameters for normally and Bernoulli distributed outcomes. Test statistics are proposed to infer the association between the outcome and the omics, and their asymptotic distributions are derived. Extensive simulation studies are conducted to evaluate the proposed model. The method is illustrated by modeling Down syndrome as the outcome and methylation and glycomics as the omics datasets. We show that our model provides more insight by jointly considering methylation and glycomics.

In the E step, the expectation of the complete data log-likelihood for one subject can be decomposed into factors that depend on distinct sets of parameters, as shown in (1); there, the conditioning on x, y, z and θ′ is dropped to simplify notation. Given the observed data for N subjects, the factorized conditional expectations in (1) are calculated as follows. Let (T, U, T⊥, U⊥) be the collection of row vectors (t, u, t⊥, u⊥) for the N subjects.
Here, the conditional expectations involve the first and second conditional moments of the vector (t, u, t⊥, u⊥) given x, y, z and θ′ for each subject. Since the complete data vector (x, y, z, t, u, t⊥, u⊥) for one subject follows a multivariate normal distribution with zero mean and known covariance matrix, the conditional distribution of (t, u, t⊥, u⊥) given (x, y, z) can be calculated explicitly following Lemma 3 in [3]:

(t, u, t⊥, u⊥ | x, y, z) ∼ N((x, y, z) Σ_ε⁻¹ Γ Σ_m, Σ_m).

In the M step, we set the partial derivatives of the conditional expectations in (2) to zero to obtain an update for each parameter.
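To make this concrete, the following is a minimal Python/NumPy sketch (ours, not the authors' implementation) of computing these conditional moments once the joint covariance matrix of (x, y, z, t, u, t⊥, u⊥) has been assembled; the index arguments select the observed and latent blocks:

    import numpy as np

    def gaussian_conditional(Sigma, obs_idx, lat_idx, obs_row):
        """Conditional moments of the latent block given the observed block
        for a zero-mean multivariate normal with joint covariance Sigma."""
        S_oo = Sigma[np.ix_(obs_idx, obs_idx)]   # covariance of (x, y, z)
        S_ol = Sigma[np.ix_(obs_idx, lat_idx)]   # cross-covariance with latent block
        S_ll = Sigma[np.ix_(lat_idx, lat_idx)]   # covariance of (t, u, t_perp, u_perp)
        A = np.linalg.solve(S_oo, S_ol)          # S_oo^{-1} S_ol
        cond_mean = obs_row @ A                  # conditional mean, a row vector
        cond_cov = S_ll - S_ol.T @ A             # conditional covariance Sigma_m
        return cond_mean, cond_cov

The second conditional moment required in the M step is then cond_cov + np.outer(cond_mean, cond_mean). Note that the conditional covariance does not depend on the observed values, so it needs to be computed only once per EM iteration.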
Regarding the first term Q_{a,b,σ_g²}, taking partial derivatives with respect to α = (a, b) yields an estimator resembling the usual maximum likelihood estimator for the regression coefficients in a linear regression model in which Z is regressed on E[(T, H)]. Taking the partial derivative with respect to σ_g² yields the well-known maximum likelihood estimator for the residual variance. Regarding Q_{W,W⊥,σ_e²}, which involves optimization over the semi-orthogonal loading matrices W and W⊥, Lagrange multipliers Λ_W and Λ_{W⊥} are introduced. The resulting objective function (3) involves both W and W⊥ and cannot be decoupled; we therefore adopt the same strategy as in [3], which performs sequential optimization [7]. First, (3) is optimized over W; the minimizer, given in (4), involves the operator orth(A) = JV⊤, with J and V the singular vectors of A. The last step is proven in [2]. Next, (3) is optimized over W⊥, keeping W fixed at its minimizer. In the same way, estimates for the semi-orthogonal loading matrices C and C⊥ are obtained.
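For concreteness, the orth operator used above admits a one-line implementation via the thin singular value decomposition (a minimal sketch; the function name follows the text):

    import numpy as np

    def orth(A):
        """orth(A) = J @ V.T with A = J D V.T the thin SVD of A; the result is
        the semi-orthogonal matrix closest to A in Frobenius norm."""
        J, _, Vt = np.linalg.svd(A, full_matrices=False)
        return J @ Vt

For a p × K input, the output W satisfies W⊤W = I_K.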
For matrices that are restricted to be diagonal, for example the inner regression matrix B, we set the off-diagonal elements to zero using the Hadamard product with an identity matrix. The EM updates at step k can then be written as in (5), starting with an initial guess.

S 1.2 An EM algorithm for the GLM-PO2PLS model with a Bernoulli distributed outcome

The log-likelihood for the GLM-PO2PLS model with a Bernoulli distributed outcome involves an integral with respect to (ν, ξ) = ((t, u), (t⊥, u⊥)) of dimension 2K + K_x + K_y. By integrating out ξ, the dimension of the integral can be reduced to 2K; the resulting log-likelihood ℓ(θ; x, y, z) and the conditional probability mass/density functions involved are given in (6). Note that the determinant and the inverse of the p × p matrix Σ_{x|t} (and of the q × q matrix Σ_{y|u}) are required in (6); calculating them directly is not feasible, even for moderate p (or q). Following the matrix determinant lemma [4] and the Woodbury matrix identity [6], we apply a transformation such that only the determinant and inverse of a K_x × K_x (or K_y × K_y) matrix are required. The transformation exploits the semi-orthogonality of the loading matrices. The determinant and inverse of Σ_{y|u} are calculated analogously.
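As an illustration, the following sketch computes the log-determinant and inverse under the low-rank-plus-spherical structure W⊥ Σ_{t⊥} W⊥⊤ + σ_e² I_p, which is our reading of Σ_{x|t} in (6); only K_x × K_x operations beyond matrix products are needed (function and variable names are ours):

    import numpy as np

    def logdet_and_inverse(W_perp, Sigma_tp, sigma2):
        """Log-determinant and inverse of S = W_perp @ Sigma_tp @ W_perp.T
        + sigma2 * I_p, with W_perp (p x K) semi-orthogonal, via the matrix
        determinant lemma and the Woodbury identity."""
        p, K = W_perp.shape
        # determinant lemma, using W_perp.T @ W_perp = I_K:
        # det(S) = sigma2^p * det(I_K + Sigma_tp / sigma2)
        logdet = p * np.log(sigma2) + np.linalg.slogdet(np.eye(K) + Sigma_tp / sigma2)[1]
        # Woodbury identity, again using semi-orthogonality:
        # S^{-1} = (I_p - W_perp @ M @ W_perp.T) / sigma2
        M = np.linalg.inv(sigma2 * np.linalg.inv(Sigma_tp) + np.eye(K))
        S_inv = (np.eye(p) - W_perp @ M @ W_perp.T) / sigma2
        return logdet, S_inv

For small p, the result can be checked against np.linalg.slogdet and np.linalg.inv applied to the full matrix.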
In the EM algorithm, we consider the partial complete data vector (x, y, z, ν). For each current estimate θ′, the algorithm optimizes, for each subject, the objective function in (7). Similar to (1), the conditional expectation in (7) can be decomposed into factors that depend on distinct sets of parameters, as in (8). The first term Q_{a_0,a,b} needs to be evaluated with numerical integration for each observed instance. The numerator and denominator for each subject in (9) can be approximated separately via (10), with the function φ being log p(z|ν) and 1, respectively.
Given the observed data for N subjects, the other terms in (8) are given by (11). The conditional expectations in (11) involve the first and second conditional moments of ν_i = (T_i, U_i) for each subject i; these integrals are numerically calculated with (10).
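The quadrature scheme behind (10) is not spelled out here; one simple way to implement such ratio approximations is self-normalized Monte Carlo over draws from the Gaussian conditional distribution of ν given (x, y), as in the following sketch (all names are ours, and a logistic model for p(z | ν) is assumed):

    import numpy as np

    def cond_expect(phi, z_i, mean_nu, cov_nu, beta, n_draws=2000, rng=None):
        """Approximate E[phi(nu) | x, y, z] as a ratio of Monte Carlo averages
        over nu ~ N(mean_nu, cov_nu): the numerator averages phi(nu) p(z|nu)
        and the denominator averages p(z|nu)."""
        rng = np.random.default_rng() if rng is None else rng
        nu = rng.multivariate_normal(mean_nu, cov_nu, size=n_draws)
        eta = beta[0] + nu @ beta[1:]              # linear predictor a_0 + (t, u)(a, b)'
        logw = z_i * eta - np.logaddexp(0, eta)    # Bernoulli log p(z_i | nu)
        w = np.exp(logw - logw.max())              # stabilize before normalizing
        return (w / w.sum()) @ phi(nu)

The first conditional moments follow with phi(nu) = nu; second moments can be accumulated analogously, e.g. via np.einsum('n,nj,nk->jk', w / w.sum(), nu, nu).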
In the M step, maximizing Q_{a_0,a,b} requires iteration. We propose a one-step gradient ascent strategy that updates β = (a_0, a, b) along the direction of the gradient; a step size that guarantees an increase of Q_β is found using the backtracking rule [1].
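A minimal sketch of this one-step update with Armijo-style backtracking (the constants and names are ours; beta and the gradient are NumPy vectors):

    def one_step_gradient(beta, Q, grad_Q, step0=1.0, shrink=0.5, c=1e-4):
        """Single gradient ascent step on Q_beta: shrink the step size until
        the sufficient-increase condition of the backtracking rule holds."""
        g = grad_Q(beta)
        step = step0
        while Q(beta + step * g) < Q(beta) + c * step * (g @ g):
            step *= shrink
            if step < 1e-12:      # no acceptable step found; keep the estimate
                return beta
        return beta + step * g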
To estimate the semi-orthogonal joint loading matrix W, we temporarily relax the orthogonality constraint and obtain an intermediate estimator Ŵ* by setting the partial derivative with respect to W to zero. To impose the orthogonality constraint, the orth operator in (4) is then applied.
The parameters W⊥, Σ_{t⊥} and σ_e² are estimated next. Since σ_e² I_p is a full-rank matrix, setting (12) to zero yields a relationship between these parameters; taking the trace of both sides, σ_e² can be estimated from it. Note that W⊥ and Σ_{t⊥} can be obtained by eigendecomposition of the real symmetric matrices involved. Here, power iteration is used to avoid processing a p × p matrix. The parameters in Q_{C,C⊥,σ_f²,Σ_{u⊥}} are estimated analogously.
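The following sketch illustrates the power-iteration idea: the p × p symmetric matrix is accessed only through matrix-vector products, e.g. v ↦ R⊤(Rv)/N for an N × p matrix R, so it is never formed explicitly (our illustration; all names are ours):

    import numpy as np

    def top_eigenpairs(matvec, p, K, n_iter=100, rng=None):
        """Leading K eigenpairs of a symmetric PSD p x p matrix available only
        through v -> matvec(v), via block power (orthogonal) iteration."""
        rng = np.random.default_rng() if rng is None else rng
        V, _ = np.linalg.qr(rng.standard_normal((p, K)))
        for _ in range(n_iter):
            AV = np.column_stack([matvec(V[:, k]) for k in range(K)])
            V, _ = np.linalg.qr(AV)
        # Rayleigh-Ritz step on the K x K projected matrix
        H = V.T @ np.column_stack([matvec(V[:, k]) for k in range(K)])
        evals, Q = np.linalg.eigh((H + H.T) / 2)
        order = np.argsort(evals)[::-1]
        return evals[order], V @ Q[:, order]

For instance, with an N × p residual matrix R, matvec = lambda v: R.T @ (R @ v) / N yields the leading eigenpairs without ever holding a p × p matrix in memory.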
Using the same notation as in (5), the EM algorithm updates the parameters in step k, starting with β = (a_0, a, b).

The asymptotic distribution
Recall the GLM-PO2PLS model with a normally distributed outcome and its parameter set θ; the associated log-likelihood ℓ and the covariance matrix Σ_θ of (x, y, z) are given in (13). We show here that, under certain regularity conditions, consistency of the estimator θ̂ and its asymptotic distribution N(θ, Π_θ) follow from Proposition 4.2 in Shapiro (1986) applied to the GLM-PO2PLS model with a normally distributed outcome. It has been shown in [3] that the proposition applies to the PO2PLS model. Similarly, we define τ as the mapping from a θ′ to Σ_θ′, given in (13), and the discrepancy function F as F(S; Σ_θ′), where S is the maximum likelihood estimator of the covariance matrix of (x, y, z). The function F can be recognized as the discrepancy of two log-likelihoods evaluated at S and Σ_θ′, respectively, with minimizer Σ_θ̂. The mapping τ is analytic and quadratic in θ′. It follows from the definition of F and the regularity of the normal log-likelihood ℓ that F is non-negative, zero only if S = Σ_θ′, and positive everywhere else. Also ℓ, and thus F, is twice continuously differentiable, and since GLM-PO2PLS is identifiable, F has a positive definite Hessian at θ′. Proposition 4.2 in Shapiro (1986) then states that the elements of Σ_θ̂ are asymptotically normally distributed, and Theorem 2 in the main article follows.

Covariance matrix of the coefficients

Let I(θ̂) be the observed Fisher information matrix. To obtain Π_α, the submatrix of I⁻¹(θ̂) corresponding to α (denoted I⁻¹(α)) has to be calculated. However, inverting I(θ̂) is computationally infeasible, even for moderate dimensions. Under the additional assumptions that α and the remaining parameters are asymptotically independent and that σ̂_g² is non-random, I⁻¹(α) can be calculated and used to approximate Π_α. The 2K × 2K observed Fisher information matrix I(α), given in (14), involves conditional expectations of cubic and quadratic terms of (T_i, H_i). It can be reformulated in terms of the first conditional moments μ_i = E[(T_i, H_i)] and the corresponding second conditional moments, which are readily available from the E step of the EM algorithm.
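Under this approximation, inference on α reduces to standard Wald-type computations, as in the following sketch (ours), where I_alpha is the assembled 2K × 2K observed information matrix:

    import numpy as np
    from scipy import stats

    def wald_inference(I_alpha, alpha_hat):
        """Approximate Pi_alpha by the inverse of I(alpha) and derive standard
        errors, z-statistics and two-sided p-values for alpha = (a, b)."""
        cov = np.linalg.inv(I_alpha)            # approximates Pi_alpha
        se = np.sqrt(np.diag(cov))
        z = alpha_hat / se
        pval = 2 * stats.norm.sf(np.abs(z))
        return se, z, pval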

S 2 Simulation
In this section, we first show the simulation results omitted from the main article. We then describe the additional simulations and show the corresponding results.

S 2.1 Simulation results of the main article

We evaluated the performance of GLM-PO2PLS for both normally and Bernoulli distributed outcomes under different combinations of sample size, dimensionality, heterogeneity level, and noise level. In the main article, we only described the results of coefficient estimation and outcome prediction in high-dimensional settings. Here, we show the results of coefficient estimation, type I error and outcome prediction in low-dimensional settings, and the results of loading estimation and feature selection in both high and low dimensions.

Figure S2 shows the type I error in low-dimensional settings. Overall, the type I errors were higher than those in the high-dimensional settings. In scenarios with small sample size and high noise level, they were above 10%; they decreased below 6% when the sample size was increased to 1000, and moved closer to 5% when the sample size was further increased to 10000.

Figure S4 shows the inner products of the estimated loading vectors and the corresponding true loadings. Overall, the loading estimation was accurate, except for the x-loadings in the scenarios with small sample size and high noise level. Recall that in these scenarios, where the x-loadings were not well estimated, the coefficient a for the x-joint components in the linear predictor of z tended to be underestimated.

Figure S5 shows the performance of feature selection. From Figure S5a, for a continuous outcome in high-dimensional settings, in the scenarios with small sample size and moderate heterogeneity, the median TPR of GLM-PO2PLS was 0.90 under the low noise level and decreased to 0.62 when the dataset was noisier; the TPR of ridge-x stayed around 0.25 regardless of the noise level. When the sample size was large, the median TPR of GLM-PO2PLS increased to 0.96 under the low noise level and to 0.86 under the high noise level, while the median TPR of ridge-x increased to 0.50 under both noise levels. The amount of heterogeneity did not have much impact on the TPRs. Comparing Figure S5b for a binary outcome to Figure S5a, GLM-PO2PLS performed similarly well, while the performance of ridge-x improved. In low dimensions (Figures S5c and S5d), the TPRs of GLM-PO2PLS decreased slightly compared to those in the high-dimensional settings; on the contrary, the TPR of ridge-x increased substantially. Note that GLM-PO2PLS still outperformed ridge-x in low dimensions.

S 2.2 Additional simulations

We perform two additional simulations: one to evaluate the estimation accuracy of a and b with multiple components, and one to assess the power of the chi-square test using parameters estimated from the Down syndrome dataset. We consider here only high-dimensional omics (p = 2000, q = 25) with a continuous outcome.

Figure S1 depicts the results of coefficient estimation in low-dimensional settings. Overall, the performance was similar in the low-dimensional and high-dimensional settings. In the scenarios with small sample size and high noise, the errors were larger than those in the high-dimensional settings.

Figure S2: Type I error in low-dimensional settings. The error bars show the standard errors of the estimates. The dotted horizontal line is the significance level of 5%.

Figure S3 depicts the results of outcome prediction in low-dimensional settings. The performance was similar to that in the high-dimensional settings, and the conclusions drawn there also held here.

Figure S4: Performance of loading estimation. The y-axis shows the inner product of the estimated loading vectors and the corresponding true loadings; an inner product of 1 indicates that the loading vector is accurately estimated. Boxes show the results of 500 repetitions.

Figure S5: Performance of feature selection. The y-axis shows the true positive rate calculated on the top 25% of features in x (those with the largest absolute loading values in GLM-PO2PLS, or with the largest absolute regression coefficients in ridge regression). Boxes show the results of 500 repetitions.

Figure S7: Empirical power. The noise proportion in the outcome z is shown on the x-axis; among these values, 75% is the proportion estimated from the DS dataset.