Smooth and Probabilistic PARAFAC Model with Auxiliary Covariates

Abstract As immunological and clinical studies become more complex, there is an increasing need to analyze temporal immunophenotypes alongside demographic and clinical covariates, where each subject receives matrix-valued time series observations for potentially high-dimensional longitudinal features, as well as other static characterizations. Researchers aim to find the low-dimensional embedding of subjects using matrix-valued time series observations and investigate relationships between static clinical responses and the embedding. However, constructing these embeddings can be challenging due to high dimensionality, sparsity, and irregularity in sample collection over time. In addition, the incorporation of static auxiliary covariates is frequently desired during such a construction. To address these issues, we propose a smoothed probabilistic PARAFAC model with covariates (SPACO) that uses auxiliary covariates of interest. We provide extensive simulations to test different aspects of SPACO and demonstrate its application to an immunological dataset from patients with SARS-CoV-2 infection. Supplementary materials for this article are available online.


Introduction
Sparsely observed multivariate time series data are now common in immunological studies. For each subject or participant i (i = 1, . . ., I), we can collect multiple measurements on J features over time, but often at n_i different time stamps {t_i,1, . . ., t_i,n_i}. For example, for each subject, immune profiles are measured for hundreds of markers at irregular sampling times in Lucas et al. (2020) and Rendeiro et al. (2020). Let X_i ∈ R^{n_i×J} be the longitudinal measurements for subject i; we can collect X_i for all I subjects into a sparse three-way tensor X ∈ R^{I×T×J}, where T = |∪_i {t_i,1, . . ., t_i,n_i}| is the number of distinct time stamps across all subjects. Since the sets {t_i,1, . . ., t_i,n_i} tend to be small in size and have low overlap across subjects, X may have a high missing rate along the time dimension.
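To make the data layout concrete, here is a minimal numpy sketch of assembling per-subject matrices into the sparse tensor X. The toy data and variable names are our own, not taken from the SPACO package:

```python
import numpy as np

# Hypothetical raw data: times[i] holds subject i's sampling times (days),
# mats[i] the matching (n_i x J) measurement matrix.
times = [np.array([0, 3, 7]), np.array([1, 3])]
mats = [np.ones((3, 4)), np.ones((2, 4))]

# The union of all observed time stamps defines the T axis of the tensor.
all_t = np.unique(np.concatenate(times))
I, T, J = len(times), len(all_t), mats[0].shape[1]

# Fill a dense array with NaN for unobserved (i, t) slices.
X = np.full((I, T, J), np.nan)
for i, (t_i, x_i) in enumerate(zip(times, mats)):
    X[i, np.searchsorted(all_t, t_i), :] = x_i

miss_rate = np.isnan(X).mean()  # high even with only two subjects here
```

Because each subject contributes only a few of the T pooled time stamps, the missing rate along the time axis grows quickly with the number of subjects.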
In addition to X_i, researchers often have a set of non-temporal covariates z_i ∈ R^q for subject i, such as medical conditions and demographics, which may partially account for the variation in the temporal measurements X across subjects. Modeling such auxiliary covariates Z := (z_1, . . ., z_I) together with X might improve the estimation quality and the understanding of cross-subject heterogeneity.
In this paper, we propose SPACO (smooth and probabilistic PARAFAC model with auxiliary covariates) to adapt to the sparsity along the time dimension in X and to utilize the auxiliary variables Z. SPACO assumes that X is a noisy realization of a low-rank signal tensor whose time components are smooth and whose subject scores may depend on the auxiliary covariates Z:

x_itj = Σ_{k=1}^K u_ik φ_tk v_jk + ε_itj, u_i ~ N(η_i, Λ_f).

Here, (1) β ∈ R^{q×K} describes the dependence of the expected subject score η_i for subject i on z_i, and (2) u_ik, φ_tk, v_jk are the subject score, trajectory value and feature loading for factor k in the PARAFAC model at the observation indexed by (i, t, j), where u_i has a normal prior N(η_i, Λ_f). We impose smoothness on the time trajectories (φ_tk)_{t=1}^T and sparsity on β to deal with the irregular sampling along the time dimension and the potentially high dimensionality of Z.
Alongside the model proposal, we also discuss several issues related to SPACO, including model initialization, auto-tuning of the smoothness and sparsity penalties, and hypothesis testing on β through cross-fitting. Successfully addressing these issues is crucial for practitioners interested in applying SPACO in their analyses.
In the remainder of the article, we summarize closely related work in Section 1.1, describe the SPACO model in Section 2, and present model parameter estimation with fixed tuning parameters in Section 3. In Section 4, we discuss the aforementioned issues that can be important in practice. We compare SPACO to several existing methods in Section 5. Finally, in Section 6, we apply SPACO to a highly sparse tensor data set of immunological measurements from SARS-CoV-2 infected patients. We provide a Python package SPACO for researchers interested in applying the proposed method.

Related work
In the study of multivariate longitudinal data in economics, researchers have combined tensor decomposition with auto-cross-covariance estimation and autoregressive models (Fan et al., 2008; Lam et al., 2011; Fan et al., 2011; Bai and Wang, 2016; Wang et al., 2019, 2021). These approaches do not work well with highly sparse data or do not scale well with the feature dimension, both of which are important for working with medical data. Functional PCA (Besse and Ramsay, 1986; Yao et al., 2005) is often used for modeling sparse longitudinal data in matrix form. It utilizes the smoothness of time trajectories to handle sparsity in the longitudinal observations and estimates the eigenvectors and factor scores under this smoothness assumption. Yokota et al. (2016) and Imaizumi and Hayashi (2017) introduced smoothness to tensor decomposition models and estimated the model parameters by iteratively solving penalized regression problems. The methods above do not consider the auxiliary covariates Z.
It has previously been observed that including Z can improve estimation. Li et al. (2016) proposed SupSFPC (supervised sparse and functional principal component) and showed that auxiliary covariates improve the signal estimation quality in the matrix setting for modeling multivariate longitudinal observations. Lock and Li (2018) proposed SupCP, a supervised multiway factorization model for complete observations that employs a probabilistic tensor model (Tipping and Bishop, 1999; Mnih and Salakhutdinov, 2007; Hinrich and Mørup, 2019). Although an extension to sparse tensors is straightforward, SupCP does not model smoothness and can be much more affected by severe missingness along the time dimension.
SPACO can be viewed as an extension of functional PCA and SupSFPC to the setting of three-way tensor decomposition (Acar and Yener, 2008; Sidiropoulos et al., 2017) using the parallel factor (PARAFAC) model (Harshman and Lundy, 1994; Carroll et al., 1980). It uses a probabilistic model and jointly models the smooth longitudinal data with potentially high-dimensional non-temporal covariates Z. We refer to the SPACO model as SPACO- when no auxiliary covariate Z is available. SPACO- itself is an attractive alternative to existing tensor decomposition implementations, with probabilistic modeling, smoothness regularization, and automatic parameter tuning. In our simulations, we compare SPACO with SPACO- to demonstrate the gain from utilizing Z.

Notations
Let X ∈ R^{I×T×J} be a tensor for some sparse multivariate longitudinal observations, where I is the number of subjects, J is the number of features, and T is the number of total unique time points. For any matrix A, we let A_{i:}/A_{:i} denote its i-th row/column, and we write X_I ∈ R^{I×(TJ)}, X_T ∈ R^{T×(IJ)} and X_J ∈ R^{J×(IT)} for the matrix unfoldings of X in the subject, time and feature dimensions respectively. We also define the column-wise Khatri-Rao product: for A ∈ R^{m×K} and B ∈ R^{n×K}, A ⊙ B ∈ R^{mn×K} is the matrix whose k-th column is the Kronecker product A_{:k} ⊗ B_{:k}.
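These conventions can be checked numerically. A small sketch (the helper below is our own, not the package's API) verifying that the subject-mode unfolding of a rank-K PARAFAC tensor equals U(Φ ⊙ V)ᵀ under row-major flattening:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Khatri-Rao product: column k is kron(A[:, k], B[:, k])."""
    K = A.shape[1]
    return np.stack([np.kron(A[:, k], B[:, k]) for k in range(K)], axis=1)

rng = np.random.default_rng(0)
I, T, J, K = 5, 4, 3, 2
U, Phi, V = (rng.normal(size=(I, K)), rng.normal(size=(T, K)),
             rng.normal(size=(J, K)))

# Rank-K PARAFAC tensor: x_itj = sum_k u_ik * phi_tk * v_jk.
X = np.einsum('ik,tk,jk->itj', U, Phi, V)

# Subject-mode unfolding X_I (I x TJ) equals U (Phi ⊙ V)^T when the
# (t, j) pairs are flattened in row-major order (index t*J + j).
X_I = X.reshape(I, T * J)
assert np.allclose(X_I, U @ khatri_rao(Phi, V).T)
```

The same identity holds for the other two unfoldings after permuting the modes, which is what makes the Khatri-Rao product convenient in the update formulas.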

Smooth and probabilistic PARAFAC model with covariates
We assume X to be a noisy realization of an underlying signal array F ∈ R^{I×T×J}. We collect the subject scores, time trajectories and feature loadings into U ∈ R^{I×K}, Φ ∈ R^{T×K} and V ∈ R^{J×K}, and denote their entries by u_ik/φ_tk/v_jk. We let x_itj denote the (i, t, j)-entry of X. Then

x_itj = f_itj + ε_itj, where f_itj = Σ_{k=1}^K u_ik φ_tk v_jk, ε_itj ~ N(0, σ_j^2), u_i ~ N(η_i, Λ_f).

Even though X is often of high rank, we consider the scenario where the rank K of F is small. Without covariates, we set the prior mean parameter η_i = 0. If we are interested in explaining the heterogeneity in η_i across subjects with auxiliary covariates Z ∈ R^{I×q}, we may model η_i as a function of z_i := Z_{i,:}. Here, we consider a linear model η_ik = z_i β_k for all k = 1, . . ., K. To avoid confusion, we will always call X the "features", and Z the "covariates" or "variables".
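A short simulation from this generative model may help fix ideas. The particular trajectories, noise levels and sparse β below are illustrative choices of ours, not the paper's simulation design:

```python
import numpy as np

rng = np.random.default_rng(1)
I, T, J, K, q = 50, 10, 6, 2, 3

Z = rng.normal(size=(I, q))                  # static covariates
beta = np.zeros((q, K)); beta[0, 0] = 1.0    # sparse dependence on Z
eta = Z @ beta                               # prior means of subject scores
Lam_f = np.diag([0.5, 0.5])                  # prior covariance of u_i
U = eta + rng.normal(size=(I, K)) @ np.sqrt(Lam_f)

t = np.linspace(0.0, 1.0, T)
Phi = np.column_stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
V = rng.normal(size=(J, K)); V /= np.linalg.norm(V, axis=0)
sigma = rng.uniform(0.1, 0.5, size=J)        # feature-specific noise levels

# x_itj = sum_k u_ik * phi_tk * v_jk + eps_itj, eps_itj ~ N(0, sigma_j^2).
signal = np.einsum('ik,tk,jk->itj', U, Phi, V)
X = signal + rng.normal(size=(I, T, J)) * sigma  # noise broadcasts over j
```

Masking a random subset of (i, t) slices in such a simulation reproduces the sparse-sampling regime that SPACO targets.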
We refer to U as the subject scores, which characterize differences across subjects and are latent variables.We refer to V as the feature loadings, which reveal the composition of the factors using the original features and could assist the downstream interpretation.
Finally, Φ is referred to as the time trajectories, which can be interpreted as function values sampled from underlying smooth functions φ_k(t) at a set of discrete time points, e.g., φ_tk = φ_k(t). Recalling that X_I ∈ R^{I×(TJ)} is the unfolding of X in the subject direction, we write O_i for the indices of observed values in the i-th row of X_I, and X_{I,O_i} for the vector of these observed values. Each such observed value x_itj has noise variance σ_j^2, and we write Λ_{O_i} for the diagonal covariance matrix whose diagonal values σ_j^2 are the corresponding variances of ε_itj at the indices in O_i. Similarly, we define {O_t, X_{T,O_t}, Λ_{O_t}} for the unfolding X_T ∈ R^{T×(IJ)}, and {O_j, X_{J,O_j}, Λ_{O_j}} for the observed indices, the associated observed vector and the diagonal covariance matrix for the j-th row in X_J ∈ R^{J×(IT)}. The marginalized log-likelihood integrating out the randomness in U enjoys a closed form (Lock and Li, 2018), and the marginal likelihood can also be expressed equivalently in a second form.
We use the form in eq. (4) to derive the updating formulas and the criteria for rank selection, since it does not involve the inverse of a large non-diagonal matrix.
Model parameters in eq. (3) or eq. (4) are not identifiable due to (1) parameter rescaling and (2) reordering of the components k = 1, . . ., K. More discussion of model identifiability can be found in Lock and Li (2018). Hence, adopting similar rules from Lock and Li (2018), we require: (C.1) the columns of V and Φ are normalized, with ‖V_k‖_2 = 1 and ‖Φ_k‖_2^2 = T; (C.2) the latent components are in decreasing order based on their overall variances Λ_f,kk + ‖Zβ_k‖_2^2/I, and the first non-zero entries in V_k and Φ_k are positive, e.g., v_1k > 0 and φ_1k > 0 if they are non-zero.
To deal with the sparse sampling along the time dimension, and because features are often smooth over time in practice, we assume that the time component Φ_k is sampled from a slowly varying trajectory function φ_k(t), and we encourage smoothness of φ_k(t) by directly penalizing the function values via a penalty term Σ_k λ_1k Φ_k' Ω Φ_k. This paper considers Laplacian smoothing (Sorkine et al., 2004) with a weighted adjacency matrix Γ. Let T(t) represent the time associated with the slice X_{:,t,:}; we define Ω as the graph Laplacian associated with Γ. Practitioners may choose other forms for Ω. If practitioners want φ_k(t) to have slowly varying derivatives, they can instead use a penalty matrix that penalizes changes in gradients over time.
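As an illustration, here is one plausible construction of Ω as a graph Laplacian over a chain of irregularly spaced time points. The inverse-gap weighting is our own assumption; the paper's exact Γ may differ:

```python
import numpy as np

def laplacian_penalty(times, scale=1.0):
    """Graph Laplacian Omega for a chain over ordered time points.

    Adjacent time points are connected with weight inversely proportional
    to their gap (one plausible choice, not necessarily the paper's)."""
    T = len(times)
    Gamma = np.zeros((T, T))
    for t in range(T - 1):
        w = scale / (times[t + 1] - times[t])
        Gamma[t, t + 1] = Gamma[t + 1, t] = w
    D = np.diag(Gamma.sum(axis=1))
    return D - Gamma  # graph Laplacian: phi' Omega phi sums weighted diffs

Omega = laplacian_penalty(np.array([0.0, 1.0, 2.0, 4.0]))

# A constant trajectory incurs zero roughness penalty...
phi = np.ones(4)
# ...while a spiky one is penalized.
bumpy = np.array([1.0, 0.0, 0.0, 0.0])
```

By construction, φ'Ωφ = Σ_t Γ_{t,t+1}(φ_{t+1} − φ_t)^2, so the penalty vanishes exactly on constant trajectories.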
Further, when the number of covariates q in Z is moderately large, we may wish to impose sparsity on the β parameter. We encourage such sparsity by including a lasso penalty (Tibshirani, 2011) in the model. In summary, our goal is to find parameters maximizing the expected penalized log-likelihood, or minimizing the penalized expected deviance loss, under norm constraints. Only the identifiability constraint (C.1) enters the objective; we can always guarantee (C.2) by flipping signs in V, Φ, β and reordering the components afterward without changing the achieved objective value.
Eq. (5) describes a non-convex problem. We find locally optimal solutions via an alternating update procedure: (1) fixing the other parameters and updating β via lasso regressions; (2) fixing β and updating the other model parameters using an EM algorithm.
We give details of our iterative estimation procedure in Section 3.

Model parameter estimation
Given the model rank K and penalty parameters λ_1k, λ_2k, we alternately update the parameters β, V, U, s^2 and σ^2 with a mixed EM procedure described in Algorithm 1. We briefly explain the updating steps here. (1): Given the other parameters, we find β to directly minimize the objective by solving a least-squares regression problem with a lasso penalty.
(2): Fixing β, we update the other parameters using an EM procedure. Denote the current parameters by Θ^0. Our goal is to minimize the penalized expected negative log-likelihood under the current posterior distribution U|Θ^0. We adopt a block-wise parameter updating scheme in which we update V_k, Φ_k, Λ_f and σ_j^2 sequentially.

Algorithm 1: SPACO with fixed penalties
Data: X, Ω, λ_1, λ_2, K
Result: Estimated V, Φ, β, s^2, σ^2, the posterior P(U|Θ, X) and the marginalized density P(X|Θ).
1 Initialize V, Φ, β, s^2, σ^2 and the posterior distribution of U;
2 while not converged do
3-9 Update β and then the blocks V_k, Φ_k, Λ_f and σ^2 in turn, where ν is the largest value leading to the minimizer satisfying the norm constraints;
10 Update the posterior distribution of U;
11 end

Algorithm 1 describes the high-level ideas of our updating scheme. In lines 5 and 6, we enforce the norm constraints on V_k and Φ_k by adding an additional quadratic term and setting its coefficient ν to guarantee the norm requirements. Even though this is not a convex problem, the proposed approach yields optimal solutions for the sub-routines updating the different parameter blocks, and the penalized (marginalized) deviance loss is non-increasing over the iterations.
Theorem 3.1. In Algorithm 1, let Θ^ℓ and Θ^{ℓ+1} be the estimated parameters at the beginning and the end of the ℓ-th iteration of the outer while loop. Then J(Θ^ℓ) ≥ J(Θ^{ℓ+1}).
Proof of Theorem 3.1 is given in Appendix B.1. In Algorithm 1, the posterior distribution of u_i for each row of U is Gaussian, with posterior covariance equal to Σ_i defined earlier and posterior mean given below.
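For a linear-Gaussian model of this form, the posterior of u_i follows from standard conjugacy. A sketch under that standard algebra (the function name and arguments are ours; F_obs plays the role of the observed rows of the Khatri-Rao design for subject i):

```python
import numpy as np

def posterior_u(F_obs, x_obs, noise_var, eta_i, Lam_f):
    """Gaussian posterior of u_i given subject i's observed entries.

    Model: x_obs = F_obs @ u_i + eps, eps ~ N(0, diag(noise_var)),
           u_i ~ N(eta_i, Lam_f).
    Returns the posterior mean and covariance."""
    Lam_f_inv = np.linalg.inv(Lam_f)
    Lam_inv = np.diag(1.0 / noise_var)
    prec = F_obs.T @ Lam_inv @ F_obs + Lam_f_inv       # posterior precision
    Sigma_i = np.linalg.inv(prec)
    mu_i = Sigma_i @ (F_obs.T @ Lam_inv @ x_obs + Lam_f_inv @ eta_i)
    return mu_i, Sigma_i

# Sanity check: with near-noiseless observations the posterior mean
# recovers the true u_i regardless of the (weak) prior.
rng = np.random.default_rng(4)
F = rng.normal(size=(12, 2))
u_true = np.array([1.0, -2.0])
mu, Sigma = posterior_u(F, F @ u_true, np.full(12, 1e-8),
                        np.zeros(2), np.eye(2))
```

Subjects with fewer observed entries get posteriors that shrink more strongly toward the prior mean η_i, which is exactly how the auxiliary covariates help under heavy missingness.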
Explicit formulas and steps for carrying out the subroutines at lines 4-7 and line 9 are deferred to Appendix A.1.

Initialization, tuning and testing

Initialization
One initialization approach is to form a Tucker decomposition [U^⊥, Φ^⊥, V^⊥; G] of X using HOSVD/MLSVD (De Lathauwer et al., 2000), where U^⊥, Φ^⊥ and V^⊥ are the unitary matrices multiplying the core tensor along the subject, time and feature directions respectively (K_1/K_2/K_3 is the smaller of K and I/T/J), and then to perform a PARAFAC decomposition on the small core tensor G (Bro and Andersson, 1998; Phan et al., 2013). We initialize SPACO with Algorithm 2, which combines this approach with functional PCA (Yao et al., 2005) to work with sparse longitudinal data.

Algorithm 2: Initialization of SPACO
1 Let V^⊥ be the top K_3 left singular vectors of X_J using only the observed columns.
3 Let Φ^⊥ be the top K_2 eigenvectors from functional PCA on the row aggregation of the matrices Y(k), k = 1, . . ., K_3 (see Appendix A.2 for details on functional PCA), for some small δ.
5 Let U^⊥ be the top K left singular vectors of Ũ, and let G ∈ R^{K×K×K} be the estimated core array obtained by rearranging G̃.
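The HOSVD-based part of the initialization can be sketched as follows for a fully observed, noiseless rank-K tensor; the functional-PCA step that handles sparse data is not shown here:

```python
import numpy as np

rng = np.random.default_rng(2)
I, T, J, K = 8, 6, 5, 2
U0, P0, V0 = (rng.normal(size=(I, K)), rng.normal(size=(T, K)),
              rng.normal(size=(J, K)))
X = np.einsum('ik,tk,jk->itj', U0, P0, V0)   # noiseless rank-K tensor

# Leading left singular vectors of each unfolding give the Tucker factors.
Usvd = np.linalg.svd(X.reshape(I, T * J), full_matrices=False)[0][:, :K]
Psvd = np.linalg.svd(X.transpose(1, 0, 2).reshape(T, I * J),
                     full_matrices=False)[0][:, :K]
Vsvd = np.linalg.svd(X.transpose(2, 0, 1).reshape(J, I * T),
                     full_matrices=False)[0][:, :K]

# Small K x K x K core: a PARAFAC fit on G is cheap compared to X itself.
G = np.einsum('itj,ia,tb,jc->abc', X, Usvd, Psvd, Vsvd)

# In this noiseless case the Tucker factors reproduce X exactly.
X_hat = np.einsum('abc,ia,tb,jc->itj', G, Usvd, Psvd, Vsvd)
```

Running PARAFAC on the K×K×K core instead of the full I×T×J tensor is what makes this initialization inexpensive.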
Proof of Lemma 4.1 is given in Appendix B.2.

Auto-selection of tuning parameters
Selection of regularizers λ_1 and λ_2: One way to choose the tuning parameters λ_1k and λ_2k is cross-validation. However, this can be computationally expensive even if we tune each parameter sequentially, and it is difficult to determine a good set of candidate values before running SPACO. Instead, we determine the tuning parameters via nested cross-validation, which has been shown to be empirically useful (Huang et al., 2008; Li et al., 2016). In nested cross-validation, the parameters are tuned within their corresponding subroutines: (1) in the update for Φ_k, perform column-wise leave-one-out cross-validation to select λ_1k, where we leave out all observations from a single time point (see Appendix A.3); (2) in the update for β_{:,k}, perform K-fold cross-validation to select λ_2k.
Rank selection: We can perform rank selection via cross-validation as suggested for SupCP (Lock and Li, 2018). To reduce the computational cost, we deploy two strategies. (1) Early stopping: we stop increasing the rank once the cross-validated marginal log-likelihood no longer improves. (2) Warm start and short run: we initialize the model parameters for each cross-validated model at the global solution and only run a small number of iterations. We found that 5 or 10 iterations are usually sufficient in practice (the default maximum number of iterations is 10).

Covariate importance
In our synthetic experiments in Section 5, we observe that the inclusion of Z can result in better estimates of the subject scores when the true subject scores depend strongly on Z. A natural follow-up question is whether we can ascertain the importance of such covariates when they exist. Here, we consider the construction of approximate p-values from conditional independence and marginal independence tests between Z_j and U_k. Both questions are of practical interest.
Recap on randomization-based hypothesis testing: Before we proceed to our proposal, we review some basic results for hypothesis testing via randomization. Consider a linear regression problem with response y and covariates Z. A randomization test is a procedure for constructing a valid p-value without assuming the correctness of the linear model of y on Z.
Testing for marginal independence between y and any covariate Z_j is straightforward.
Let t(Z_j, y) be a test statistic. Let T := t(Z_j, y) and T* := t(Z*_j, y), where Z*_j is a permutation of Z_j. Then, under the null hypothesis H_0^marginal, T is exchangeable with independent copies of T*. Hence, P(T > t*_{1−α} | y) ≤ α for any α ∈ (0, 1), where t*_{1−α} is the (1 − α)-percentile of the empirical distribution formed by the copies of T* and ∞.
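The marginal permutation test is easy to implement; a sketch using absolute correlation as the (arbitrary) test statistic, with names of our own choosing:

```python
import numpy as np

def permutation_pvalue(z_j, y, n_perm=999, rng=None):
    """One-sided permutation p-value for marginal association of z_j and y.

    The +1 in numerator and denominator corresponds to including the
    observed statistic itself (the 'infinity' copy in the text), which
    guarantees the p-value is super-uniform under the null."""
    rng = rng or np.random.default_rng(0)
    t_obs = abs(np.corrcoef(z_j, y)[0, 1])
    t_null = np.array([abs(np.corrcoef(rng.permutation(z_j), y)[0, 1])
                       for _ in range(n_perm)])
    return (1 + np.sum(t_null >= t_obs)) / (n_perm + 1)
```

With a perfectly associated pair the p-value hits its lower bound 1/(n_perm + 1), while under the null it is approximately uniform on (0, 1].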
Suppose that we have access to the conditional distribution of Z_j | Z_{j^c}. Let t(Z_j, Z_{j^c}, y) be a test statistic. Let T := t(Z_j, Z_{j^c}, y) and T* := t(Z*_j, Z_{j^c}, y), where Z*_j is an independent copy generated from the conditional distribution of Z_j | Z_{j^c}. Then, under the null hypothesis H_0^partial, T and T* have the same law conditional on Z_{j^c} and y, so we again obtain a valid p-value (Candès et al., 2018).
Oracle randomization test in SPACO: Let us now go back to SPACO and consider the ideal setting where V, Φ, s^2 and σ^2 are given. Lemma 4.2 forms the basis of our proposal.

Lemma 4.2. Given V, Φ, s^2 and σ^2, let β*_ℓ be the true regression coefficients on the covariates Z for μ_ℓ, ℓ = 1, . . ., K; for any k = 1, . . ., K, define the pseudo-response ỹ_i(δ), pseudo-covariates z̃_i(δ) and remainder term ∆_i(δ).

Proof of Lemma 4.2 is given in Appendix B.3. As a result, when the term ∆_i(δ) = 0, we can apply the traditional randomization tests with response ỹ_i(δ) and features z̃_i(δ) for subject i. Proposition 4.3 gives our construction of the observed and randomized test statistics.
We replace Z_j with the properly randomized Z*_j to create T*_marginal(δ) and T*_partial(δ) (Z*_j is a permutation of Z_j in T*_marginal(δ) and is generated independently from Z_j | Z_{j^c} in T*_partial(δ)); the resulting tests are valid when ∆(δ) = 0. In practice, β_ℓ for the other factors ℓ ≠ k are also estimated from the data. When δ ≠ 0, we could then have ∆_i(δ) ≠ 0, which renders the randomization test invalid. Thus, we typically set δ = 0. For convenience, we drop the δ argument when δ = 0, e.g., T_partial = T_partial(0).
Algorithm 3 summarizes our proposal.
Thus, the signal-to-noise ratio is higher with δ = 1 if we have access to the true β * .
The conditional randomization test requires generating Z*_j from the conditional distribution Z_j | Z_{j^c}. We estimate this conditional distribution using an appropriate exponential family model. More details on the generation of Z*_j are provided in Appendix A.4.

Approximate p-value construction with estimated model parameters: The true model parameters V, Φ, s^2 and σ^2 are of course unknown. We substitute their empirical estimates into the above procedure. However, the substitutes from a full SPACO fit may suffer from fitting towards the empirical noise. To reduce the influence of overfitting, we use V, Φ from cross-validation, as described for rank selection.
That is, for i ∈ V_m, where V_m is the index set for fold m in cross-validation, we construct ỹ_i and z̃_i using V^{−m}, Φ^{−m}, the estimates obtained from folds other than V_m. Σ_i(0) is also estimated using V^{−m}, Φ^{−m} and the cross-validated prior covariance Λ_f^{−m}. We refer to this as cross-fitting: we estimate some model parameters using data from the other folds and update the quantity of interest using them together with the current fold. Since each fold is initialized at the global solution, this further encourages the comparability of the constructed ỹ and z̃ across folds. We observe that the cross-fit offers better Type I error control than naively plugging in the full estimates.

Reconstruction quality evaluation
We compare SPACO, SPACO-, plain CP from the Python package tensorly, and a proxy to SupCP obtained by setting λ_1k = λ_2k = 10^-2 as small fixed values in SPACO (the additional small penalties improve numerical stability with large q and a high missing rate). Although this is not exactly SupCP, the two are very similar. We refer to this special case of SPACO as SupCP and include it to evaluate the gain from smoothness regularization on the time trajectories. Note that we use the proposed initialization in SPACO, SPACO- and SupCP, which provides better and more stable results when the missing rate is high. A comparison between SupCP with random initialization and with the proposed initialization is given in Appendix C.1. We use the true rank in all four methods. Figure 1 shows the achieved correlation between the reconstructed tensors and the underlying signal tensors across different setups and 20 random repetitions. SNR1 represents c_1 and SNR2 represents c_2, and each sub-plot corresponds to a different combination of the signal-to-noise ratios SNR1 and SNR2, as indicated by its row and column names. The y- and x-axes indicate the achieved correlation and the different combinations of J and observed rate, respectively. For example, the x-axis label J10 r0.1 means the feature dimension is 10 and 10% of the entries are observed. The "Raw" method correlates the true signal with the empirical observations on only the observed entries. It is not directly comparable to the others in the presence of missing data, but we include it to show the signal level of the different simulation setups. We also compare the reconstruction quality on missing entries in Appendix C.2.
SPACO outperforms SPACO- when the subject score U depends strongly on Z, which could result from a more accurate estimation of the subject scores. To confirm this, we evaluate the estimation quality of U at J = 10 and SNR2 = 10, measuring the estimation quality by R^2 (regressing the true subject scores on the estimated ones). In Figure 2, we show the achieved (1 − R^2) for SPACO and SPACO- (smaller is better), where the x-axis labels represent the observed rate and the column names represent the component, e.g., Comp1 → U_1. SPACO/SPACO- are both top performers for our smooth tensor decomposition problem and achieve significantly better performance than CP and SupCP when the signal is weak and the missing rate is high, by utilizing the smoothness of the time trajectory. To see this, we compare the estimation quality of the time trajectories using SPACO and SupCP.
In Figure 3, we show the achieved (1 − R^2) for SPACO and SupCP at J = 10. In the x-axis

Variable importance and hypothesis testing
We investigate the approximate p-values based on cross-fitting for testing the partial and marginal associations of Z with η under the same simulation set-ups. Since the variables in Z are generated independently, the two null hypotheses coincide in this setting. However, the two tests have different power at the same p-value cut-off because the test statistics differ.
The proposed randomization tests for SPACO achieve reasonable Type I error control.
Cross-fitting is important for good Type I error control: In Appendix C.4, we present qq-plots comparing p-values using cross-fitted V and Φ against the naive construction. We observe noticeable deviations from the uniform distribution for the latter construction when the signal-to-noise ratio is low. Fig. 10 and Fig. 5 show the achieved Type I error and power with p-value cut-offs at α = 0.01, 0.05 and with observed rate r = 0.5. The Type I errors are also well controlled for r ∈ {1.0, 0.1} (Appendix C.3).

Case study
We now apply SPACO to a longitudinal immunological data set on COVID-19 from the IMPACT study (Lucas et al., 2020). Initially, the data contained 180 samples on 135 immunological features, collected from 98 participants with COVID-19 infection. We filtered out features with more than 20% missingness among the collected samples and imputed the missing observations using MOFA (Argelaguet et al., 2018) for the remaining features. This leaves us with a complete matrix of 111 features and 180 samples, which is organized into a tensor of size (I, T, J) = (98, 35, 111), where T is the number of unique DFSO (days from symptom onset) in this data set. This is a sparsely observed tensor: on average, each subject is observed at only 1.84 time points. Apart from the immunological data, the data set also provides non-immunological covariates. We use eight static risk factors as our covariates Z (COVID risk1-COVID risk5, age, sex, BMI), and four symptom measures as additional responses (ICU, clinical score, length of stay, coagulopathy). We run SPACO with the processed X, Z, and model rank K = 4, selected by finding the first maximum of the five-fold cross-validated marginal log-likelihood. Fig. 6 shows example plots of raw observations against our reconstructions across all samples. We see a positive correlation between the observed and reconstructed values.
Combining the static covariates Z with the longitudinal measurements can sometimes improve the estimation quality of the subject scores compared to SPACO-, as illustrated in our synthetic data examples. We cannot observe the true subject scores in the real data set. However, we can still assess whether the estimated subject scores are clinically relevant by comparing them to the responses. Here, we also run SPACO- with K = 4 and examine whether the resulting subject scores from SPACO associate better with the responses.
Table 1 shows the Spearman correlations and their p-values when comparing the estimated subject scores to each response. The top three components C1, C2, C3 are significantly associated with the responses. Among them, C2 uniquely captures variability in the length of stay (in hospital), and only C3 depends heavily on Z (the estimated coefficients are 0 for both C1 and C2). C3 from SPACO achieves a better association with the clinical outcomes than its counterpart from SPACO-. Using the randomization test, we can also test for the contribution of each covariate.

Discussion
We propose SPACO to jointly model sparse multivariate longitudinal data and auxiliary covariates. The smoothness regularization may lead to a dramatic improvement in estimation quality under high missing rates, and access to informative auxiliary covariates can improve the estimation of subject scores. We applied the proposed pipeline to COVID-19 data sets. Even though both data sets are highly sparse and of small size, SPACO identifies components whose subject scores are strongly associated with clinical outcomes of interest, and identifies static covariates that might contribute to severe symptoms. A future direction is to extend SPACO to model multi-omics data. Different omics can be measured at different times and have distinct data types or scales. This motivates a more careful design of a tailored model rather than a naive approach of blindly pooling all omics data together.

A Additional Algorithmic Details
A.1 Updating rules in Algorithm 1

Lemma A.1 provides the exact parameter update rules used in Algorithm 1. We write ⟨·⟩ for the expectation of a quantity with respect to the posterior distribution of U. Let O ∈ {0, 1}^{I×T} with O_it = 1 if time t is observed for subject i and O_it = 0 otherwise.
Lemma A.1. The parameter update steps in Algorithm 1 take the following forms. Here (Σ_i)_{k,:} is the k-th row of Σ_i, and Λ_{f,−k} denotes the submatrix of Λ_f with the k-th column and row removed.
In line 5, the update of V_k solves a quadratic problem in V_k. In line 6, the update of Φ_k solves the analogous problem in the entries φ_tk, where ν is the largest value such that the solution satisfies ‖Φ_k‖_2^2 = T.
• In lines 14 and 15, we update s^2 and σ^2. The derivation of Lemma A.1 is given in Appendix B.4.

A.2 Functional PCA for initializations
In Yao et al. (2005), the authors suggest a functional PCA analysis performing an eigenvalue decomposition of the covariance matrix fitted with a local linear surface smoother. Here, we apply the suggested estimation approach to the total product matrix. Let Ŵ_{i,s,t} = Σ_k Y_is(k) Y_it(k) be the empirical estimate of the total product matrix for subject i. We first estimate E[Ŵ_{i,s,t}] via a local surface smoother and then perform PCA on the estimated E[Ŵ_{i,s,t}]:
• To fit a local linear surface smoother for the off-diagonal elements of W_{s_0,t_0}, we consider a locally weighted least-squares problem, where κ : R^2 → R is a standard two-dimensional Gaussian kernel.

A.3 Parameter tuning on λ 1k
In this section, we provide more details on the leave-one-time-out cross-validation for tuning λ_1k, k = 1, . . ., K. From eq. (19), the expected penalized deviance loss can be written, keeping only the terms relevant to Φ, in terms of a diagonal matrix A ∈ R^{T×T} and a vector a ∈ R^T defined for each k. When we leave out a specific time point t_0, we optimize for Φ_k by minimizing the corresponding leave-one-out constrained loss, where A(t_0) is A with the (t_0, t_0)-entry zeroed out and a(t_0) is a with the t_0-entry zeroed out.
The leave-one-time-out cross-validation error is calculated from the expected deviance loss (unpenalized) at the left-out time point t_0, where φ̂^{−t_0}_{t_0 k} is the leave-one-out estimate. To reduce the computational cost, we drop the norm constraint when picking λ_1k, which enables a short-cut in terms of Φ̂_k, the unconstrained solution using all time points. This follows from the arguments below: (1) dropping the constraint, the original problem is equivalent to an unconstrained quadratic problem; (2) consider the augmented problem with ã(t_0) = a(t_0) + δ(t_0), where δ_{t_0}(t_0) = A_{t_0,t_0} φ̂^{−t_0}_{t_0 k} and δ_t(t_0) = 0 for t ≠ t_0. The augmented problem also has Φ̂^{−t_0}_k as its solution.
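The short-cut above is an instance of the standard leave-one-out identity for linear smoothers, which can be checked numerically. The chain Laplacian and weights below are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(3)
T, lam = 6, 0.7
# Chain graph Laplacian as the roughness penalty Omega.
Gamma = np.zeros((T, T))
for t in range(T - 1):
    Gamma[t, t + 1] = Gamma[t + 1, t] = 1.0
Omega = np.diag(Gamma.sum(1)) - Gamma

A = np.diag(rng.uniform(0.5, 2.0, T))      # diagonal per-time weights
y = rng.normal(size=T)                      # working response a / A_tt

H = np.linalg.solve(A + lam * Omega, A)     # linear smoother matrix
phi_hat = H @ y                             # full-data fit

t0 = 2
# Brute-force leave-one-time-out fit: zero out t0's data term.
A0 = A.copy(); A0[t0, t0] = 0.0
phi_loo = np.linalg.solve(A0 + lam * Omega, A0 @ y)

# Short-cut: the left-out fitted value from full-data quantities only.
phi_loo_t0 = (phi_hat[t0] - H[t0, t0] * y[t0]) / (1 - H[t0, t0])
```

The identity holds because the leave-one-out solution also solves the full problem with the t_0-entry of the response replaced by its own fitted value, exactly the augmented-problem argument above.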

A.4 Generation of randomized covariates for testing
In the simulations and real data examples, we encounter two types of Z_j: Gaussian and binary. We model the conditional distribution of Z_j given Z_{j^c} with a (penalized) GLM.
For Gaussian data, we consider the model Z_j = Z_{j^c} θ + ε with ε ~ N(0, σ^2). We estimate θ and σ^2 empirically from the data. When q, the dimension of Z, is large, we apply a lasso penalty on θ with the penalty level selected by cross-validation. Let θ̂ and σ̂^2 be our estimates of the distribution parameters. We then generate a new z*_ij for subject i from the estimated distribution, z*_ij = z_{i,j^c} θ̂ + ε*_ij, with ε*_ij independently generated from N(0, σ̂^2). For binary Z_j, we consider the model log[P(Z_j = 1)/(1 − P(Z_j = 1))] = Z_{j^c} θ.
Again, we estimate θ empirically, with a cross-validated lasso penalty for large q, and then generate z*_ij independently from the estimated Bernoulli distribution. To generate Z*_j from the marginal distribution of Z_j, instead of estimating this distribution, we let Z*_j be a random permutation of Z_j. The randomization test requires generating Z*_j from the conditional or marginal distribution of Z_j and estimating the resulting distribution of T*. We estimate the distribution of T* by fitting a skewed t-distribution as suggested in Katsevich and Roeder (2020). Using the fitted Ĝ(·) instead of the naive empirical CDF can greatly reduce the computational cost: we may obtain accurate estimates of small p-values around 10^-4 using only 200 independent generations of Z*_j and the fitted Ĝ(·).
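A sketch of the Gaussian conditional sampler, with plain least squares standing in for the cross-validated lasso used in the text when q is large (function and variable names are our own):

```python
import numpy as np

def sample_gaussian_zj(Z, j, n_draws=1, rng=None):
    """Draw Z*_j from an estimated Gaussian conditional Z_j | Z_{j^c}.

    Fits Z_j ~ Z_{j^c} by least squares with an intercept and resamples
    residuals from the fitted normal distribution."""
    rng = rng or np.random.default_rng(0)
    X = np.delete(Z, j, axis=1)                     # covariates Z_{j^c}
    X1 = np.column_stack([np.ones(len(Z)), X])      # add intercept
    theta, *_ = np.linalg.lstsq(X1, Z[:, j], rcond=None)
    resid = Z[:, j] - X1 @ theta
    sigma = resid.std(ddof=X1.shape[1])             # estimated noise scale
    # Fitted conditional mean plus fresh Gaussian noise for each draw.
    return X1 @ theta + rng.normal(scale=sigma, size=(n_draws, len(Z)))
```

Each draw keeps the dependence of Z_j on the other covariates while breaking any dependence on the response, which is exactly what the conditional randomization test needs.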

B Proofs
B.1 Proof of Theorem 3.1

The first step leads to a non-decreasing marginal log-likelihood by definition, while the second step is an EM procedure. If we can show that the penalized marginal log-likelihood is non-decreasing at each subroutine of the EM procedure, we prove Theorem 3.1.
For simplicity, let θ be the parameter block being updated in some subroutine and Θ_{\θ} the parameters excluding θ. If the updated value θ' is no worse than θ under the EM objective, it is also no worse than θ for the (regularized) marginal MLE (Dempster et al., 1977). We include the short proof here for completeness.
Because the log posterior of U is the difference between the log complete-data likelihood and the log marginal likelihood, L(U|X, Θ) = L(U, X|Θ) − L(X|Θ), taking expectations with respect to the posterior distribution of U under parameters Θ yields the stated identity. As a result, the desired inequality holds; the last step uses the fact that the Kullback–Leibler term E_Θ log[P(U|X, Θ)/P(U|X, Θ')] is nonnegative. Now we return to our subroutines for updating the parameters (V, Φ, s², σ²): • The subroutines updating s² and σ² are defined as the maximizers of the EM objective, so the inequality holds directly. • From the proof of Lemma A.1 deriving the explicit updates for Φ and V (see the referenced section), we know that J(V_k) is a convex quadratic loss, J(V_k) = Σ_{j=1}^J a_{jk} v_{jk}² + b_{jk} v_{jk}, for coefficients with explicit forms. If we can show the constrained update does not increase this loss, then updating V_k to the new vector is valid. This follows from standard Lagrangian arguments: let ν be a value such that ‖V_k‖₂² = 1 at the penalized solution, and let V*_k be the solution at this ν. Then V*_k attains the constrained minimum, and hence the proposed strategy at line 5 in Algorithm 1 solves the original problem.
The same arguments hold for updating Φ k .
Hence, given Θ^ℓ, the parameters at the beginning of iteration ℓ, if we update the model parameters as proposed to acquire Θ^{ℓ+1}, we always have J(Θ^{ℓ+1}) ≤ J(Θ^ℓ).
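The monotonicity established above is the standard EM guarantee. As a sanity check, a toy two-component Gaussian-mixture EM (a generic illustration, not the SPACO updates themselves) exhibits the same non-decreasing marginal log-likelihood:

```python
import numpy as np

def em_gmm_1d(x, iters=25):
    """EM for a two-component 1-D Gaussian mixture; returns the
    marginal log-likelihood at the start of each iteration, which
    the EM argument guarantees to be non-decreasing.
    """
    mu = np.array([x.min(), x.max()])       # spread-out initialization
    s2 = np.array([1.0, 1.0])
    pi = np.array([0.5, 0.5])
    lls = []
    for _ in range(iters):
        # E-step: component densities and posterior responsibilities
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
        lls.append(np.sum(np.log(dens.sum(axis=1))))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: block updates of (pi, mu, s2), each maximizing
        # the expected complete-data log-likelihood
        n = r.sum(axis=0)
        pi = n / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n
        s2 = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n
    return np.array(lls)

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
lls = em_gmm_1d(x)
```

Each block update (here pi, mu, s2; in SPACO the blocks V, Φ, s², σ²) maximizes the EM surrogate, which is exactly the structure the proof exploits.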
B.2 Proof of Lemma 4.1
The lemma asserts that there exists a rank-K core array with the stated properties. Statement I can be checked easily: since U⊥ spans the row space of X_I, we can find a matrix A such that, if we replace U* with U⊥A, we still have a PARAFAC decomposition of X. We can apply the same argument to Φ⊥ and V⊥, and hence prove the statement at the beginning. As a result, we need only show that U⊥, Φ⊥ and V⊥ span the column spaces of the three unfolding matrices and that G is the unfolding of such a core array along the subject dimension.
• The proposed V⊥ satisfies the requirement by construction. Hence, we need only check that U⊥ and Φ⊥ span the row spaces of X_I and X_T, respectively.
• The projection of X_J onto V⊥ yields H = V⊥ᵀ X_J = C(Φ* ⊙ U*)ᵀ, where C = V⊥ᵀ V*. The row space spanned by Ũ is therefore the same as the row space spanned by X_I; thus the space spanned by the top K left singular vectors of Ũ coincides with that of X_I. As a result, U⊥ also satisfies the requirement.
• In particular, we also obtain the stated identity, where G_I is the unfolding of the core array G along the subject dimension. Hence, we recover A, B, C by applying a rank-K PARAFAC decomposition to the three-dimensional core array rearranged from G_I, as described in Algorithm 2.

B.3 Proof of Lemma 4.2
Let e_k = (0, . . ., 1, . . ., 0)ᵀ denote the k-th standard basis vector. From eq. (4) and eq. (5), the update of β considers the stated objective, where a_{jk} and b_{jk} are defined by the displayed expressions. Consequently, the solution for s²_k given the other parameters has the displayed closed form, and the updating rule for σ²_h given the other parameters follows analogously.

C Additional numerical results
C.1 Comparison of two initialization strategies
Fig. 7 compares the achieved correlations with the signal tensor when SupCP is initialized using the proposed initialization method (referred to as SupCP) and random initialization (referred to as SupCP random) (Lock and Li, 2018). The proposed strategy shows a clear gain in settings with a high missing rate or a weak signal.

C.2 Signal reconstruction on missing entries
We estimate U⊥ and G from the regression coefficients. In a noiseless model with δ = 0 and complete temporal observations, one may replace the functional PCA step of Algorithm 2 with standard PCA. Then [U, Φ, V] becomes a PARAFAC decomposition of (1/(1+δ)) X.
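In this fully observed, noiseless special case the problem reduces to a plain rank-K CP fit. A minimal alternating-least-squares sketch on a complete tensor (helper names and conventions are ours; this is not Algorithm 2 itself, which handles sparsity and smoothness):

```python
import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product: rows indexed by (row of A, row of B)
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

def cp_als(X, K, iters=200, seed=0):
    """Rank-K PARAFAC/CP of a fully observed tensor X (I x T x J)
    via alternating least squares on the three unfoldings."""
    rng = np.random.default_rng(seed)
    I, T, J = X.shape
    U = rng.standard_normal((I, K))
    Phi = rng.standard_normal((T, K))
    V = rng.standard_normal((J, K))
    X0 = X.reshape(I, T * J)                      # unfold along subjects
    X1 = np.moveaxis(X, 1, 0).reshape(T, I * J)   # unfold along time
    X2 = np.moveaxis(X, 2, 0).reshape(J, I * T)   # unfold along features
    for _ in range(iters):
        U = X0 @ np.linalg.pinv(khatri_rao(Phi, V).T)
        Phi = X1 @ np.linalg.pinv(khatri_rao(U, V).T)
        V = X2 @ np.linalg.pinv(khatri_rao(U, Phi).T)
    return U, Phi, V

# Usage: exact recovery of a synthetic rank-2 tensor
rng = np.random.default_rng(3)
U0, P0, V0 = (rng.standard_normal((n, 2)) for n in (6, 5, 4))
X = np.einsum('ik,tk,jk->itj', U0, P0, V0)
U, Phi, V = cp_als(X, K=2)
```

On exact low-rank data the reconstruction error drives to numerical zero; with missing entries or smoothness penalties, this plain ALS no longer applies and the full algorithm is needed.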

Figure 1: Reconstruction evaluation by the correlations between the estimates and the true signal tensor. In each subplot, the x-axis label indicates the different J and observing rates, the y-axis is the achieved correlation, and the box colors represent different methods. The subplot column/row names give the signal-to-noise ratios SNR1/SNR2.

Figure 2: Comparison of SPACO and SPACO- for reconstructing U at J = 10, SNR2 = 10. In each subplot, the x-axis label indicates the component and observing rate, the y-axis is the achieved (1 − R²), and the box colors represent different methods. The subplot column/row names give the signal-to-noise ratio SNR1/component.

Figure 3: Comparison of SPACO and SupCP for reconstructing Φ at J = 10. In each subplot, the x-axis label indicates the component and observing rate, the y-axis is the achieved (1 − R²), and the box colors represent different methods. The subplot column/row names give the signal-to-noise ratios SNR1/SNR2.

Figure 4: Achieved type I errors at observing rate r = 0.5. In each subplot, the x-axis label indicates the combination of feature dimension J and target level α ∈ {0.01, 0.05}, and the y-axis is the achieved type I error. Bar colors represent different tests (partial or marginal). The two dashed horizontal lines indicate levels 0.01 and 0.05.

Figure 5: Achieved power at observing rate r = 0.5. In each subplot, the x-axis label indicates the combination of feature dimension J and target level α ∈ {0.01, 0.05}, and the y-axis is the achieved power. Bar colors represent different tests (partial or marginal).

Figure 6: The left panel shows example time trajectories of four features for different subjects (the horizontal time axis is DFSO); the right panel plots observed feature values against those estimated by SPACO. Each line/dot represents observations from a single subject. Subjects in the ICU are colored red.

Figure 7: Reconstruction evaluation of the underlying signal tensor using different initializations in SupCP, measured by the correlations. In each subplot, the x-axis label indicates the different J and observing rates, the y-axis is the achieved correlation, and the box colors represent different methods. The subplot column/row names give the signal-to-noise ratios SNR1/SNR2.

Fig. 8 compares the achieved correlations with the signal tensor for the different methods, restricted to the missing entries. Consistent with Fig. 1, SPACO improves over SPACO- with high SNR2, and is much better than SupCP or CP when the signal size SNR1 is weak (but still estimable using SPACO) or when the missing rate is high.

Figure 8: Reconstruction evaluation on missing entries by the correlations between the estimates and the true signal tensor. In each subplot, the x-axis label indicates the different J and observing rates, the y-axis is the achieved correlation, and the box colors represent different methods. The subplot column/row names give the signal-to-noise ratios SNR1/SNR2.
Figure 9:

Figure 10: Achieved type I errors at observing rate r = 0.1. In each subplot, the x-axis label indicates the combination of feature dimension J and target level α ∈ {0.01, 0.05}, and the y-axis is the achieved type I error. Bar colors represent different tests (partial or marginal). The two dashed horizontal lines indicate levels 0.01 and 0.05.

Figure 11: qq-plots of constructed p-values at r = 1.0. The p-values are estimated with the nct distribution. The red/blue points represent the cross-fit and the naive procedure, respectively, with the blue diagonal representing the expected theoretical behavior.

Figure 12: qq-plots of constructed p-values at r = 0.5. The p-values are estimated with the nct distribution. The red/blue points represent the cross-fit and the naive procedure, respectively, with the blue diagonal representing the expected theoretical behavior.

Figure 13: qq-plots of constructed p-values at r = 0.1. The p-values are estimated with the nct distribution. The red/blue points represent the cross-fit and the naive procedure, respectively, with the blue diagonal representing the expected theoretical behavior.

Figure 14: qq-plots of constructed p-values at r = 1.0. The p-values are estimated directly from the empirical distribution. The red/blue points represent the cross-fit and the naive procedure, respectively, with the blue diagonal representing the expected theoretical behavior.

Table 1: Association between estimated subject scores and four response variables. C1–C4 represent the four components in SPACO/SPACO-. C*cor is the achieved Spearman correlation and C*pval is the p-value of the corresponding correlation test.

Table 2 shows the p-values and adjusted p-values from the partial dependence test and the marginal dependence test, with the number of randomizations B = 2000.
Table 2: Results from the randomization test for component 3 (C3). The column nonzero counts the number of 1's for the binary covariates COVIDRISK_1–COVIDRISK_5 and sex (1 = Male, 0 = Female). The adjusted p-values (pval*) are based on the BH correction.