Graphical Model Selection for Gaussian Conditional Random Fields in the Presence of Latent Variables

ABSTRACT We consider the problem of learning a conditional Gaussian graphical model in the presence of latent variables. Building on recent advances in this field, we suggest a method that decomposes the parameters of a conditional Markov random field into the sum of a sparse and a low-rank matrix. We derive convergence bounds for this estimator and show that it is well-behaved in the high-dimensional regime as well as “sparsistent” (i.e., capable of recovering the graph structure). We then show how proximal gradient algorithms and semi-definite programming techniques can be employed to fit the model to thousands of variables. Through extensive simulations, we illustrate the conditions required for identifiability and show that there is a wide range of situations in which this model performs significantly better than its counterparts, for example, by accommodating more latent variables. Finally, the suggested method is applied to two datasets comprising individual level data on genetic variants and metabolites levels. We show our results replicate better than alternative approaches and show enriched biological signal. Supplementary materials for this article are available online.


Introduction
The task of performing graphical model selection arises in many applications in science and engineering. There are several factors that make this problem particularly challenging. First, it is common that only a subset of the relevant variables are observed and estimators that do not account for hidden variables are therefore prone to confounding. On the other hand, modeling latent variables is itself difficult because of identifiability and tractability issues. Second, the number of variables being modeled is often greater than the number of samples. It is well known that, in such a scaling regime, obtaining a consistent estimator is usually impossible without making further assumptions about the model, for example, sparsity or low-dimensionality. Finally, modeling the joint distribution over all observed variables is not always relevant. It is sometimes preferable to learn a graphical model over a number of variables of interest while conditioning on the rest of the collection.
These problems are encountered in many fields of application. In genetics, for example, one might model a gene expression network conditional on the samples' combinations of DNA variants: the variables of interest are the expression levels, while the DNA variants are included because of their predictive power and capacity to explain some of the observed correlations between genes (Stearns 2010). As genotype is not causally influenced by gene expression levels (i.e., the direction of effect only goes genotype to expression), we would like to model expression levels conditional on genotype. For another example, consider the task of modelling stock returns conditional on sentiment analysis data. The variables that encode sentiment about the stocks have value (Li et al. 2014), but modeling their joint distribution might be difficult and unnecessary, hence the need for conditioning. Moreover, a number of unmeasured variables (e.g., energy prices) might impact many stocks and should be modeled for better predictive accuracy .
The problem of learning a Gaussian graphical model in the presence of latent variables was considered by . They suggest estimating an inverse covariance matrix which is the sum of a sparse and a low-rank matrix. Another partial solution to our problem was introduced independently by Sohn and Kim (2012) and Wytock and Kolter (2013) who defined the concept of a sparse Gaussian conditional random field: a regularized maximum likelihood estimator that learns a Gaussian graphical model over a subset of the variables (X, say) while conditioning on the remaining variables (Z, say). , Sohn and Kim (2012), and Wytock and Kolter (2013) made significant advances to the problem of model selection in general graphical models, but there exist many situations, where we may wish to allow for latent variables and condition on some of those measured. Here, we suggest learning a Gaussian conditional random field in the presence of latent variables and introduce a novel regularized maximum likelihood estimator which fits into the "low-rank plus sparse" framework (Chandrasekaran et al. 2009;Candès et al. 2011). In our setting, inputs (variables in Z) are allowed to act on the outputs (X) in both a sparse and a low-rank fashion, while the inverse covariance matrix over X is estimated conditional on Z and on the marginalized latent variables. As will be shown later, this approach allows us to correctly recover graphs that are typically denser and with more hidden variables than the ones that can be handled by other methods.
From both a theoretical and a computational point of view, modeling latent variables with a conditional random field gives rise to a number of complications (e.g., the proximal operator is not defined in a closed form) that we address in this article. In particular, we derive convergence bounds for our estimator and show that under suitable identifiability conditions it is consistent in the high-dimensional regime as well as "sparsistent" (i.e., capable of recovering the graph structure). We then show how the alternating direction method of multipliers (Boyd et al. 2010) and semi-definite programming techniques can be employed to fit the model to thousands of variables. Through extensive simulations, we illustrate the conditions required for identifiability and show that there is a wide range of situations in which this model performs significantly better than its counterparts. In order to show how our model behaves in a realistic setting, we apply the present estimator to two datasets comprising genetic variants and metabolite levels. Both replication and a test statistic constructed using an independent source of validation suggest that our estimates have more biological relevance than the results obtained via other methods.

Setup
Throughout, we consider n independent, identically distributed realizations of a zero-mean random vector Y ∈ R m+p+h . Y is indexed by disjoint subsets of {1, . . . , m + p + h}, denoted Z, X, H and with respective cardinality m, p, and h. They correspond to the variables we wish to condition on, the variables we wish to model and the hidden variables. We write Y Z (resp. Y X and Y H ) for the subvector of Y indexed by Z (resp. X and H). Our main assumption is that the distribution of Y X Y H ∈ R p+h conditional on Y Z ∈ R m is normal and that its mean is a linear combination of the inputs Y Z . Thus, conditioning on Y Z , Y X follows a multivariate normal distribution. More precisely, we assume a Gaussian conditional random field parameterized as follows: where we have used partitioned matrices to show the contributions of the observed and hidden variables.
The superscript − * is used to indicate that these matrices are parameters of the model, as opposed to estimates. Note that there are no distributional assumptions about Y Z .
Finally, we assume that variables indexed by H are unobserved. Accordingly, we compute the marginal distribution Y X |Y Z , which yields T . This expression follows straightforwardly from the formula for the inverse of a partitioned matrix (the full derivation is given in the supplementary materials). From Equation (2.1), the log-likelihood function can be expressed in terms of the sample covariance matrices For clarity, all terms related to a given subset will be dropped from the expression when the subset is empty. For example, whenever Z = H = ∅ the log-likelihood becomes (S X ; X ) = log det S X − Tr( n X S X ). Note that our assumption about the Gaussianity of X, H plays an important role in the interpretation of the parameters (M * X , M * XH , . . .). Under this assumption, it is well known that the structure of the conditional Gaussian graphical model (GGM) over X, H can be read-off these matrices directly by looking at the location of their nonzero entries (Lauritzen 1996). Briefly, a graphical model is a statistical model defined according to a graph, whose nodes are random variables and whose edges encode conditional independence statements between variables (Lauritzen, 1996) Note that since the conditional mean vector is a linear transformation of Y Z , this interpretation of the non-zero entries of M * ZH and M * ZX holds irrespective of Y Z 's distribution.

Goal
In typical applications such as the ones mentioned in the introduction, S * X is the target. Since it encodes the structure of the graphical model over X, recovering S * X can provide insight into the causal mechanisms underpinning the data but, in general, hidden variables make it impossible to access this parameter directly. Instead, it follows from Equations (2.1) and (2.2) that only the sum S * X − L * X can be inferred (similarly, only S * ZX − L * ZX is accessible). The maximizer of the log-likelihood (2.2) is not unique and the problem is fundamentally misspecified. We are therefore facing two related, but distinct, problems: r identifiability: under which conditions does the problem admit a unique solution? Ideally, these conditions ought to be as broad as possible so that they will be met in realistic situations. Note that unlike the breakdown caused by the high-dimensional regime, this kind of non-identifiability is more fundamental and remains no matter how large the number of samples.
r consistency: provided there exists a unique solution, can we derive a consistent, tractable estimator which is capable of recovering (S * X , L * X , S * ZX , L * ZX )? Here, we chose to focus on S * X because it fits our application but there might be situations in which other parameters are of interest, for example, S * ZX in Zhang and Kim (2014).

Previous Work
In practice, model selection in the context of GGMs is often performed using 1 -regularized maximum likelihood estimators (MLEs) such as the ones introduced by Banerjee, El Ghaoui and d' Aspremont (2008); Yuan and Lin (2007), and the socalled graphical lasso (Friedman, Hastie, and Tibshirani 2008). The 1 -norm is the convex envelope of the 0 unit ball and is therefore a natural convex relaxation to learn sparse matrices.
Building on the success of the graphical lasso, estimators of the form "log-likelihood" + "non-Euclidian convex penalty" have received considerable interest (Chandrasekaran, Recht, Parrilo, and Willsky 2012). A relevant example is the use of the nuclear norm (i.e., the sum of the singular values) as a convex relaxation for learning low-rank models (Bach 2008). Beyond their attractive computational properties, the 1 and nuclear norm regularized MLEs enjoy strong theoretical guarantees (Bach 2008;Ravikumar et al. 2011). Using penalized MLEs, the questions raised above (Section 2.1) have been solved in some special cases of model (2.1).

Sparse Gaussian Conditional Markov Random Field: H = ∅
When H is empty, (2.1) reduces to The log-likelihood associated with this model is convex and maximum-likelihood estimates can be obtained in closed form. In order to increase the interpretability of the estimates and cope with high-dimensionality, Sohn and Kim (2012); Wytock and Kolter (2013) suggested the following estimator of (S * X , S * ZX ): with λ n > 0. The entries of both S X and S ZX are being shrunk in order to jointly learn a pair of sparse matrices describing the direct effects of Z on X and the graph over X. Wytock and Kolter (2013) studied the theoretical properties of this estimator and derived a set of sufficient conditions for the correct recovery of S * X and S * ZX . Among other results, they showed that this approach often outperforms the graphical lasso in terms of predictive power and model selection accuracy. Alternative parameterizations and approaches have been suggested in the multivariate linear regression literature. We refer the reader to Yin and Li (2011); Sohn and Kim (2012), and references therein for more details on these estimators and their relative performances.

Low-Rank Plus Sparse Decomposition: Z = ∅
The presence of latent variables (H = ∅) is a substantial complication. As explained earlier, the marginal precision S * X − L * X is then the sum of two matrices and the problem is fundamentally misspecified. However, following the seminal work of Candès et al. (2011) and Chandrasekaran et al. (2009), Chandrasekaran, Parrilo and showed that it is sometimes possible to correctly decompose S * X − L * X into its summands. Loosely speaking, this is the case if S * X is sparse and there are few hidden variables with an effect spread over most of the observed variables. As a result,  introduced an estimator which penalises the 1 -norm of S X and the nuclear norm of L X as follows: Here, ||L X || * denotes the nuclear norm of L X (i.e. the sum of its singular values). Among other useful results,  showed that this estimator is, under suitable conditions, sparsistent and "ranksistent": the sign patterns of both the entries of S and the spectrum of L can be recovered exactly.

Suggested Estimator
As hinted in the introduction, there are many cases where one might want to both condition and allow for latent variables. In such cases, neither the sparse Gaussian conditional Markov random field nor the low-rank plus sparse approach would be optimal. Building on these estimators, we propose decomposing the parameters of a Gaussian conditional Markov random field into the sum of a low-rank and a sparse matrix. To that end, we suggest optimizing the following regularized MLE Solving (2.4) amounts to minimizing a function which is jointly convex in its parameters over a convex constraint set (proofs are in the supplementary materials, along with other elementary properties of the likelihood). As mentioned earlier, this likelihood is structured around two parameters, S ZX and S X , accounting respectively for the direct (i.e., conditional on all variables) effects of Z on X and the structure of the graph over X. However, because we penalise the nuclear norm of L, the effect of all latent variables is modeled jointly and a single set of latent factors is learned. No distinction is being made between the variables that "mediate" the action of Z and the ones that act as confounders on X. On the other hand, the parameters S X and S ZX retain their interpretability.

Theoretical Analysis
According to our assumptions, we assume here that each sample is generated according to the model and ask under what circumstances Estimator (2.4) correctly recovers the parameters S * , L * (as built by stacking S * X , S * ZX and L * X , L * ZX ) with overwhelming probability. We analyze this problem in the framework of Chandrasekaran, Parrilo, and Willsky (2012) and therefore our proofs often mirror theirs. However, because of the form taken by the likelihood and because we do not limit ourselves to square matrices, the analysis is more involved.
As mentioned earlier, modeling latent variables by decomposing the parameters into a sum of two matrices raises identifiability issues: given samples drawn from (3.1), when is it possible to exactly decompose the sum S − L (where S, L are defined as before) into its summands? This is a problem which has been tackled in great generality in  and their results directly apply to the present situation: they are expressed in terms of the Fisher information matrix but do not explicitly involve the likelihood function. For that reason, key definitions, as well as assumptions necessary for our result to hold, are deferred to the supplementary materials.
Here we focus on the original contributions of this article by giving an intuition for these conditions before formally stating the consistency of the estimator defined by (2.4).

Identifiability
Until now, it was repeatedly mentioned that a "low-rank plus sparse decomposition" is possible when S is sparse and L is lowrank. However, it is clear that imposing conditions on the sparsity of S and the rank of L is not sufficient. For example, consider a matrix with a single entry: it is at the same time sparse and low-rank and there is, therefore, no unique way of decomposing it into the sum of a low-rank and a sparse matrix. Chandrasekaran et al. (2009) introduced the notion of rank-sparsity incoherence and define quantities that make it possible to express the conditions under which such a problem is well-posed, even for arbitrary matrices. Two concepts are particularly important (precise mathematical statements and explanations can be found in the supplementary materials): r ξ (T (L * )): a small ξ (T (L * )) guarantees that no single latent variable will have a strong effect on only a small set of the observed variables. It is closely related to the concept of incoherence found in Candès et al. (2011). r μ( (S * )) quantifies the diffusivity of S's spectrum. It can be shown that matrices with few nonzero entries per row/column (and thus sparse) have a small μ. A sufficient condition for identifiability can be expressed in terms of ξ, μ by requiring that their product be small enough (ξ (T (L * ))μ( (S * )) ≤ 1 6 C 2 ) and that the tuning parameter γ be chosen within a given range (γ ∈ [ 3ξ (T (L * )) C , C 2μ( (S * )) ]), for some constant C which depends on the Fisher information matrix (FIM). In other words, there must be a small number of latent variables acting on many observed ones and S * must not have too many non-zero entries in any given row or column. This is a condition on the parameters S * , L * and it is related to the problem of decomposing the sum of two matrices. Moreover, it can be shown that natural classes of matrices satisfy these assumptions. In particular, the degree of S * (q) and number of latent variables (h) are allowed to grow as a function of the problem size p, m (Chandrasekaran et al. 2009). For example, under some assumptions about the distribution from which L * is sampled, one shows that ξ (T (L * )) ∼ h p and that scaling regimes of the form q ∼ log(p + m) b and h ∼ p log(p+m) 2b (for 0 ≤ b < ∞) guarantee identifiability with high probability (see Section 4.2 in . We call the restrictions on ξ, μ, and γ Assumption 1. Another issue is that one does not directly observe S * − L * but samples generated from (3.1). All lasso-type methods face this problem and conditions on the FIM are usually imposed (the so-called irrepresentability condition) (Ravikumar et al. 2011). Similar assumptions about the FIM are made here and detailed in the supplementary materials. This is Assumption 2.

Consistency
We can now present our main result and state the consistency of Estimator (2.4) (see supplementary materials for the proof). First, let us recall that for any matrix P, ||P|| 2 denotes its largest singular value and ||P|| ∞ is its largest entry in magnitude. We can then define the following quantities: We prove the following theorem in the supplementary materials (Q 1 to Q 6 are constants whose definitions are deferred for clarity): Theorem 1 (Algebraic Consistency). Suppose that Assumptions 1 and 2 hold and that we are given n samples drawn according to (3.1). Further assume that the following hold: (a) n ≥ pM ξ (T (L * )) 4 max(2, 256ψ * X 2 W 2 ). (b) (σ min and θ min conditions) Let the minimum nonzero singular value σ of L * and the minimum nonzero entry of S * in magnitude θ be such that .
Seen at a high-level, Theorem 1 is analogous to the result obtained by  for the low-rank plus sparse (LR+S) estimator. A particularly important feature is that Assumption 1 holds even the degree of S * and the rank of L * grow with the dimensions of the problem. Taking as example the scaling regime mentioned in Section 3.1, we see that n p log(p + m) 4b M samples are required for Theorem 1 to hold with high probability. This is also enough to guarantee that max( 1 γ Ŝ − S * ∞ , L − L * 2 ) = o P (1), but it should be contrasted with the logarithmic scaling usually encountered in the 1 -regularized literature, for example, one asks log(pm) n = o(1) for the SCGGM estimator of Wytock and Kolter (2013).
In order to further compare the convergence rates of our estimator and LR+S, a few points are worth considering.
First, we do not make any distributional assumptions about Y Z and there are therefore many scenarios in which only Theorem 1 applies. For the sake of comparison, we can assume that Y Z follows a normal distribution so that the consistency theorem of LR+S is applicable. Since LR+S does not model a conditional distribution, Z and X are modeled jointly. The estimated matrices, (Ŝ LR+S ,L LR+S ), are of size (p + m) × (p + m) and, to obtainŜ X ,Ŝ ZX , . . . , the relevant sub-matrices are extracted from the larger (p + m) × (p + m) estimates. Considering the scaling described in Section 3.1, we see that high-dimensional regimes of the form p n + m n = O(n 1−a ) for 0 < a ≤ 1 cover interesting applications and are enough to guarantee consistency. To see why the convergences rates are comparable, start by noticing that under Assumption 1 of either theorem m n = o(n) is required for consistent estimation. Now, since Y Z follows a normal distribution, we have that lim n→∞ ψ Z = σ 2 (1 + √ y) 2 , for some 0 ≤ σ < ∞ and where y = lim n→∞ m n /n (see, e.g., Th. 5.11 in Bai and Silverstein 2009). Therefore, ψ Z = O(1) and p n M n = O(( √ p n + √ m n ) 2 ) = O(p n + m n ) so that assuming p n + m n = O(n 1−a ) for 0 < a ≤ 1 is also enough for consistent estimation in many settings of interest.
Second, μ and ξ play an identical role in both Theorem 1 and (Chandrasekaran, Parrilo, and Willsky 2012, Theorem 4.1), namely through Assumption 1 and conditions (a) and (b). However, these quantities are usually different (i.e. μ( (S * )) = μ( (S * LR+S )), ξ (T (L * )) = ξ (T (L * LR+S ))), which has interesting implications. An obvious consequence is the one stated in the previous section: since μ, ξ define the acceptable range for γ , its span can vary widely across methods. More importantly, one shows that conditions (a) and (b) are driven by the lower-end of that range. Should it be assumed instead that γ = C 2μ( (S * )) (the upper-end), all three conditions would be relaxed (Chandrasekaran, Parrilo, and Willsky, 2012, Corollary 4.2). Thus, the smaller the value of ξ (T (L * )), the wider the acceptable range and the more likely Theorem 1 is to hold.

Optimization
Optimizing (2.4) in the high-dimensional setting is a challenging problem. For example, some of the constraints are hard to accommodate (e.g., S X − L X 0, L X 0) and the penalty terms are non-smooth. Fortunately, (2.4) has similarities with (2.3) (the estimator of Chandrasekaran, Parrilo, and Willsky 2012) and we can rely on algorithms that have proven effective on (2.3), namely the alternative direction method of multipliers (ADMM) (Boyd et al. 2010;Ma, Xue, and Zou 2013;Ye, Wang, and Xie 2011) and approaches relying on semi-definite programming (SDP) (Vandenberghe and Boyd 1996;Wang, Sun, and Toh 2010;Tütüncü, Toh, and Todd 2003). The general theory behind both ADMM and SDP is applicable to the problem at hand but features that are specific to (2.4) prevent a straightforward application of existing algorithms. SDP is an active field of research and recasting (2.4) within that framework makes it easier for the reader to use existing software and even benefit from future advances in that field. On the other hand, our ADMM implementation is tailored to the problem at hand but converges to a reasonable accuracy quickly. This is why we discuss both strategies. Technical details and step-by-step derivations are given in the supplementary materials.

The Alternating Direction Method of Multipliers
The alternating direction method of multipliers (ADMM) is a first-order optimization procedure which is well-suited to the minimization of large-scale convex functions. It proceeds by decomposing the original problem into more amenable subproblems which are then solved iteratively (Boyd et al., 2010). It is sometimes possible to obtain closed-form solutions to these subproblems but this is not required for ADMM to converge: even inexact iterative methods can be employed (Eckstein and Bertsekas 1992;Goldstein and Osher 2009). Moreover, only a few tens of iterations are necessary for ADMM to converge to an accuracy which is sufficient for most applications 1 (Boyd et al., 2010). ADMM (and related algorithms such as Bregman iterations and Douglas-Rachford splitting) has been celebrated as an efficient and robust general-purpose algorithm for 1regularized problems (Goldstein and Osher, 2009).
More recently, Ye, Wang, and Xie (2011) and Ma, Xue, and Zou (2013) used ADMM to solve (2.3) and showed that it can be optimized by iteratively solving four smaller subproblems (Ye, Wang, and Xie 2011). A similar decomposition is applicable to the problem at hand but, in the case of (2.4), one of the subproblems requires the computation of a so-called proximal operator which does not admit a closed-form solution. Consequently, we derived an algorithm which iteratively converges to this proximal operator. In practice, we found that only a few iterations (typically less than 10) of this subprocedure are necessary to obtain a good approximation to the proximal operator.

Recasting the Objective Function as a Semi-Definite Program
The solvers made available in the MATLAB® packages SDPT3 and Logdet-PPA are capable of solving problems of the form (Tütüncü, Toh, and Todd 2003;Wang, Sun, and Toh 2010): arg min X 1 ,X 2 ,...
Tr X 1 C T 1 + Tr X 2 C T 2 + · · · + a 1 log det(X 1 ) (4.1) subject to a number of linear, quadratic and positive semidefinite constraints 2 . Our goal is then to recast (2.4) as a problem  However, converging to a very high accuracy can be slow in comparison to second-order methods.  This is only a subset of the problems that can be tackled by such packages. See references for a formulation of this problem in its full generality.
of the same form as (4.1). We show in the supplementary materials that (2.4) admits the following SDP reformulation: arg min (4.2) (4.2) can easily be implemented in e.g. YALMIP and solved using LogdetPPA or SDPT3 (Löfberg 2004;Wang, Sun, and Toh 2010;Tütüncü, Toh, and Todd 2003). We remark that the objective function is now smooth (as opposed to (2.4)) but contains many more variables and constraints.

Simulations
We now study the properties of the proposed model on synthetic data and compare its performances to the three other methods introduced earlier: the graphical lasso (GLASSO) (Friedman, Hastie, and Tibshirani 2008), the sparse conditional Gaussian graphical model (SCGGM) (Sohn and Kim 2012; Zhang and Kim 2014; Wytock and Kolter 2013) and the low-rank plus sparse decomposition (LR+S) (Chandrasekaran, Parrilo, and Willsky 2012). The suggested approach will henceforth be referred to as LSCGGM (i.e., latent sparse conditional Gaussian graphical model).
In Section 3, it was established that assumptions about both the nominal parameters (S * , L * ) and the Fisher information matrix are necessary to guarantee the identifiability of the problem and, subsequently, the applicability of Theorem 1. In particular, we recalled the key role played by the maximum degree of S * and the incoherence of L * . To better understand when these assumptions are expected to hold, we simulate data from a set of graphical models that span the range of possible latent structures and measure the ability of the different methods to recover the underlying graphs.

Graphical Structures and Methods
The set of graphical structures we simulate from is constructed in such a way that only two integers, d Z and d H , describe the relevant properties (rank, sparsity, incoherence, degree) of S * ZX − L * ZX and L * X , respectively. Thus, d Z controls the relationship between inputs (Z) and outputs (X ) while d H encodes the behavior of L * X . The remaining parameter, S * X , remains unchanged throughout. We now briefly describe how the graphs are constructed but defer technical details to the supplementary materials (e.g., distribution of effect sizes). The code used to generate the data and fit our model is made available with this article.
For all simulations, each observation is generated according to a model of the form with Y Z a random vector of size p whose entries are drawn independently from a t-distribution with 4 degrees of freedom. Y X is also of size p. Here, Y X and Y H are drawn jointly from a conditional random Markov field but only Y X and Y Z are observed, which implies that L * The matrices S * X , L * X , and M * ZX are constructed as follows. The nonzero pattern of the p × p matrix S * X is identical across all simulations and is similar to the one adopted by Wytock and Kolter (2013): the graph over X is a chain of p variables in which one link out of five has been removed. The non-diagonal entries of S * X are such that S * X i j = 0, if and only if i = j + 1 and i ≡ 0 (mod 5).
As stated above, the rank/sparsity of L * X is described by a single integer, d H . Specifically, we assume that p is an integer of the form p = 2 k and pick d H ∈ {0, 1, . . . , k}. Then, for a fixed value of d H , M * H and M * ZX are random matrices constructed so that: (a) there are exactly 2 d H confounders, that is, the rank of L * X is 2 d H ; (b) each of the 2 d H confounders impacts exactly p/2 d H outputs; (c) each output is connected to exactly one latent variable. Thus, when d H = k, there is effectively no confounding since latent variables and outputs are in a one-to-one correspondence. When d H is much smaller than k, there are few confounders with an effect spread over many observed variables. When d H is set close to k, there are many hidden variables, each affecting a handful of outputs-a gross violation of the identifiability assumptions.
Likewise, d Z accounts for the structure of M * ZX . Here again, we assume p = 2 k and pick d Z ∈ {0, 1, . . . , k}. Then, M * ZX is designed to satisfy: (a) rk(M * ZX ) = 2 d Z ; (b) each row/column of M * ZX has exactly p/2 d Z nonzero entries. The effect of d Z is easily interpreted. For example, d Z = k is an ideal situation, where inputs and outputs are in a one-to-one correspondence. As d Z goes from k to 0, M * ZX becomes denser and increasingly incoherent. When d Z is close to k, M * ZX is estimated as a sparse matrix. When d Z is small, its decomposition is a single low-rank matrix.
Finally, since neither GLASSO nor LR+S model conditional distributions, we use these estimators as described in Section 3, i.e., by first modeling Z and X jointly and then extracting submatrices of the estimates.

Results
In our simulations, we set p = 32, n = 3000 and let (d Z , d H ) take values in {2, 3, 4, 5} 2 . Each of these 16 designs is replicated 20 times, for a total of 320 distinct datasets.
Here, we are interested in recovering the structure of S * X and we use precision/recall curves as a metric, thus ignoring the rank of the latent component. LR+S and LSCGGM both have two tuning parameters (λ and γ ). For each value of γ , one obtains a distinct precision/recall curve by varying λ. For each of the 320 simulated datasets, we computed the paths corresponding to 15 distinct values of γ and subsequently selected γ so as to maximise the area under the curve (AUC). Figure 1 shows the average precision/recall curves obtained by applying this procedure. First, we see that known methods behave as expected: GLASSO behaves best when there is no confounding and Z acts in a sparse fashion (d H = d Z = 5) ; SCGGM is more robust to changes in d Z , but this is restricted to situations in which there is not confounding (d H = 5); LR+S performs best when d H = 5 or when there is low-rank, diffuse confounding (d H = 2). In a number of cases, the method proposed here is better than any of the alternative methods and, in the worst cases, it offers comparable performances. Specifically, it outperforms LR+S significantly when both inputs and hidden variables act on the outputs through a relatively low-rank mechanism (d Z = 2, 3; d H = 2, 3). Two factors might explain this behavior: (a) the inputs are not normally distributed, which violates the assumptions of LR+S; (b) the data are generated according to a conditional random Markov field, which is not assumed by LR+S, and may result in a violation of its identifiability assumptions. d H = 4 corresponds to the extreme situation in which each latent variable confounds exactly two random variables. None of the methods performs well but LR+S and LSCGGM behave better than GLASSO and SCGGM in scenarios where one would not expect any differences (e.g., d Z = d H = 5). This is because LR+S and LSCGGM have two tuning parameters, one of which (γ ) is chosen with perfect knowledge: it improves the AUC of these methods but causesL X to be non-zero. Additional simulations made available in the supplementary materials show that when γ is chosen with cross-validation, the selected value of γ is indeed often too small.
Both LR+S and LSCGGM have two tuning parameters (λ, γ ): λ controls the overall shrinkage on the sparsity/rank of the estimates, γ accounts for the trade-off between sparse and low-rank components. To better understand the role of γ , we look at the precision/recall curves obtained for various values of this tuning parameter. As suggested in Chandrasekaran et al. (2009), the penalty term is reformulated as λ(γ ||S|| 1 + (1 − γ )||L|| * ) with γ ranging from 0 to 1 instead of (0, +∞). By analogy to the AUC metric, we report the "volume under the surface" (VUS) which accounts for the effect of both regularization parameters.
In Figure 2, the surfaces obtained for (d H = d Z = 2) and (d H = 5, d Z = 3) are plotted. They show that the suggested approach is less sensitive to γ than LR+S, thus making it easier to pick a sensible value in real-world applications. Figure 2(b) illustrates what happens when both methods offer comparable performances according to Figure 1 (which is obtained by choosing γ perfectly): compared to LSCGGM, there are actually very few values of γ for which LR+S achieves its best AUC. Here, only two of the 16 possible surface plots are shown, but Figure 2(c) indicates that LSCGGM is less sensitive to this tuning parameter across all simulation designs, as measured by the VUS. In particular, we have consistently observed that upper-end of the acceptable range for γ is higher for LSCGGM than LR+S. The next simulations illustrate the implications of this property.
In these simulations, our main concern was to illustrate how methods differ in terms of identifiability and consistency. Setting p and m to a relatively small value (32) made it possible to capture most scenarios with only 16 graphical structures. In the supplementary materials, we simulate from larger graphs (p = m = 2 7 = 128, n = 3000) and obtain results that are similar to the ones showed here. We also report the estimation errors for the other L * along with the precision/recall curves for S * ZX . Finally, we look at the effect of choosing γ using cross-validation. In the next section, we show how one can select λ and γ when some control over the number of falsely discovered edges is expected.

Application: Using Genetic Information to Detect Relationships Between Human Metabolites
To illustrate the value of our new approach, we now apply it to a dataset combining human metabolite levels and genetic markers. Here, metabolites play the role of the variables indexed by X while genetic variants are the inputs, Z. For comparison purposes, we also report the results obtained with the low-rank plus sparse method (LR+S) 3 .

The Avon Longitudinal Study of Parents and Children (ALSPAC)
The Avon Longitudinal Study of Parents and Children (ALSPAC) is a cohort study of children born in the county of Avon during 1991 and 1992 (Boyd et al. 2012;Fraser et al. 2012 The data at our disposal contain genetic and phenotypic measurements on approximately 8,000 children and their mothers. We first performed our entire analysis on the children's cohort (called "Child cohort" throughout) and then independently applied the same procedure to the mothers' cohort (Mother cohort). We modeled the levels of 39 metabolites. Measurements for all 39 variables were available without missing data for 5242 children and 2770 mothers. In each cohort, independent genetic variants were selected based on their predictive power with respect to any of the 39 traits under study: 133 and 44 variants were selected in the Child and Mother cohorts, respectively. Metabolite levels being continuous variables, they were quantile normalized and standardized. Genotypes, on the other hand, were encoded as ternary variables (0/1/2).

Methods
Since both the suggested approach (LSCGGM) and the LR+S method have two tuning parameters (λ, γ ), some procedure is required in order to set these parameters to appropriate values. As shown by both theoretical results and simulations, solutions are expected to be identical for a range of values of γ . Consequently, we do not select a single value of γ but consider instead 30 values within the range (0.02, 0.98) 5 . To each γ corresponds a regularization path: a graph along each path is selected using "pointwise" complementary pairs stability  The penalty is parameterized as λ(γ ||S|| 1 + (1 − γ )||L|| * ), so that γ ∈ (0, 1). selection (Meinshausen and Bühlmann 2010;Shah and Samworth 2013). Following the approach used in Meinshausen and Bühlmann (2010), the threshold on the inclusion probabilities is chosen by requiring that the expected number of falsely discovered edges be at most one: E(V ) ≤ 1 (using their notations). Thus, for each method and each cohort we obtain a collection of 30 graphical structures.
In order to measure how similar two graphical structures are, we consider their edge sets. For any pair of undirected graphs G 1 = (V 1 , E 1 ), G 2 = (V 2 , E 2 ), we define their similarity by their Jaccard Index This measure has two uses: (1) it makes it possible to select γ by measuring how the estimates relate to each other as γ varies from 0 to 1; (2) it allows us to measure how well the findings are replicated across cohorts. Another important step is assessing the biological relevance of the estimates using an external source of information. We used ChEBI: an ontology of small chemical entities of biological interest (Hastings et al. 2012). We manually matched all 39 metabolites to their ChEBI IDs and annotated them using the ontology. Using such annotations, one can compute an "enrichment statistic" reflecting whether a given graph contains edges between related metabolites more often than would be expected in a random graph with a similar topology (such a graph has an expected statistic of 1). We defer the definition of this statistic to the supplementary materials but remark that this method is close to the ontology analyses frequently encountered in computational biology . By randomly permuting the annotations, empirical p-values for this statistic can also be computed.

Results
First, we can ask how sensitive the estimates are to the tuning parameter γ . Indeed, as pointed out earlier, one would expect to see a "stable region": a range of values of γ for which there is little variation. One would typically select a graph estimated with a γ within this region. LetĜ (γ ) LSCGGM,Ch (resp.Ĝ (γ ) LR+S,Ch ) denote the graph returned by LSCGGM (resp. LR+S) for a given value of γ in the Child cohort. For every pair (γ 1 , γ 2 ), Figure 3(a) shows how similar the estimates are to each other (as computed by J(Ĝ (γ 1 ) LSCGGM,Ch ,Ĝ (γ 2 ) LSCGGM,Ch )). In the range 0.6 ≤ γ 1 , γ 2 ≤ 0.9, they are very close to each other. For small values of γ , the graphical structures returned by LSCGGM vary smoothly with γ . The regime γ ≤ 0.05 corresponds to the case in which the rank of the latent component is 0: LSCGGM behaves like a sparse conditional graphical model. Similar figures can be found for the LR+S method and the Mother cohort in the supplementary materials.
Having established that both methods exhibit a stable region, we look at how close the estimates found in these regions are. To that end, we plot J(Ĝ (γ 1 ) LSCGGM,Ch ,Ĝ (γ 2 ) LR+S,Ch ) for all pairs (γ 1 , γ 2 ) (Figure 3(b)). For small values of γ , LR+S and LSCGGM appear indistinguishable. However, for γ 1 , γ 2 > 0.5 their Jaccard Index drops to reach values around 0.3-0.4. But the range γ > 0.5 covers precisely the stable regions of both LSCGGM and LR+S, thus indicating that the methods' "best guesses" are different. Figure 4 shows in what way the graphs found in those stable regions differ, with LR+S inferring more connections between amino-acids. Here again, a similar result was obtained in the Mother cohort (see suppl. mat.). The supplementary materials also contain the full name of the metabolites being modeled.
Given that two cohorts are at our disposal, one way of assessing the quality of our results is to look at how well they replicate across datasets. In Figure 5(a), we plot the similarity between graphs estimated at the same value of γ (see suppl. mat. for a plot of this similarity for all possible pairs γ 1 , γ 2 .). First, it can be seen that higher replication values are achieved in the stable regions of their respective methods, with Jaccard Indices at 0.6 or above. We also see that LSCGGM's edge set replicates better than LR+P's. Moreover, the suggested estimator retrieves more edges under the condition E(V ) ≤ 1 (see suppl. mat.).
Finally, we use the "enrichment statistic" defined earlier. In our attempt to assess the quality of our estimates and their biological relevance, this metric is useful as it makes it possible to score graphs using an external source of information.   For each value of γ and each method, we plot the similarity between the estimate obtained in one cohort against the one obtained in the other. We limit ourselves to values of γ for which the estimates in both cohorts comport  edges or more. (b) Enrichment statistic, as a function of the tuning parameter γ . the dataset, higher values are achieved within the stable regions of their respective methods. Just like in the case of the replication measure, LSCGGM achieves the highest values. Given that the Child cohort contains twice as many samples as the Mother cohort, it is surprising to observe better performances in the Mother dataset. This might be due to the fact that this cohort is more homogeneous: there are women only, measurements were taken the same number of months after pregnancy, etc.

Discussion
We discussed the problem of estimating a conditional Gaussian graphical model in the presence of latent variables. Building on the framework introduced by the authors of Chandrasekaran, Parrilo and Willsky (2012), we suggested an estimator which decomposes the parameters of a sparse conditional Gaussian graphical model into the sum of a low-rank and a sparse matrix. Among other theoretical results, we established that the proposed approach is well-behaved in the high-dimensional regime. Through simulations and an application to a modern dataset comprising genetic and metabolic measurements, we compared the performances of this approach to alternative methods. In particular, we showed how such a conditional graphical model leads to better replication of the results across cohorts and to estimates that are more biologically relevant.
The rise of high-throughput genetics, along with progress in data linkage, biobanking and functional genomics projects, has dramatically increased the number of datasets that include both genetic and multivariate phenotypic data. The data application we present in this article, using genotype data to draw biological conclusions about the relationships between human traits, is thus becoming one of the most rapidly growing statistical challenges in human genetics. Conditional graphical models are particularly well-suited to such problems as they rely on an assumption we know to be true (namely, that genotype impacts phenotype and not vice versa). Moreover, genetic measurements are discrete in nature and it is therefore difficult to model them alongside continuous measurements. To the best of our knowledge, there are no approaches capable of learning a joint distribution over continuous and discrete data in the presence of latent variables.
When it comes to lasso-type estimators, choosing an appropriate value of the tuning parameters can also be challenging. In simulations, our method seems to be less sensitive to the value of the tuning parameter γ , which makes it easier to set it to a suitable value in real life applications. Moreover, the use of (complementary pairs) stability selection makes the estimates less sensitive to the value of λ while providing some form of error control.
Another limitation of such estimators comes from the fact that consistency/identifiability conditions are highly likely to be violated in real-world applications. While this is true, a more realistic take is to regard our method as a means to generate "causal" hypotheses from a high-dimensional dataset. Paired with stability selection, such an approach can realistically be used to generate a high-quality set of putative causal relationships that can then further be investigated using hypothesis testing driven approaches (e.g., instrumental variables). As shown in our application, this is an achievable goal. Naturally, the method suggested here also suffers from a number of limitations and more work is required. For example, assuming that the latent variables are normally distributed appears quite restrictive when compared to the flexibility offered by instrumental variable methods. The question of learning discrete graphical models is also important, but it is not yet clear how the present work can be extended to such models.