Sharp-SSL: Selective high-dimensional axis-aligned random projections for semi-supervised learning

We propose a new method for high-dimensional semi-supervised learning problems based on the careful aggregation of the results of a low-dimensional procedure applied to many axis-aligned random projections of the data. Our primary goal is to identify important variables for distinguishing between the classes; existing low-dimensional methods can then be applied for final class assignment. Motivated by a generalized Rayleigh quotient, we score projections according to the traces of the estimated whitened between-class covariance matrices on the projected data. This enables us to assign an importance weight to each variable for a given projection, and to select our signal variables by aggregating these weights over high-scoring projections. Our theory shows that the resulting Sharp-SSL algorithm is able to recover the signal coordinates with high probability when we aggregate over sufficiently many random projections and when the base procedure estimates the whitened between-class covariance matrix sufficiently well. The Gaussian EM algorithm is a natural choice as a base procedure, and we provide a new analysis of its performance in semi-supervised settings that controls the parameter estimation error in terms of the proportion of labeled data in the sample. Numerical results on both simulated data and a real colon tumor dataset support the excellent empirical performance of the method.

A common feature of contemporary semi-supervised learning problems is high-dimensionality, since we may record many covariates having a possible association with the labels corresponding to different observations. This represents a significant challenge, as can be seen by considering a simple two-class problem with more covariates than observations. For any given assignment of class labels, if no subset of $n_0$ observations lies in an $(n_0 - 2)$-dimensional affine space, then we can find hyperplanes with orthogonal normal vectors, each of which achieves zero training error (in other words, they perfectly separate the classes). Nevertheless, even in the simple setting where the true Bayes decision boundary is linear, many such hyperplanes may be little better than a random guess on test data.
An appealing approach to tackling high-dimensionality is via random projections into lower-dimensional spaces. Such projections may almost preserve the pairwise distances between observations, as seen from the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984; Dasgupta and Gupta, 2003). Moreover, in cases where we have reason to believe that only a relatively small proportion of the variables recorded are relevant for the learning task, we can choose our random projections to be axis-aligned in order to preserve this structure. A third benefit is the possibility of aggregating results over multiple random projections, though this must be done with care so as to avoid noise accumulation. These attractions have meant that random projections have now been employed in many high-dimensional statistical problems, including precision matrix estimation (Marzetta, Tucci and Simon, 2011), two-sample mean testing (Lopes, Jacob and Wainwright, 2011), classification (Durrant and Kabán, 2015; Cannings and Samworth, 2017), (sparse) principal component analysis (Yang et al., 2021; Gataric, Wang and Samworth, 2020), linear regression (Thanei, Heinze and Meinshausen, 2017; Slawski, 2018; Dobriban and Liu, 2019; Ahfock, Astle and Richardson, 2021), clustering (Dasgupta, 1999; Fern and Brodley, 2003; Han and Boutin, 2015; Yellamraju and Boutin, 2018; Anderlucci, Fortunato and Montanari, 2022) and dimensionality reduction (Bingham and Mannila, 2001; Reeve, Kabán and Bootkrajang, 2022). See Cannings (2021) for a review of recent developments in the area.
In this paper, we propose a new method, called Sharp-SSL (short for Selective high-dimensional, axis-aligned random projections for Semi-Supervised Learning). Our primary goal is to identify a small subset of variables that are particularly helpful for label assignment; existing low-dimensional methods can then be used to complete the learning task. To this end, we generate a large number of axis-aligned random projections, and apply a base learning procedure, such as a semi-supervised version of the Gaussian Expectation-Maximization (EM) algorithm, to our projected data. Motivated by the notion of a generalized Rayleigh quotient (see (2) below for a formal definition), and to avoid the noise accumulation issue mentioned above, we score the projections by computing the trace of the corresponding estimated whitened between-class covariance matrices. This enables us to assign an importance weight to each variable for a given projection, and we select our signal variables by aggregating these importance weights over the high-scoring projections. See Section 2 for a more detailed description of our methodology.
Section 3 is devoted to a theoretical analysis of our Sharp-SSL algorithm. We first show in Theorem 2 that provided the low-dimensional base learning procedure satisfies a guarantee on the proximity of the estimated whitened between-class covariance matrix to its population analogue, the corresponding high-dimensional semi-supervised learning algorithm can recover the signal coordinates with high probability when we aggregate over sufficiently many random projections. It turns out that both Linear Discriminant Analysis and an EM algorithm are examples of low-dimensional learning procedures that satisfy this proximity guarantee, as we prove in Theorems 3 and 6 respectively. The latter is particularly challenging, and one of the main novel contributions of our analysis is to provide a guarantee on the performance of a $d$-dimensional Gaussian EM algorithm in a semi-supervised setting. In particular, we control the parameter estimation error in terms of the proportion of labeled data in the sample, showing that with a sample size of $n$ it smoothly interpolates between the $(d/n)^{1/4}$ rate for unsupervised learning and the $(d/n)^{1/2}$ rate for fully-labeled data, up to logarithmic factors. An advantage of the modular approach to our analysis is that it illustrates the way in which the Sharp-SSL algorithm can be combined with different base learning algorithms to adapt to different problem settings and reflect the preferences of the practitioner.
In Section 4, we study the numerical performance of the Sharp-SSL algorithm. Our first goal, in Section 4.1, is to study the effect of the choices of input parameters to our method, which allows us to recommend sensible default choices for application in our subsequent comparisons. Section 4.2 presents the results of a simulation study involving the Sharp-SSL method, as well as five alternative approaches, on high-dimensional clustering tasks (since not all of the competing methods are able to leverage partial label information). We find that the Sharp-SSL algorithm is able to attain a misclustering rate very close to that of the optimal Bayes classifier, even with only around 50 observations per cluster, in settings where these alternative techniques may perform poorly. In Section 4.3, we investigate the extent to which the different versions of the Sharp-SSL method are able to leverage partial label information. The results here are consistent with the phase transition phenomenon articulated by our theory. Finally, in Section 4.4, we apply the Sharp-SSL algorithm, as well as the other methods from our simulation study, on a colon tumor dataset, where we withhold the true labels from the algorithms in order to assess performance. Our analysis supports the ability of the Sharp-SSL algorithm to identify signal coordinates (genes) that are useful for identifying patients with and without tumors.
In the broader literature on high-dimensional learning problems, a large number of methods have been developed to leverage sparse low-dimensional structures for both clustering (Witten and Tibshirani, 2010; Azizyan, Singh and Wasserman, 2013; Wasserman, Azizyan and Singh, 2014; Azizyan, Singh and Wasserman, 2015; Jin and Wang, 2016; Verzelen and Arias-Castro, 2017; Löffler, Wein and Bandeira, 2022; Löffler, Zhang and Zhou, 2021) and classification (Cai and Liu, 2011; Witten and Tibshirani, 2011; Mai, Zou and Yuan, 2012; Cai and Zhang, 2019). These methods are not designed for partially-labeled (semi-supervised) settings. Another common approach is to project the data into the span of the top few principal components, and run a standard low-dimensional method such as $k$-means clustering or the EM algorithm (Butler et al., 2018). This approach can fail if the directions of largest variation in the data are not aligned with the directions separating the clusters. Finally, recent developments in other aspects of semi-supervised learning include self-training (Oymak and Gulcu, 2020), mean estimation (Zhang, Brown and Cai, 2019), choice of $k$ in $k$-nearest neighbour classification (Cannings, Berrett and Samworth, 2020) and linear regression (Chakrabortty and Cai, 2018).
Proofs of all of our results are provided in Section 5, and we conclude this introduction with some notation used throughout the paper. We write $\mathcal{S}^{d \times d}$ for the set of $d$-dimensional symmetric matrices, write $\mathcal{S}^{d \times d}_+$ for the subset of those that are invertible, and write $\mathcal{S}^{d \times d}_{K-1}$ for the subset of matrices in $\mathcal{S}^{d \times d}$ of rank at most $K-1$. We write $\mathbb{R}^{d \times d}$ for the set of $d$-dimensional matrices. For $p \ge d$, let $\mathbb{O}^{p \times d}$ denote the set of $p \times d$ matrices with orthonormal columns. The Euclidean norm is denoted by $\|\cdot\|$, and the operator norm of a matrix is denoted by $\|\cdot\|_{\mathrm{op}}$, so that $\|A\|_{\mathrm{op}} := \sup_{\{x : \|x\| = 1\}} \|Ax\|$. Given two sequences $(a_n)$ and $(b_n)$, we write $a_n \lesssim b_n$ when there exists a universal constant $C > 0$ such that $a_n \le C b_n$, and, given an additional problem parameter $R$, we write $a_n \lesssim_R b_n$ when there exists $C > 0$, depending only on $R$, such that $a_n \le C b_n$.
For any set $S$ and $d \le |S|$, we write $\binom{S}{d} := \{A \subseteq S : |A| = d\}$. If $S \subseteq \mathbb{R}^d$, we define $\operatorname{sargmax} S$ to be the smallest element of the argmax in the lexicographic order. For a positive integer $k$, we define $[k] := \{1, \ldots, k\}$.

The Sharp-SSL algorithm
In this section, we describe in detail the Sharp-SSL algorithm for $K$-class semi-supervised learning, with $K \ge 2$. We aim to provide a unified treatment of clustering, semi-supervised learning and classification. To this end, we assume that for $i \in [n]$, the observation $x_i \in \mathbb{R}^p$ has a true label $y_i^* \in [K]$, but it may be the case that we do not observe $y_i^*$. Instead, we assume that our observed label $y_i$ takes values in $[K] \cup \{0\}$, where $y_i := y_i^*$ when the true class label is observed, and $y_i := 0$ otherwise. Thus, our data can be regarded as $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^p \times ([K] \cup \{0\})$, and our goal is to construct a data-dependent classifier, i.e. a Borel measurable function with the interpretation that $C\bigl(x; (x_1, y_1), \ldots, (x_n, y_n)\bigr)$ is the predicted class of $x \in \mathbb{R}^p$. To motivate our Sharp-SSL algorithm, it is instructive first to consider a canonical Gaussian classification problem, where our data can be regarded as $n$ independent realizations of a pair $(X, Y)$ taking values in $\mathbb{R}^p \times [K]$, with prior probability $\pi_k := \mathbb{P}(Y = k)$ and class-conditional distribution $X \mid \{Y = k\} \sim N_p(\nu_k, \Sigma_w)$ for the $k$th class, where $\Sigma_w$ denotes the common within-class covariance matrix. Writing $\nu := \sum_{k=1}^K \pi_k \nu_k$, let
$$\Sigma_b := \sum_{k=1}^K \pi_k (\nu_k - \nu)(\nu_k - \nu)^\top \qquad (1)$$
denote the between-class covariance matrix, and consider $D \in \mathbb{O}^{p \times (K-1)}$ with column space spanned by $\Sigma_w^{-1}(\nu_1 - \nu), \ldots, \Sigma_w^{-1}(\nu_K - \nu)$. A direct calculation with the class-conditional densities shows that the likelihood ratios, and hence the Bayes classifier $x \mapsto \operatorname{argmax}_{k \in [K]} \mathbb{P}(Y = k \mid X = x)$, only depend on $x$ through $D^\top x$. Thus, for the purposes of classification, no signal would be lost (and the noise would be reduced) if $X$ were replaced with $D^\top X$.
In high-dimensional settings with $p \gg n$, the matrix $\Sigma_w^{-1}$ is not consistently estimable in general, but we can nevertheless make progress if the vectors $\Sigma_w^{-1}(\nu_1 - \nu), \ldots, \Sigma_w^{-1}(\nu_K - \nu)$ are sparse. In other words, writing $S_0$ for the union of the sets of coordinates for which these vectors are non-zero, we suppose that $|S_0| \ll p$; this is a very common assumption in high-dimensional LDA (e.g. Cai and Liu, 2011; Witten and Tibshirani, 2011; Mai, Zou and Yuan, 2012; Cai and Zhang, 2019).
In such a setting, the column space of $D$ has a sparse basis, so it is natural to consider projecting the data onto a small subset of its coordinates. For $d \in [p]$, define the set of axis-aligned projection matrices $\mathcal{P}_d := \{P \in \{0,1\}^{d \times p} : P P^\top = I_d\}$, i.e. the set of binary $d \times p$ matrices with orthonormal rows. By the argument above, if $d \ge |S_0|$ then there exists $P^* \in \mathcal{P}_d$ such that the error of the Bayes classifier is unchanged by projecting the data along $P^*$. In practice, it would typically be computationally too expensive to enumerate all $p(p-1)\cdots(p-d+1)$ axis-aligned projections. Instead, we consider a randomly chosen subset of projections within $\mathcal{P}_d$. An axis-aligned projection chosen uniformly at random is unlikely to capture all the signal coordinates $S_0$, but by aggregating over a carefully-chosen subset of these random projections, we can nevertheless recover the set of signal coordinates under suitable conditions; see Theorem 2 below. To describe our method for choosing good projections, for $V \in \mathbb{O}^{p \times d}$, we define the generalized Rayleigh quotient along $V$ by
$$J(V; \Sigma_b, \Sigma_w) := \operatorname{tr}\bigl\{(V^\top \Sigma_w V)^{-1}(V^\top \Sigma_b V)\bigr\}. \qquad (2)$$
Proposition 1 below motivates seeking to maximize the generalized Rayleigh quotient over projections by showing that the column space of any maximizer of $J(V; \Sigma_b, \Sigma_w)$ over $V \in \mathbb{O}^{p \times d}$ must contain the column space of $D$.
Proposition 1. Let $K \ge 2$ and $d \ge K - 1$. Assume that the convex hull of $\nu_1, \ldots, \nu_K$ is $(K-1)$-dimensional, and let $V^* \in \operatorname{argmax}_{V \in \mathbb{O}^{p \times d}} J(V; \Sigma_b, \Sigma_w)$. Then the column space of $V^*$ contains the eigenspace corresponding to the $K - 1$ non-zero eigenvalues of $\Sigma_w^{-1}\Sigma_b$ (even though $\Sigma_w^{-1}\Sigma_b$ is not guaranteed to be symmetric, it is similar, i.e. conjugate, to the symmetric matrix $\Sigma_w^{-1/2}\Sigma_b\Sigma_w^{-1/2}$, so has real eigenvalues and eigenvectors), which is equal to the space spanned by $\Sigma_w^{-1}(\nu_1 - \nu), \ldots, \Sigma_w^{-1}(\nu_K - \nu)$.

Based on Proposition 1, a natural conceptual approach to maximizing the generalized Rayleigh quotient is to compute the leading $(K-1)$-dimensional eigenspace of $\Sigma_w^{-1}\Sigma_b$. This strategy, however, runs into difficulties when we replace these population quantities with their sample versions in the setting of the opening paragraph of this section. More precisely, writing $\hat\Sigma_w$ and $\hat\Sigma_b$ for the sample versions of the within-class and between-class covariance matrices respectively, the matrix $\hat\Sigma_w$ is not invertible whenever $p > n$. Fortunately, though, this issue can be resolved by working with the projected data, as long as we choose $d \le n - K$: the projected data $\{P X_i : i \in [n]\}$ have within-class covariance matrix $P \Sigma_w P^\top \in \mathbb{R}^{d \times d}$ and between-class covariance matrix $P \Sigma_b P^\top \in \mathbb{R}^{d \times d}$, and, with probability one, the sample version $P \hat\Sigma_w P^\top$ is invertible.
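As a small illustration of the quantity we are targeting, for an axis-aligned projection that retains the coordinates in a set $S$, the generalized Rayleigh quotient in (2) reduces to the trace of a $d \times d$ whitened between-class matrix built from the corresponding submatrices. The following minimal R sketch makes this concrete; the function name and the toy example are ours, purely for illustration.

```r
# Generalized Rayleigh quotient (2) along an axis-aligned projection that keeps
# the coordinates in S: tr{ (Sigma_w[S,S])^{-1} Sigma_b[S,S] }.
rayleigh_axis <- function(S, Sigma_b, Sigma_w) {
  sum(diag(solve(Sigma_w[S, S, drop = FALSE], Sigma_b[S, S, drop = FALSE])))
}

# Toy example: with a single signal coordinate, projections containing it score higher.
p <- 10
Sigma_w <- diag(p)
nu <- c(2, rep(0, p - 1))                 # one signal coordinate
Sigma_b <- 0.25 * outer(nu, nu)           # rank-one between-class matrix
rayleigh_axis(c(1, 2), Sigma_b, Sigma_w)  # contains the signal coordinate
rayleigh_axis(c(3, 4), Sigma_b, Sigma_w)  # does not (returns 0)
```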
Returning to the general setting of the opening paragraph of this section, then, we seek projections $P$ with large $J(P; \hat\Sigma_b, \hat\Sigma_w) = \operatorname{tr}\bigl\{(P \hat\Sigma_w P^\top)^{-1}(P \hat\Sigma_b P^\top)\bigr\}$. To this end, for fixed $A, B \in \mathbb{N}$, we sample a set of projections $\{P_{a,b} : a \in [A], b \in [B]\}$ uniformly at random from $\mathcal{P}_d$. For each $a$ and $b$, we apply a low-dimensional base algorithm $\psi$
to the projected data $(P_{a,b} x_1, y_1), \ldots, (P_{a,b} x_n, y_n)$ to obtain an estimator $\hat Q_{a,b}$ of $(P_{a,b} \Sigma_w P_{a,b}^\top)^{-1}(P_{a,b} \Sigma_b P_{a,b}^\top)$, the whitened between-class covariance matrix of the projected data. We assume throughout for convenience that $\psi$ is permutation equivariant, in the sense that $\psi\bigl((\Pi z_1, y_1), \ldots, (\Pi z_n, y_n)\bigr) = \Pi\, \psi\bigl((z_1, y_1), \ldots, (z_n, y_n)\bigr)\, \Pi^\top$ for every permutation matrix $\Pi \in \mathbb{R}^{d \times d}$. One choice of base algorithm is the labeled-data procedure of Algorithm 2 below; an EM-based alternative is given in Algorithm 3. For each $a \in [A]$, we then select $b^*(a) := \operatorname{sargmax}_{b \in [B]} \operatorname{tr}(\hat Q_{a,b})$. The main rationale for dividing the projections into $A$ groups and selecting one within each group, as opposed to selecting the $A$ projections with the largest values of $\operatorname{tr}(\hat Q_{a,b})$, is that, conditional on the original data, the selected projections are independent and identically distributed. This facilitates our theoretical analysis by enabling the application of concentration inequalities in the proof of Theorem 2. The diagonal entries of $\{\hat Q_{a,b^*(a)} : a \in [A]\}$ measure the importance of the projected variables for the semi-supervised learning task. These can then be converted into importance scores for the original variables by 'back-projecting' into the higher-dimensional space, i.e. by forming the vector $\hat w = (\hat w_1, \ldots, \hat w_p)^\top \in \mathbb{R}^p$ given by $\hat w_j := A^{-1} \sum_{a \in [A]} \bigl(P_{a,b^*(a)}^\top \hat Q_{a,b^*(a)} P_{a,b^*(a)}\bigr)_{j,j}$. Finally, we rank the variables by their importance scores, and our estimate $\hat S$ of the set of signal coordinates is given by the $\ell$ largest entries in $\hat w$, breaking ties arbitrarily if necessary, where $\ell \in [p]$ is specified by the practitioner. Pseudocode for the Sharp-SSL procedure is given in Algorithm 1.
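To make the aggregation step concrete, the following is a minimal R sketch of the main loop of Algorithm 1. It is not the authors' implementation: the function name sharp_ssl, its default arguments and the base_fn interface (a user-supplied base procedure returning a $d \times d$ estimate of the whitened between-class covariance matrix of the projected data) are illustrative assumptions.

```r
# Minimal sketch of the Sharp-SSL loop (Algorithm 1).  base_fn(Z, y) should
# return a d x d estimate of the whitened between-class covariance matrix of
# the projected data Z; y uses 0 to encode an unobserved label.
sharp_ssl <- function(x, y, base_fn, A = 150, B = 75, d = 5, ell = 5) {
  p <- ncol(x)
  w <- numeric(p)                        # importance scores for the p variables
  for (a in seq_len(A)) {
    best_score <- -Inf; best_S <- NULL; best_Q <- NULL
    for (b in seq_len(B)) {
      S <- sample(p, d)                  # axis-aligned projection: keep coords S
      Q <- base_fn(x[, S, drop = FALSE], y)
      if (sum(diag(Q)) > best_score) {   # trace criterion; first maximizer kept
        best_score <- sum(diag(Q)); best_S <- S; best_Q <- Q
      }
    }
    w[best_S] <- w[best_S] + diag(best_Q) / A   # back-project diagonal entries
  }
  list(w = w, S_hat = order(w, decreasing = TRUE)[seq_len(ell)])
}
```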
After applying Algorithm 1 to obtain an estimated set $\hat S$ of signal variables, we can then apply any existing semi-supervised learning method for low-dimensional data with input $(P_{\hat S}\, x_i, y_i)_{i \in [n]}$, where $P_{\hat S}$ denotes the projection onto the coordinates in $\hat S$.
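A hypothetical end-to-end usage of the sketch above, on simulated data and using the labeled-data base procedure base_labeled sketched under 'Base learning methods' below, might look as follows; the final low-dimensional step can be any of the methods just mentioned.

```r
# Hypothetical end-to-end usage of the sketches in this section on toy data:
# select coordinates with sharp_ssl(), then work with the selected columns.
set.seed(1)
n <- 200; p <- 100
y_true <- sample(2, n, replace = TRUE)
x <- matrix(rnorm(n * p), n, p)
x[, 1:3] <- x[, 1:3] + 1.5 * (2 * y_true - 3)   # signal in coordinates 1-3
y_obs <- ifelse(runif(n) < 0.1, y_true, 0)      # only ~10% of labels observed
fit <- sharp_ssl(x, y_obs, base_fn = base_labeled, d = 3, ell = 3)
fit$S_hat                                       # hopefully contains 1, 2, 3
x_selected <- x[, fit$S_hat, drop = FALSE]
# final class assignment on x_selected, e.g. via LDA, the EM algorithm or mclust
```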

Base learning methods
Algorithm 1 relies on a base learning method for low-dimensional data to estimate the projected whitened between-class covariance matrix from the projected data. When all or almost all of the input data are labeled, we can use the procedure outlined in Algorithm 2, which ignores any unlabeled data, for this purpose. On the other hand, when we have a substantial amount of unlabeled data, Algorithm 2 may be inaccurate. In such circumstances, it may be preferable to use Algorithm 3, which runs an Expectation-Maximization (EM) procedure to predict the unobserved labels and subsequently estimate the whitened between-class covariance matrix. More precisely, from $M$ random initializations of the cluster means and the within-class covariance matrix, Algorithm 3 uses the EM algorithm to update these quantities, and thereby compute the whitened between-cluster sample covariance matrix estimators; the output $\hat Q$ is the estimator that is in best agreement with the results from the other EM runs, and this is made precise in (6).

Algorithm 1: Sharp-SSL: clustering and semi-supervised learning via ensembles of axis-aligned random projections.
The algorithm also allows the practitioner to incorporate prior knowledge about the true cluster means and within-cluster covariance matrices, both through optimizing over a restricted constraint set $\mathcal{C}$ in the M-step of the EM algorithm, and through the choice of a distribution supported on $\mathcal{C}$ for the initialization of these quantities. An alternative to the EM algorithm for unsupervised learning would be to apply $k$-means clustering as a base procedure. Previous studies have suggested that these approaches have comparable empirical performance (e.g. de Souto et al., 2008; Rodriguez et al., 2019, and references therein), but the EM algorithm is more amenable to theoretical analysis in our setting.
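The following R sketch illustrates a labeled-data base procedure in the spirit of Algorithm 2: it discards unlabeled observations, forms class means and pooled within-class and between-class covariance matrices, and returns the plug-in whitened between-class covariance matrix. The function name and exact normalizations are our illustrative assumptions rather than the authors' code.

```r
# Labeled-data base procedure in the spirit of Algorithm 2.  Z is the n x d
# projected data matrix and y the observed labels in {0, 1, ..., K}, with 0
# meaning unlabeled; unlabeled rows are simply ignored.
base_labeled <- function(Z, y, K = max(y)) {
  Z <- as.matrix(Z); d <- ncol(Z)
  Zl <- Z[y != 0, , drop = FALSE]; yl <- y[y != 0]
  n  <- nrow(Zl)
  nk <- integer(K); mu <- matrix(0, K, d)
  for (k in seq_len(K)) {
    idx <- yl == k
    nk[k] <- sum(idx)
    if (nk[k] > 0) mu[k, ] <- colMeans(Zl[idx, , drop = FALSE])
  }
  mu_bar <- colSums((nk / n) * mu)               # weighted overall mean
  Sw <- matrix(0, d, d); Sb <- matrix(0, d, d)
  for (k in seq_len(K)) {
    if (nk[k] > 0) {
      Zc <- sweep(Zl[yl == k, , drop = FALSE], 2, mu[k, ])
      Sw <- Sw + crossprod(Zc)                   # within-class scatter
      mk <- mu[k, ] - mu_bar
      Sb <- Sb + nk[k] * outer(mk, mk)           # between-class scatter
    }
  }
  solve(Sw / n, Sb / n)   # needs enough labeled data for Sw to be invertible
}
```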

Results for the high-level algorithm
In this subsection, we consider independent triples $(X_i, Y_i^*, Y_i)$ for $i \in [n]$. We recall that $Y_i^*$ denotes the true label of the $i$th observation, and that $Y_i := Y_i^*$ if the $i$th label is observed and $Y_i := 0$ otherwise. For $k \in [K]$, let $\pi_k$ and $\nu_k^*$ denote the prior probability and the cluster mean of the $k$th cluster respectively, let $\nu^* := \sum_{k=1}^K \pi_k \nu_k^*$ denote the weighted cluster mean, and let $\Sigma_w := \mathrm{Cov}(X_1 \mid Y_1^* = k)$ denote the common within-cluster covariance matrix. With the between-cluster covariance matrix $\Sigma_b$ from (1), our goal is to estimate the set of signal coordinates $S_0$, and we write $s_0 := |S_0|$. Our first main theoretical result shows that if the base algorithm is accurate on each low-dimensional projection and $A$ is large, then with high probability, all signal coordinates are selected.

Algorithm 2: Base learning using only labeled data. For each $k \in [K]$, it sets $n_k := |\{i : y_i = k\}|$ and $n := \sum_{k=1}^K n_k$, computes $\hat\mu_k := n_k^{-1} \sum_{i : y_i = k} z_i$ (with the convention that $\hat\mu_k := 0$ if $n_k = 0$), and then computes the within-class and between-class covariance matrices as in (3).
Theorem 2. Define $\gamma_{\min} := \min_{j \in S_0} (\Sigma_w^{-1}\Sigma_b)_{j,j}$ and $\gamma_{\max} := \max_{j \in S_0} (\Sigma_w^{-1}\Sigma_b)_{j,j}$. Let $\hat S$ be the output of Algorithm 1 with input $K$, $p$, $(X_1, Y_1), \ldots, (X_n, Y_n)$, $A$, $B$, $d \ge s_0$, $\ell \ge s_0$ and permutation equivariant base procedure $\psi$. Then, provided the base procedure satisfies the accuracy condition (7) and $A$ is sufficiently large, we have $S_0 \subseteq \hat S$ with high probability.

In fact, we can see from the proof of Theorem 2 that the following stronger conclusion holds: for any realization $(x_i, y_i)_{i \in [n]}$ of the data satisfying the deterministic analogue of the accuracy bound in (7), the same conclusion holds conditionally on the data. Note here that, after conditioning on the data, the probability is taken over the randomness in the projections. An attraction of Theorem 2 is its generality, and in particular the fact that we do not impose strong distributional assumptions; we simply require control of $\varepsilon$ in (7). The price we pay for this generality is that the probability bound may be loose in particular cases; for example, the bound holds even with $B = 1$, though in practice we would expect it to improve as $B$ increases.
In Algorithm 3, we compute $\hat\mu_{\mathrm{tot}} := n^{-1}\sum_{i=1}^{n}\sum_{k=1}^{K} L_{i,k}\,\hat\mu_k$ and the corresponding between-class covariance matrix, which forms the output.

Theory for base learning using labeled data
In this subsection, we demonstrate how the high-level result in Theorem 2 can be used to derive performance guarantees for a high-dimensional classification algorithm that uses the Sharp-SSL procedure in Algorithm 1 in conjunction with the low-dimensional base method described in Algorithm 2 for estimating the projected whitened between-class covariance matrix. The following theorem provides uniform control of the output of Algorithm 2 for all axis-aligned $d$-dimensional projected datasets.
Theorem 3. Fix $\varepsilon \in (0, 1]$, $K \in \{2, 3, \ldots\}$, $p, d \in \mathbb{N}$ with $p \ge d$ and $n \in \mathbb{N}$ with $n \ge Kd + 1$. Suppose that $(X_1, Y_1), \ldots, (X_n, Y_n)$ are independent and identically distributed pairs from the Gaussian model of Section 2, with a distributional regularity condition involving some $R_1 > 0$, and that $\Sigma_w$ is diagonal and well-conditioned in the sense that $\max\{\|\Sigma_w\|_{\mathrm{op}}, \|\Sigma_w^{-1}\|_{\mathrm{op}}\} \le R_2$ for some $R_2 > 0$. Then, if the sample size condition (9) holds, with probability at least $1 - \varepsilon$ the output of Algorithm 2 is uniformly close to its population analogue over all axis-aligned $d$-dimensional projections.

The sample size condition (9) can be restated as $n \gtrsim_{R_1, R_2} d \log p + \log(1/\varepsilon) + K$, so may be regarded as mild. Regarding $K$ as a constant, Theorem 3 confirms that the uniform control of Algorithm 2 is at the parametric rate, up to a logarithmic factor. The following corollary then follows immediately by combining Theorems 2 and 3.
Corollary 4. Fix $\varepsilon \in (0, 1]$. Suppose that the conditions of Theorem 3 hold, and moreover that $\lambda_{\min}(\Sigma_b) \ge 1/R_3$ for some $R_3 > 0$. Then there exist $C_1, C_2 > 0$, depending only on $R_1$, $R_2$ and $R_3$, such that if $A$ and $B$ are sufficiently large, then Algorithm 1 with the base procedure $\psi$ from Algorithm 2 recovers all of the signal coordinates among its top $s_0$ output variables with probability at least $1 - \varepsilon$.

Thus, under the conditions of Corollary 4, the Sharp-SSL algorithm can, with high probability, select the signal variables in the top $s_0$ output variables, provided that the number $A$ of groups of random projections is large by comparison with $p^2$. In other words, the algorithm reduces the problem to a low-dimensional one, for which standard learning techniques can be applied. The guarantees for these methods (e.g. Anderson, 2003, Theorem 6.6.1) can then be combined on the high-probability event of Corollary 4 to establish theoretical results for the full procedure.

Theory for semi-supervised based learning
When the proportion of labeled data is low, Algorithm 2 may be inaccurate when used as the base procedure in Algorithm 1. The aim of this subsection, therefore, is to study the base procedure of Algorithm 3, which is able to leverage both the labeled and unlabeled data via an EM algorithm to estimate the whitened between-class covariance matrix for each projected data set. Our analysis builds on several recent breakthroughs in our understanding of the EM algorithm. This line of work includes Balakrishnan, Wainwright and Yu (2017), Daskalakis, Tzamos and Zampetakis (2017), Yan, Yin and Sarkar (2017), Dwivedi et al. (2020a), Dwivedi et al. (2020b), Davis, Diaz and Wang (2021), Ho et al. (2020), Ndaoud (2022), Wu and Zhou (2022) and Doss et al. (2023), all of which focus on the unsupervised case.
For simplicity, we will focus on the setting where independent and identically distributed $(X_1, Y_1^*), \ldots, (X_n, Y_n^*)$ are generated from a mixture of two Gaussians with opposite means and identity covariance matrix, as specified in (10). We assume that we observe $(X_1, Y_1), \ldots, (X_{n_L}, Y_{n_L}), X_{n_L+1}, \ldots, X_n$ for some $n_L \in \{0, \ldots, n\}$. In other words, we are given $n_L$ labeled observations and $n_U := n - n_L$ unlabeled ones. Thus, $n_L = 0$ corresponds to the fully unsupervised case, i.e. clustering, while $n_L = n$ corresponds to the supervised case, i.e. classification.

We first study the performance of the EM procedure after the covariates have been projected into a lower-dimensional space. In other words, for some fixed $P \in \mathcal{P}_d$, define $Z_i := P X_i$ for $i \in [n]$ and $\mu^* := P\nu^*$. In this setting, we have a single unknown parameter $\mu^*$ to estimate, and this can be achieved by applying Algorithm 3 to $(Z_i, Y_i)_{i \in [n]}$ with $K = 2$ and the constraint set
$$\mathcal{C} := \{(-\mu, \mu, I_d) : \mu \in \mathbb{R}^d\}. \qquad (11)$$
After initializing the EM algorithm at some fixed $(-\hat\mu^{(0)}, \hat\mu^{(0)}, I_d) \in \mathcal{C}$, for $t \in \mathbb{N}$, the $t$th iterate of the EM iteration described in (4) and (5) is $(-\hat\mu^{(t)}, \hat\mu^{(t)}, I_d)$, where $\hat\mu^{(t)}$ is given by the update (12); see Lemma 15. Since we allow $n_L = 0$, where $\mu^*$ is only identifiable up to sign, and since the between-class sample covariance matrix $\hat\Sigma_b$ computed in Algorithm 3 is equal to $\hat\Sigma_b = \hat\mu_1 \hat\mu_1^\top - \hat\mu_{\mathrm{tot}} \hat\mu_{\mathrm{tot}}^\top$, which is invariant to flipping the signs of $\hat\mu_1$ and $\hat\mu_2$ simultaneously, it is natural to consider a loss function for estimating $\mu^*$ that is invariant to this sign flip. Proposition 5 below provides a theoretical guarantee for this semi-supervised EM algorithm. For notational simplicity, we define $\gamma := n_L/n$, $\omega_0 := \sqrt{\{d \log n + \log(1/\delta)\}/n_U}$ and $\zeta_0 := \min\{\omega_0 \gamma^{-1/2}, \omega_0^{1/2}\}$ throughout this section. Thus, treating $d$ as a constant and ignoring polylogarithmic terms, $\omega_0$ is of order $n_U^{-1/2}$, and $\omega_0\gamma^{-1/2}$ is of order $n_L^{-1/2}$, which is the critical $\ell_2$-testing radius for distinguishing the means of two labeled Gaussian distributions with identity covariance using $n_L$ observations. On the other hand, as we show in Lemma 16, no test of the null hypothesis $H_0 : N_d(0, I_d)$ against the two-component mixture alternative can have both small size and high power when $\|\mu^*\| \le n^{-1/4}$.

Proposition 5. Fix $\delta \in (2e^{-n}, 1]$ and $r \ge 1$, and suppose that $\|\mu^*\| \le r$ and $\gamma < 1/2$. There exists $c > 0$, depending only on $r$, such that if $\omega_0 \le c$ and $n \ge 3$, then the following statements hold: (i) for any $\hat\mu^{(0)} \in \mathbb{R}^d$ with $\|\hat\mu^{(0)}\| \le r + 3$, we have with probability at least $1 - 2\delta$ a bound on the loss of the EM iterates in terms of $\|\mu^*\|$ and $\zeta_0$; (ii) in the strong signal regime, a faster convergence rate holds.

In order to interpret Proposition 5(i), consider the regime where $\|\mu^*\| \lesssim \zeta_0$. In this case, as discussed above, the two mixture components are essentially indistinguishable, and the bound reveals that the EM algorithm performs no worse than the trivial zero estimator, up to constant factors. On the other hand, part (ii) studies the more interesting regime where the two mixture components are distinguishable, and we establish a faster convergence rate for the EM algorithm in this strong signal regime.
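To illustrate the structure of the EM update (12) described above, the following is a minimal R sketch of a semi-supervised EM iteration for the symmetric two-component model, with labels coded as $\pm 1$ (and 0 for unlabeled) and the covariance held fixed at $I_d$: it blends a labeled-data average with a tanh-weighted unlabeled average. The function name and interface are illustrative assumptions, not the authors' implementation.

```r
# Sketch of the semi-supervised EM update for the symmetric two-component
# Gaussian mixture (labels coded +1/-1, 0 for unlabeled), covariance fixed
# at the identity.  Each iteration blends the labeled-data average with a
# responsibility-weighted (tanh) average over the unlabeled observations.
em_symmetric <- function(Z, y, mu0, iters = 200) {
  Z <- as.matrix(Z)
  lab <- y != 0
  nL <- sum(lab); nU <- sum(!lab); n <- nL + nU
  mu_bar <- if (nL > 0) colSums(y[lab] * Z[lab, , drop = FALSE]) / nL else 0
  mu <- mu0
  for (t in seq_len(iters)) {
    f <- if (nU > 0) {
      resp <- tanh(drop(Z[!lab, , drop = FALSE] %*% mu))   # E[Y_i | Z_i, mu]
      colSums(resp * Z[!lab, , drop = FALSE]) / nU
    } else 0
    mu <- (nL / n) * mu_bar + (nU / n) * f
  }
  mu
}
```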
The following theorem combines the two convergence regimes in Proposition 5 to derive a convergence guarantee for the estimated whitened between-class covariance matrix output by Algorithm 3. To state the result, recall the definition of $\mathcal{C}$ from (11). For any $\zeta > 0$, we write $U(\zeta)$ for the pushforward measure on $\mathcal{C}$ induced by $\mathrm{Unif}(\zeta\,\mathcal{S}^{d-1})$ under the map $\mu \mapsto (-\mu, \mu, I_d)$.
Theorem 6. Fix $\delta \in (2e^{-n}, 1]$ and $r \ge 1$, and suppose that $\|\mu^*\| \le r$ and $\gamma < 1/2$. There exists $c > 0$, depending only on $r$, such that if $\omega_0 \le \min\{c, (d \log n)^{-3}\}$ and $n \ge 108$, then the sequence of outputs of Algorithm 3 satisfies a high-probability operator norm guarantee for estimating the whitened between-class covariance matrix $\mu^*\mu^{*\top}$.

Finally in this section, we study the implications of Theorem 6 for the recovery of the signal coordinates in the semi-supervised learning setting. We write $\psi^{(M,T)}$ for the base procedure that takes $(z_i, y_i)_{i \in [n]}$ as input and returns the output of Algorithm 3 when run with these inputs together with $\mathcal{C}$, $\pi_{\mathcal{C}}$, $M$ and $T$. Let $S_0$ denote the set of coordinates where $\nu^* \in \mathbb{R}^p$ is non-zero, and let $s_0 := |S_0|$.
Corollary 7. Fix $\varepsilon \in (8e^{-n/2}, 1]$ and $r \ge 1$, and suppose that $\|\nu^*\| \le r$, $M \ge 50\log(4/\varepsilon) + 50 d \log p$ and $\gamma < 1/2$. Let $\nu^*_{\max} := \|\nu^*\|_\infty$ and let $\nu^*_{\min}$ denote the minimum absolute value of a non-zero component of $\nu^*$. There exist $C_1, C_2 > 0$, depending only on $r$, such that if $n \ge C_1 (d \log p)^6 \{d \log p + \log(1/\varepsilon)\}$ and $A$ and $B$ are sufficiently large, then, in the limit of many EM iterations, Algorithm 1 with base procedure $\psi^{(M,T)}$ recovers all of the signal coordinates among its top $s_0$ output entries with probability at least $1 - \varepsilon$.

Corollary 7 reveals in particular that, treating $\nu^*_{\max}$ and $\nu^*_{\min}$ as constants and under the stated sample size conditions, we again recover all of the signal coordinates in the top $s_0$ output entries, provided that $A$ is large by comparison with $p^2$. Thus, in this sense, we can achieve a similar guarantee to that provided by Corollary 4, though the number of groups of projections required for a high-probability guarantee in Corollary 7 may be significantly larger in settings where the ratio $\nu^*_{\max}/\nu^*_{\min}$ is large.

Numerical studies
Throughout this section, unless otherwise stated, data are sampled from an equal-probability normal mixture as follows: independently for $i \in [n]$, the label $Y_i^*$ is uniformly distributed on $[K]$ and, conditionally on $Y_i^* = k$, we have $X_i \sim N_p(\mu_k, \Sigma_w)$. The cluster means $\mu_1, \ldots, \mu_K$ are chosen to be $s_0$-sparse, and we define the signal-to-noise ratio (SNR) of the problem in terms of the pairwise distances between the cluster means. In our numerical studies, we slightly modify Algorithm 3 so that, instead of randomly initializing the cluster means and the covariance matrix, we use the output of hierarchical clustering to initialize the EM algorithm, as implemented in the mclust R package (Fraley and Raftery, 1998). This allows us to run Algorithm 3 with $M = 1$.
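As a concrete illustration, the following R sketch generates data of this form following the specification in the caption of Figure 2 below (three clusters with sparse means, supported on the first three coordinates, whose pairwise distances all equal SNR, and either isotropic or anisotropic $\Sigma_w$); the function name and argument choices are ours.

```r
# Sketch of the Figure 2 simulation design: an equal-probability K = 3 normal
# mixture with sparse means (pairwise distances all equal to SNR); requires p >= 3.
gen_mixture <- function(n, p, SNR, anisotropic = FALSE) {
  M <- rbind(c(1, 1, 0), c(-1, 0, 1), c(0, -1, -1))   # sparse mean directions
  a <- SNR / sqrt(sum((M[1, ] - M[2, ])^2))           # scale so distances = SNR
  mu <- cbind(a * M, matrix(0, 3, p - 3))
  y <- sample(3, n, replace = TRUE)
  noise <- matrix(rnorm(n * p), n, p)
  if (anisotropic) {
    V <- qr.Q(qr(matrix(rnorm(p * p), p, p)))         # random orthogonal matrix
    lam <- runif(p, 0, 2)
    noise <- noise %*% (V %*% (sqrt(lam) * t(V)))     # covariance V diag(lam) V'
  }
  list(X = mu[y, , drop = FALSE] + noise, y = y)
}
```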

Choice of tuning parameters
The purpose of this subsection is to investigate the effect of the various input parameters $A$, $B$, $d$ and $\ell$ in Algorithm 1, and to recommend sensible default choices. In Figure 1, we plot the misclustering rate with Algorithm 3 as a base procedure in our Gaussian semi-supervised learning setting as each of these parameters varies, for four different SNR levels. After applying Algorithm 1, we obtain our final estimated cluster labels by using Algorithm 3 again on the data projected onto the selected coordinates, with a single hierarchical clustering initialization, and output the predicted labels $\hat y_i$. The panels in Figure 1 reveal that the misclustering rate is quite robust to the choices of $A$, $B$ and $d$, and that it is less serious (and may even help) to choose $\ell$ larger, rather than smaller, than $s_0$. In particular, it seems that $A = 150$ suffices for almost optimal performance (though there appears to be some penalty for choosing it to be as small as 50), and $B = 75$ appears adequate. There is no clear trend in performance with the choice of $d$, so for simplicity we took $d = s_0$ in our remaining simulations below. Finally, the misclustering rate appears to decrease as $\ell$ increases, with an elbow in the curve visible at the highest value of the SNR when $\ell$ is set to the true sparsity level $s_0$. Of course, if $\ell$ is chosen to be very large, then we will include many noise variables, and the misclustering rate will eventually deteriorate. Nevertheless, the bottom-right panel of Figure 1 indicates that the gain in increasing the probability of including all signal variables may outweigh the penalty of also including more noise variables; as expected, this effect is larger when the SNR is larger. For simplicity we choose $\ell = s_0$ in our remaining simulations, though we recommend practitioners err on the side of choosing larger $\ell$.

Comparison with existing methods
In this subsection, we compare the empirical performance of the Sharp-SSL algorithm in high-dimensional clustering tasks with several existing approaches. We apply the Sharp-SSL algorithm using the EM algorithm of Algorithm 3 as a base procedure, with input parameters $A = 150$, $B = 75$ and $d = \ell = s_0$ as discussed in Section 4.1, and our final estimated cluster labels are then obtained as described there.
We compare the Sharp-SSL algorithm with five alternative high-dimensional clustering methods: spectral clustering (e.g. von Luxburg, 2007), the $\ell_1$-penalized approach of Witten and Tibshirani (2010) and the RPEClus algorithm of Anderlucci, Fortunato and Montanari (2022), as well as a pair of methods that, like Sharp-SSL, apply dimension reduction prior to a low-dimensional clustering algorithm.

Figure 2: Average misclustering rate over 100 repetitions using Sharp-SSL followed by the EM algorithm, as well as using the other methods from Section 4.2. Data are generated from the normal mixture distribution described at the beginning of Section 4 with $K = 3$ and $p = 200$ (left, isotropic) as well as $p = 600$ (right, anisotropic). The three cluster means are given by $\mu_1 = a(1, 1, 0, 0_{p-3})^\top$, $\mu_2 = a(-1, 0, 1, 0_{p-3})^\top$ and $\mu_3 = a(0, -1, -1, 0_{p-3})^\top$, where the scale $a$ is chosen such that their pairwise distances are all equal to SNR. For isotropic settings, $\Sigma_w = I_p$; for anisotropic settings, $\Sigma_w = V \Lambda V^\top$, where $\Lambda \in \mathbb{R}^{p \times p}$ is diagonal with independent $\mathrm{Unif}[0,2]$ diagonal entries and $V$ is independent of $\Lambda$ and sampled from the Haar measure on $\mathbb{O}^{p \times p}$. The Bayes risk is shown as the gray dashed line. In the top panels, $n = 250$ and the SNR varies; in the bottom panels, $\mathrm{SNR} = 3$ and $n$ varies. The shaded regions represent interpolated 95% confidence intervals at each of the points.
In more detail, the spectral clustering approach first constructs a $J$-nearest neighbour graph adjacency matrix $A = (A_{i,i'})_{i,i' \in [n]} \in \{0,1\}^{n \times n}$, where $A_{i,i'} := 1$ if either $X_i$ is one of the $J = 10$ nearest neighbours of $X_{i'}$ in Euclidean distance or vice versa, and $A_{i,i'} := 0$ otherwise. It then computes an $n \times K$ matrix of eigenvectors associated with the $K$ smallest non-zero eigenvalues of the Laplacian matrix $L := D - A$, where $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with diagonal entries $D_{i,i} := \sum_{i' \in [n]} A_{i,i'}$. The final step is to apply the $K$-means clustering algorithm (Lloyd, 1982), as implemented in the kmeans base R function with 100 random initializations, to the rows of this eigenvector matrix with the oracle choice of $K$.
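A minimal R sketch of this comparator, with illustrative function and variable names, is as follows.

```r
# Sketch of the spectral clustering comparator: symmetrised J-nearest-neighbour
# graph, unnormalised Laplacian, then K-means on the rows of the eigenvector
# matrix associated with the K smallest non-zero eigenvalues.
spectral_cluster <- function(X, K, J = 10) {
  n <- nrow(X)
  D2 <- as.matrix(dist(X))                  # pairwise Euclidean distances
  A <- matrix(0, n, n)
  for (i in seq_len(n)) {
    nb <- order(D2[i, ])[2:(J + 1)]         # J nearest neighbours (excluding i)
    A[i, nb] <- 1
  }
  A <- pmax(A, t(A))                        # edge if either is a neighbour of the other
  L <- diag(rowSums(A)) - A                 # unnormalised graph Laplacian
  eig <- eigen(L, symmetric = TRUE)         # eigenvalues in decreasing order
  idx <- rev(which(eig$values > 1e-10))[seq_len(K)]   # K smallest non-zero
  U <- eig$vectors[, idx, drop = FALSE]
  kmeans(U, centers = K, nstart = 100)$cluster
}
```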
The Witten and Tibshirani (2010) method, which is implemented in the sparcl R package, determines the estimated cluster memberships by maximizing a coordinatewise-weighted between-cluster sum of squares criterion, subject to an $\ell_1$ constraint on the weights. A permutation approach is used to select the $\ell_1$ tuning parameter.
In the RPEClus algorithm of Anderlucci, Fortunato and Montanari (2022), we generate $B$ random orthogonal projections and incorporate the $d$-dimensional projected data as covariates for a linear regression with the orthogonal complement of the projected data as the response. We then use the Bayesian Information Criterion (BIC) from both an application of the EM algorithm to the projected data and the aforementioned regression to identify good projections, and aggregate using the consensus clustering technique of Dimitriadou, Weingessel and Hornik (2002) over the best $B^*$ projections chosen according to the sum of the BIC scores. Following the recommendation of Anderlucci, Fortunato and Montanari (2022), we took $B = 1000$ and $B^* = 100$, as well as $d = s_0$. It turned out that this approach had a misclustering rate almost identical to that of a random guess, likely because it did not leverage the sparsity of the signal. We therefore modified this method by generating random axis-aligned projections instead of orthogonal ones, and report this version in our comparison.
The first of the two-stage approaches applies principal component analysis (PCA) to project the data into the oracle choice of $K - 1$ dimensions (the dimension of the space spanned by the $K$ cluster means); the second uses sparse principal component analysis (SPCA), as implemented in the SPCAvRP algorithm (Gataric, Wang and Samworth, 2020) with inputs $A = 600$, $B = 200$ and the oracle choices $d = \ell = s_0$, to project into $s_0$ dimensions. Thereafter, both algorithms apply $K$-means to the projected data as above. We also explored the option of replacing the $K$-means steps in these latter algorithms with the EM algorithm, but observed very little difference, so do not report these results here.
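For concreteness, a minimal sketch of the first two-stage comparator (PCA followed by $K$-means) could look like the following; the function name is ours.

```r
# PCA + K-means two-stage comparator: project onto the top K - 1 principal
# components and cluster the resulting scores.
pca_kmeans <- function(X, K, nstart = 100) {
  scores <- prcomp(X, center = TRUE, scale. = FALSE)$x[, seq_len(K - 1), drop = FALSE]
  kmeans(scores, centers = K, nstart = nstart)$cluster
}
```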
Given true labels $y_1, \ldots, y_n \in [K]$ and estimated labels $\hat y_1, \ldots, \hat y_n \in [K]$ from a clustering algorithm, we measure the performance of the algorithm via its misclustering rate, defined as
$$L\bigl(\{y_1, \ldots, y_n\}, \{\hat y_1, \ldots, \hat y_n\}\bigr) := \min_{\pi \in S_K} \frac{1}{n}\sum_{i=1}^n \mathbb{1}_{\{y_i \ne \pi(\hat y_i)\}},$$
where $S_K$ is the group of all permutations of $[K]$. In particular, Figure 2 presents the average misclustering rates over 100 Monte Carlo repetitions of the different high-dimensional clustering algorithms described above. Across two different dimensions $p \in \{200, 600\}$, isotropic and anisotropic settings, and for different values of $n$ and SNR, we see a consistent picture of the Sharp-SSL algorithm combined with EM producing the lowest misclustering rates, often by a large margin. Indeed, for all but the smallest sample sizes or values of SNR, the Sharp-SSL+EM algorithm nearly attains the Bayes risk in all of the problems considered here.
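A small base-R sketch of this loss, using an exhaustive search over relabellings (adequate for the small values of $K$ used here), is given below; the helper name is illustrative.

```r
# Misclustering rate: minimise the misclassification proportion over all
# relabellings (permutations of [K]) of the estimated cluster labels.
misclustering_rate <- function(y, y_hat, K = max(c(y, y_hat))) {
  all_perms <- function(v) {            # all permutations of a vector (base R)
    if (length(v) <= 1) return(list(v))
    out <- list()
    for (i in seq_along(v)) {
      for (rest in all_perms(v[-i])) out <- c(out, list(c(v[i], rest)))
    }
    out
  }
  best <- 1
  for (perm in all_perms(seq_len(K))) {
    best <- min(best, mean(y != perm[y_hat]))
  }
  best
}
```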

Effect of observed fraction on misclustering rate
One of the key attractions of our procedure is that it offers a unified framework to perform classification or clustering with an arbitrary fraction of labeled observations. In this subsection, we explore the performance of the algorithm as we vary the proportion of observed labels.
Recall that we have two different options for the way in which we implement the Sharp-SSL algorithm to estimate the set of signal coordinates: we can either use only the labeled data, as in the supervised learning approach of Algorithm 2, or we can try to leverage in addition the unlabeled data via the semi-supervised EM approach of Algorithm 3. In the extreme case of this latter version, we have no labeled data, so the algorithm is unsupervised. In Figure 3 we compare the performance of these three methods in both high- and low-dimensional versions of the normal mixture distribution data generation mechanism described at the beginning of Section 4, as the proportion $\gamma$ of observed labels varies.
More precisely, for the semi-supervised and unsupervised algorithms, we adopt the same implementation of Sharp-SSL as described at the beginning of Section 4.2. The supervised algorithm is very similar, but applies Algorithm 2 in place of Algorithm 3 to select coordinates, and obtains final predicted labels by applying LDA again on the projected labeled data. In cases where the proportion of labeled data was so small that the convex hull of the projected labeled data was less than full-dimensional for every class, we forced Algorithm 2 to return a zero matrix (this only happened when $\gamma$ was very small).
The top panels of Figure 3 present the results in high-dimensional settings with $p \in \{200, 600\}$. Since the unsupervised approach has no access to the labels, it has constant misclustering rate. The performance of the semi-supervised approach is always at least as good as that of the unsupervised algorithm, and improves as $\gamma$ increases. In other words, it effectively leverages the additional information provided by the class labels. When $\gamma$ is very small, the supervised algorithm, which ignores the unlabeled data, is inaccurate, as it has very little data to work with. On the other hand, its performance also improves as $\gamma$ increases, and once around 5% of our data are labeled, it outperforms the unsupervised algorithm. Further, it essentially matches the semi-supervised approach when about a third of the data are labeled. We truncate the plot at $\gamma = 1/2$ to ensure that we have enough test data on which to compute the misclustering rate.
In the bottom panels of Figure 3, we explore the performance of the three algorithms above in two low-dimensional settings with different values of SNR, in order to provide further insight into the phenomena described in the previous paragraph. Here, we take $K = 2$ and report the average Frobenius norm loss of estimating the cluster mean matrix $(\mu_1, \mu_2) \in \mathbb{R}^{p \times 2}$.
In these low-dimensional problems, a similar picture emerges: if the proportion of labeled data is small, then the unsupervised algorithm outperforms the supervised one, but this situation may be reversed when $\gamma$ is larger. The semi-supervised algorithm is able to leverage both the unlabeled and labeled data to obtain the best of both worlds. These empirical observations agree with our theory from Section 3, in particular in the way in which Theorem 6 bounds the accuracy of mean estimation for the semi-supervised algorithm by a minimum of a term that does not depend on $\gamma$ and one that decreases as $\gamma$ increases.
It appears that the switch in the minimum occurs around γ = 0.02 in these examples.

Empirical data analysis
In this subsection we apply the Sharp-SSL algorithm, as well as several competing methods, to the gene expression data set from Alon et al. (1999), which contains observations on 62 patients. A preprocessed version of the data can be downloaded from the R package 'datamicroarray' (Ramey, 2016), with a total of 2000 features (genes) measured on 40 patients with colon tumors and 22 without tumors. We first exclude 9 genes to remove perfect collinearity and then standardize each of the remaining $p = 1991$ columns of the dataset to have unit variance.
We apply the Sharp-SSL algorithm using EM (Algorithm 3) as the base procedure, with input parameters $A = 150$, $B = 75$ and $d = \ell = 5$. In addition to our approach (Sharp-SSL+EM), we also compare the performance of the spectral clustering (SC) method, the Witten and Tibshirani (2010) method (WT2010), as well as four two-stage methods (PCA+Kmeans, PCA+EM, SPCA+Kmeans, SPCA+EM), where we first reduce the dimension of the data to a 5-dimensional subspace using either PCA or SPCA and then apply either the EM algorithm or $K$-means clustering on the low-dimensional data. For SPCA, we use the SPCAvRP algorithm (Gataric, Wang and Samworth, 2020) with inputs $A = 600$, $B = 200$ and $d = \ell = 5$. The true labels are hidden from all algorithms and are only used to evaluate the final misclustering rate.
Over 100 Monte Carlo repetitions of the randomized algorithms, the Sharp-SSL+EM method had an average misclustering rate of 28.8%, whereas all other competitors had a misclustering rate above 40%, as can be seen from the right-hand data points in Figure 4. To investigate this performance further, we applied each method to a subset of the features. These were constructed from the top $\ell = 5$ genes identified through Sharp-SSL, together with $m = 0, 10, 50, 200$ and 600 randomly chosen genes from the remaining 1986. The results are presented as the other data points in Figure 4. We see that the improved performance of Sharp-SSL+EM relative to the other methods persists, even when only a small number of potentially non-discriminative covariates are present. When $m = 0$, Sharp-SSL+EM has a slight disadvantage, as other algorithms benefit from the ensemble effect of combining two different learning methods; nevertheless it remains competitive. This reinforces the point that the primary contribution of the Sharp-SSL algorithm is to identify signal coordinates that are helpful for semi-supervised learning, and once this task has been accomplished, a variety of low-dimensional procedures are available to the practitioner.
5 Proof of the main results

Proof of Proposition 1
The assumption that the convex hull of $\nu_1, \ldots, \nu_K$ is $(K-1)$-dimensional ensures that $\Sigma_b$ has rank exactly $K - 1$. Writing $A := \Sigma_w^{-1/2}\Sigma_b\Sigma_w^{-1/2}$ and letting $Q \in \mathbb{O}^{p \times d}$ have the same column space as $\Sigma_w^{1/2} V$, we have $J(V; \Sigma_b, \Sigma_w) = \operatorname{tr}(Q^\top A Q)$. Thus, $J(V; \Sigma_b, \Sigma_w)$ depends on $V$ only through the column space of $\Sigma_w^{1/2} V$. Moreover, $\operatorname{tr}(Q^\top A Q)$ is maximized when $Q$, or equivalently $\Sigma_w^{1/2} V$, spans a $d$-dimensional space that contains the $(K-1)$-dimensional eigenspace corresponding to the non-zero eigenvalues of $A$. Note that if for some $v \in \mathbb{R}^p \setminus \{0\}$ and $\lambda \ge 0$ we have $Av = \lambda v$, then $\Sigma_w^{-1}\Sigma_b(\Sigma_w^{-1/2} v) = \lambda \Sigma_w^{-1/2} v$. Thus, the eigenspace corresponding to the non-zero eigenvalues of $A$ is spanned by $\Sigma_w^{-1/2}(\nu_1 - \nu), \ldots, \Sigma_w^{-1/2}(\nu_K - \nu)$, and so the eigenspace corresponding to non-zero eigenvalues of $\Sigma_w^{-1}\Sigma_b$ is spanned by $\Sigma_w^{-1}(\nu_1 - \nu), \ldots, \Sigma_w^{-1}(\nu_K - \nu)$, as required.

Proof of Theorem 2
We write S a,b := {j ∈ [p] : (P a,b, P a,b ) j,j = 1}.For any S = {j 1 , . . ., j d } ∈ [p]  d , we identify the set S with the sequence j i 1 < . . .< j i d sorted in increasing order; and with a slight abuse of notation, we will use S to refer to either object, which will always be clear depending on the context.We define P S ∈ P d by (P S ) ,j := 1 {j=j } , so that P S, P S = diag (1 {j∈S} ) j∈ [p] .Define Q S := (P S Σ w P S, ) −1 P S Σ b P S, ∈ R d×d and QS := ψ (P S X i , Y i ) i∈[n] ∈ R d×d .Note that Qa,b = QS a,b in this notation, and we will similarly denote Q a,b := Q S a,b for simplicity.Recalling the definition of Ω from (8), in our new notation, and recalling that ψ is permutation-equivariant, we can write , and have P(Ω) ≥ 1 − ε by ( 7).We will work on the event Ω throughout the remainder of the proof, and assume also that γ min > 0, because otherwise the conclusion is trivial.By Weyl's inequality (Weyl, 1912;Stewart and Sun, 1990, Corollary IV.4.9), we have on Ω that On the other hand, we note that tr(Q S ) = j∈S∩S 0 (Σ −1 w Σ b ) j,j .Therefore, by the triangle inequality, for any S, S ∈ [p]  d such that S ∩ S 0 is a proper subset of S ∩ S 0 , we have on Ω that tr( QS ) Fix a ∈ [A], and for any j ∈ [p], define q j := P j ∈ S a,b * (a) | (X i , Y i ) i∈ [n] .Now fix some j ∈ S 0 and j ∈ [p] \ S 0 .We claim that (q j − q j )1 Ω ≥ 0. (16) Now let F : S b,j → S b,j be defined as F (S a,1 , . . ., S a,B ) := S a,1 , . . ., S a,b−1 , f (S a,b ), S a,b+1 , . . ., S a,B .
We claim that F is both well-defined and injective on Ω.For the first of these claims, we note that since j ∈ S a,b , we must have j ∈ f (S a,b ).Moreover, if (S a,1 , . . ., S a,B ) ∈ S b,j , then b * (a) = b.But (16) holds on Ω, so S a,1 , . . ., S a,b−1 , f (S a,b ), S a,b+1 , . . ., S a,B ∈ S b,j .Hence F is well-defined.For the second claim, suppose that S 1 , S 2 ∈ We deduce that f is injective on {S : j ∈ S}.Since j ∈ S a,b for (S a,1 , . . ., S a,B ) ∈ S b,j , this establishes the injectivity of F .In particular, |S b,j | ≤ |S b,j |.Consequently, on Ω, we have for all b ∈ [B] that which implies Claim (15).We remark that one consequence of ( 15) is that, since d ≥ s 0 , we have on Ω that  15) and ( 17), we have on Ω that Now, let a, j, j be freely varying again.Since ŵj = A −1 a∈[A] [P a,b * (a), Qa,b * (a) P a,b * (a) ] j,j , on Ω we have for any j ∈ S 0 and j / from ( 18).Since ≥ s 0 , we have by Hoeffding's inequality that on Ω, as desired.

Proof of Theorem 3
The main ingredient of the proof of Theorem 3 is the following proposition, which controls the rate of convergence of the sample between-class and within-class covariance matrices to their respective population versions in a classification problem.
Proposition 8 (Rate of convergence for LDA).Suppose that are independent and identically distributed data-label pairs, such that and let Σw and Σb be computed as in (3) applied to and Σ w op ≤ R 2 for some R 1 , R 2 > 0, then for every δ ∈ (0, 1/4], we have with probability at least 1 − δ that Proof.We first control the rate of convergence of Σb .For We will control the three terms on the right-hand side above separately.For the first term, we note that For the second term, we first apply McDiarmid's inequality again to see that with probability at least 1 − δ/4, we have Thus, we have with probability at least 1 Finally, for the third term, we write μk := i:Y i =k Z i and note that where V := (V 1 , . . ., V K ) , and where μ : a formal definition is given just before Lemma 17.For any fixed u ∈ S d−1 , we have by Muirhead (2009, Theorem 10.3.6) that where χ 2 r (λ) denotes a non-central chi-squared distribution with r degrees of freedom and non-centrality parameter λ.By Birgé (2001, Lemma 8.1), for every δ ∈ (0, 1/2], we have with probability at least 1 − 2δ conditional on Y 1 , . . ., Y n , that Let N be a 1/4-net of the sphere S d−1 , which can be chosen to have cardinality at most 9 d (Vershynin, 2012, Lemma 5.2).Hence, through a union bound, and taking δ := δ/(8•9 d ), we have with probability at least 1 Combining ( 19), ( 20), ( 21) and ( 22), we have that the desired bound on Σb − Σ b op occurs on an event with probability at least 1 − 3δ/4.We now turn to control Σw − Σ w op .Let Q : where Z := (Z 1 , . . ., Z n ) .It therefore follows again by Lemma 17 that Σw | Y 1 , . . ., Y n ∼ n −1 W d (n − K, Σ w ).Another application of Muirhead (2009, Theorem 10.3.6) then yields for any u ∈ S d−1 that By Laurent and Massart (2000, Lemma 1), we have with probability at least 1 Again, taking a union bound over the 1/4-net N , we conclude that with probability at least 1 − δ/4, we have as desired.(P X i − μY i ,P )(P X i − μY i ,P ) and Σb,P : Observe that since n ≥ Kd + 1, we have max k∈[K] n k ≥ d + 1, so Σw,P is positive definite with probability 1.Thus, by the triangle inequality, for each P ∈ P d , we have By Proposition 8 and our hypothesis, there is an event Ω P with probability at least 1 − δ, on which Thus, for the first term in ( 23 where we used (24) in the penultimate inequality.For the second term in ( 23), we also have on Ω P that The desired result follows by combining ( 25) and ( 26), and using the fact that log(1/δ) ≤ d log(ep/d) + log(1/ε).

Proofs of Proposition 5 and Theorem 6
In the proof of Proposition 5, we show the convergence of the EM iterates $\hat\mu^{(t)}$ by analyzing their components parallel and orthogonal to $\mu^*$ separately. Writing $\eta := \mu^*/\|\mu^*\|$, let $\alpha_t := \eta^\top \hat\mu^{(t)}$ and $\beta_t := \|(I_d - \eta\eta^\top)\hat\mu^{(t)}\|$, so that $\hat\mu^{(t)} = \alpha_t \eta + \beta_t \xi_t$, where $\xi_t \in \mathcal{S}^{d-1}$ is orthogonal to $\eta$. Our proof will combine several propositions that control $\alpha_t$ and $\beta_t$ under different conditions. We begin by laying some groundwork and defining some quantities that will be used throughout this subsection. First, it will be convenient to relabel the two classes as $\{-1, 1\}$ instead of $\{1, 2\}$. By the rotational symmetry of the problem, we may assume without loss of generality that $\mu^* = (s, 0, \ldots, 0)^\top \in \mathbb{R}^d$ for some $s \ge 0$, and that the first $n_L$ observations are labeled (i.e., $Y_i \ne 0$ for $i \in [n_L]$). We assume throughout this section that $s \le r$ and $r \ge 1$.

Let μn
with the convention that μn L := 0 if n L = 0, and define the function with f n U := 0 if n U = 0. Throughout, and without further comment, we assume that n = n L + n U ≥ 2. In this notation, the EM update ( 12) can be rewritten, defining the function g n : R d → R d , as The corresponding population quantities are For ω, φ > 0 and r ≥ 1, we define the following two events that control the terms in the EM iteration involving the unlabeled and labeled data respectively: Proposition 9.There exists C r > 0, depending only on r, such that for any δ ∈ (2e −n , 1] , we have P Ω 1 (ω) c ≤ δ.Moreover, for any δ ∈ (0, 1] and for Birgé (2001, Lemma 8.1), we have with probability at least 1 − δ/2 that sup Also, by a very similar argument as in the proof of Wu and Zhou (2022, Theorem 4), we have with probability at least 1 for some C r > 0 depending only on r.The first claim follows by combining ( 31) and (32).
For the second claim, we have μn Hence, by Laurent and Massart (2000, Lemma 1), we have as required.
For any $a \in \mathbb{R}$, $b \in [0, \infty)$ and $\xi \in \mathcal{S}^{d-1}$ that is orthogonal to $\eta$, we define $F(a, b) := \eta^\top f(a\eta + b\xi)$ and $G(a, b) := \|(I_d - \eta\eta^\top) f(a\eta + b\xi)\|$. Note that the distribution of $Z_1$ is orthogonally invariant along the axis $\mu^*$; in other words, if $P \in \mathbb{R}^{d \times d}$ is orthogonal and has $\mu^*$ as an eigenvector with eigenvalue 1, then $PZ_1 \stackrel{d}{=} Z_1$. It follows that $f(a\eta + b\xi)$, and hence $F(a, b)$ and $G(a, b)$, do not depend on $\xi$. We remark that an analogous decomposition holds at the next iterate, for some $\xi_{t+1} \in \mathcal{S}^{d-1}$ that is orthogonal to $\eta$.
The following result bounds the magnitude of the signal component, α t , of the EM iterates.
Next, we show that provided the initialization is not too uncorrelated with the true parameter, |α t | reaches a level that makes Proposition 12 applicable after a sufficient number of iterations.
Proposition 13.Assume that n ≥ 3, that φγ 1/2 ≤ ω ≤ min{1/12, 1/(r + 3)} and that γ ∈ [0, 1/2).Suppose that μ( 0 Proof.By flipping the sign of µ * if necessary, we may assume without loss of generality that α 0 ≥ 0. Assuming that the desired result is not true, we will prove by induction that on Ω for all t ≥ 0. We show this by first verifying the base case of (a), then proving that (a) implies (b) for each t, and finally proving that α t+1 /β t+1 ≥ 1/(60 √ d log n U ) once (b) holds for a given t.
For the base case, from the assumption on μ(0) , we have since c ≤ 60.Now assume that α t /β t ≥ c /(60 √ d log n U ) and that α 0 ≤ α t < c 5 s for some t ≥ 0. We aim to show that (b) holds for the same t, and start by controlling e 1 f (μ (t) ).Let W = (W 1 , . . ., W d ) be an independent copy of Z 1 , independent of all other randomness in the problem, and define Then by applying the second part of Lemma 18 with a = W 1 α t and b = W −1 μ(t) −1 (so that a + b = W μ(t) ), we have where in the final step we have used the fact that u t is an odd function of W −1 μ(t) −1 , which has a symmetric distribution about 0, conditional on (W 1 , μ(t) −1 ), and hence E(W −1 ) = 0. From the assumption s ≥ c 4 ζ, and using β t ≤ 60(1 + r)ζ from Proposition 10, we have for sufficiently large c 4 that . By choosing c 5 > 0, depending only on r, sufficiently small, we may assume that α 2 t (1 + 2s 2 + s 4 /3) < c 2 5 (1 + 2r 2 + r 4 /3)s 2 ≤ s 2 /4.Recall the definition of f n U from (28).Since on the event Ω 1 (ω) ∩ Ω 2 (φ), we have μ(t) ≤ r + 3 ≤ 2(r + √ d) as in the first line of the proof of Proposition 10, we have on the event Ω 1 (ω) ∩ Ω 2 (φ) that Hence, in either case, we have on Ω 1 (ω) ∩ Ω 2 (φ) that where the final bound holds provided we reduce c 5 to be at most 1/(2+124/c ) if necessary.Now, when γ ≤ ω √ d log n U , we have ζ = ω 1/2 ∧ωγ −1/2 ≥ ω 1/2 (d log n U ) −1/4 and hence by the condition on s in the proposition, we have Thus, by increasing c 4 to be at least 4 + 248/c if necessary, we have (1 + 62/c )ω √ d log n U ≤ s 2 /4.Hence, in this case, and on the event Ω 1 (ω) ∩ Ω 2 (φ), On the other hand, when γ > ω √ d log n U , we have Combining the two bounds above proves (b) for this given t.
To prove Theorem 6, we need the following proposition, which relates the loss of estimating µ * to the operator norm loss of estimating µ * µ * .Proposition 14. Assume that n ≥ 3 and that . For any δ ∈ (0, 1) and B > 0, we have with probability at least 1 − δ that Thus, it is enough to show that sup µ: µ ≤B n −1 n i=1 L i (µ) (s 2 ∨1){d log(2Bn+e)+log(1/δ)} n with probability at least 1 − δ.To this end, we have sup For the first term on the right-hand side of (60), by Hoeffding's inequality, we have For the second term on the right-hand side of (60), let N be a ε-net of {v : v ≤ B} with respect to the Euclidean distance, for some ε ∈ (0, 1/2] to be specified later.Since a maximal ε-packing set is an ε-net, we may assume that Using the fact that x → tanh x is 1-Lipschitz, together with the Cauchy-Schwarz inequality, we have Hence taking ε = 1/n, and defining τ := log(3/δ) d log(2Bn+e) > 0, we have where the penultimate bound uses Hoeffding's inequality and the Cauchy-Schwarz inequality and the final bound uses the fact that n Birgé (2001, Lemma 8.1).Combining (60), ( 61) and (62), we have with probability at least 1 − δ that sup as desired.

μ(T)
[m] for the T th (final) iterate of the EM update in Algorithm 3 starting from the mth random initializer μ(0) [m] .Let ω = C r ω 0 , φ = C r φ 0 , Ω 1 (ω) and Ω 2 (φ) be defined as in the proof of Proposition 5. Further, let Σ b (µ) be defined as in Proposition 14.By Proposition 9, the first claim in the proof of Proposition 10 and Proposition 14, we have for some C > 0 depending only on r that In this balanced two-cluster setup, for the tth EM iteration starting from the mth random initializer, we have −μ 1 = μ2 = μ, where we suppress the dependence on t and m for convenience.For i ≥ n L + 1, we have L i,1 = e z i μ1 /(e z i μ1 + e z i μ2 ), L i,2 = e z i μ2 /(e z i μ1 + e z i μ2 ) and hence L Consequently, using the notation of Proposition 14 and Algorithm 3, we have Q . Thus, by Proposition 14, with probability at least P Ω 1 (ω)∩Ω 2 (φ) −δ ≥ 1 − 3δ, we have lim sup We now turn to the case where µ * > ω are independent, and moreover, by Lemma 21 we have Defining Ω 3 := {M 0 > M/2}, by Hoeffding's inequality, we have Thus, on the event Ω 1 (ω) ∩ Ω 2 (φ) ∩ Ω 3 , we can let m := min M 0 ∩ (M 1 ∪ { m}) , so by definition of m, we have Since ω 0 ≤ (d log n) −3 , by discussing cases of γ < ω, ω ≤ γ ≤ ω 2/3 and γ > ω 2/3 , we see that ω From the proof of Proposition 5(ii), we have on the event Ω Let Ω 4 be the event on which the conclusion of Proposition 14 holds.Then on Ω The desired result follows by combining ( 63) and ( 64), and the fact that

Proof of Corollary 7
Proof of Corollary 7. Fix P ∈ P d , define Z −n , and µ * ≤ ν * ≤ r.Let c > 0 be chosen, depending only on r, to satisfy Theorem 6.By increasing C 1 > 0, depending only on r, if necessary, we may assume that ω 0 ≤ min{c, (d log n) −3 }.Hence, since (P Σ w P ) −1 P Σ b P = µ * µ * , we can apply Theorem 6 to obtain that for some C 2 > 0 depending only on r, with probability at least 1 − 3δ − e −M/50 we have that lim sup Since ψ (T ) is permutation equivariant for each T ≥ 0, by Fatou's lemma and a union bound, we have that lim sup The result now follows from Theorem 2, noting that γ min = (ν * min ) 2 and γ max = (ν * max ) 2 .
Proof.Write X = (X 1 , . . ., X n ) .Observe that, writing d TV for the total variation distance between probability measures, P H 0 ψ(X) = 1 + P H 1 ψ(X) = 0 ≥ 1 − d TV (P H 0 , P H 1 ) To control the chi-squared divergence in the right-hand side of (65) above, we let ξ = (ξ 1 , . . ., ξ n ) have independent Rademacher components and W = (W i,j ) i∈[n],j∈[d] be a random matrix with independent N (0, 1) entries, independent of ξ.Then X d = W under H 0 and X d = ξµ * + W under H 1 .Let ξ be an independent copy of ξ.Using the Ingster-Suslina device, see, e.g., Ingster and Suslina (2012), Liu, Gao and Samworth (2021), Lemma 21, we have that dP H 1 dP H 0 2 dP H 0 = E exp ξµ * , ξµ * = cosh n ( µ * 2 ) ≤ e n µ * 4 /2 ≤ e 1/2 , where we used the fact that cosh x ≤ e x 2 /2 for all x ∈ R in the penultimate step.The desired result follows from substituting the above bound into (65) and the fact that 1 − (e 1/2 − 1) 1/2 /2 > 1/2.We prove a generalization of Cochran's theorem for quadratic forms of independent Gaussian random vectors with a common covariance matrix, which result in independent noncentral Wishart distributions.Recall that if X is a matrix, then vec(X) is the vectorization of X, obtained by stacking its columns on top of each other.The Kronecker product between matrices A = (A i,j ) i∈[m],j∈ [n] and B is defined as Recall also that when X 1 , . . ., X n iid ∼ N d (0, Σ), the matrix n i=1 X i X i has a d-dimensional Wishart distribution with n degrees of freedom and covariance matrix Σ ∈ S d×d , denoted W d (n, Σ).More generally, n i=1 (X i +µ i )(X i +µ i ) has a non-central Wishart distribution with n degrees of freedom, covariance matrix Σ and non-centrality matrix Ω = n i=1 µ i µ i , written W d (n, Σ; Ω).Thus W d (n, Σ; 0) Lemma 17.Let Z 1 , . . ., Z n be independent with Z i ∼ N d (µ i , Σ) for i ∈ [n], and write Z := (Z 1 , . . ., Z n ) ∈ R n×d and M := E(Z).If P 1 , . . ., P k ∈ R n×n are positive semidefinite matrices such that P 1 + • • • + P k = I n and rank(P 1 ) + • • • + rank(P k ) = n, then Z P 1 Z, . . ., Z P k Z are independent with Z P r Z ∼ W d rank(P r ), Σ; M P r M .
Proof.As in the proof of the classical Cochran's theorem (Cochran, 1934), we first note that P 1 , . . ., P k can be simultaneously diagonalised such that P r = QD r Q , for some Q ∈ O n×n and D r = diag (1 {j∈Sr} ) j∈ [n] , where S r ⊆ [n], |S r | = rank(P r ) and S r ∩ S r = ∅ for all r = r .In particular, P 1 , . . ., P k satisfy P 2 r = P r for r ∈ [k] and P r P r = 0 for all r = r .Since P 1 Z, . . ., P k Z are jointly Gaussian, with Cov vec(P r Z), vec(P r Z) = Cov (I d ⊗ P r )vec(Z), (I d ⊗ P r )vec(Z) = (I d ⊗ P r )(Σ ⊗ I n )(I d ⊗ P r ) = 0, we have that P 1 Z, . . ., P k Z are independent.But Z P r Z = (P r Z) P r Z, so it follows that Z P 1 Z, . . ., Z P k Z are independent.Moreover, writing W = (W 1 , . .as desired. For the second inequality, since both sides are even functions of both a and b, we may again assume without loss of generality that a > 0 and b ≥ 0. We may also assume that b ≤ 1 since otherwise, the right-hand side is negative and the inequality holds trivially.Lemma 19.Let H : [0, ∞) → [0, ∞) be an increasing, concave function with H (x 0 ) < 1 for some x 0 ≥ 0 and either H(0) > 0 or both H(0) = 0 and H (0) > 1.Then there exists a unique α * > 0 such that Moreover, if α 0 > 0, then the sequence (α t ) t≥0 given by α t := H(α t−1 ) monotonically converges to α * .

Figure 1: Average misclustering rate over 200 repetitions in our anisotropic Gaussian semi-supervised learning problem with $n = 250$, $p = 600$, $s = 4$, $K = 3$, $\gamma = 0.05$, $\mathrm{SNR} \in \{2.5, 3, 3.5, 4\}$ and $\Sigma_w = V \Lambda V^\top$, where $\Lambda \in \mathbb{R}^{p \times p}$ is diagonal with independent $\mathrm{Unif}[0,2]$ diagonal entries and $V$ is independent of $\Lambda$, and generated according to the Haar measure on $\mathbb{O}^{p \times p}$. For each of the four panels, we fix three of $d = 4$, $\ell = 4$, $A = 150$, $B = 75$, and vary the remaining one. The shaded regions represent interpolated 95% confidence intervals at each of the points.

Figure 3: Effect of label fraction on performance of supervised, semi-supervised and unsupervised Sharp-SSL learning methods. Data are generated from the normal mixture distribution described at the beginning of Section 4 with $K = 2$ and $\Sigma_w = I_p$, $\mu_1 = -\mu_2 = a(1_s, 0_{p-s}) \in \mathbb{R}^p$, where $a$ is chosen such that $\|\mu_1 - \mu_2\| = \mathrm{SNR}$. Bottom: average Frobenius loss of estimating $(\mu_1, \mu_2) \in \mathbb{R}^{p \times 2}$ over 100 repetitions via the semi-supervised approach (Algorithm 3), the supervised approach (Algorithm 2) and the unsupervised approach (Algorithm 3 without using the labels). Top: average misclustering rate over 100 repetitions from applying the above three methods as base algorithms in Algorithm 1. The shaded regions represent interpolated 95% confidence intervals at each of the points.

Figure 4: Average misclustering rate (over 100 repetitions for randomized algorithms) for the colon tumor data, using Sharp-SSL followed by the EM algorithm, as well as the other methods described in Section 4.4. The right-hand data points plot the average misclustering rate on the full data set. The other points were obtained by applying each method to a subset of genes formed from the top five genes identified by Sharp-SSL together with randomly sampled genes. The shaded regions represent interpolated 95% confidence intervals at each of the points.
To prove this claim, define for $\tilde j \in \{j, j'\}$ and $b \in [B]$ the sets $\mathcal{S}_{b,\tilde j} := \{(S_{a,1}, \ldots, S_{a,B}) : b^*(a) = b,\ \tilde j \in S_{a,b}\}$ and $\mathcal{S}_b := \{(S_{a,1}, \ldots, S_{a,B}) : b^*(a) = b\}$. Let $f : \binom{[p]}{d} \to \binom{[p]}{d}$ be the map defined by $f(S) := (S \setminus \{j'\}) \cup \{j\}$ if $j \notin S$ and $j' \in S$, and $f(S) := S$ otherwise. If $j \notin S_{a,b}$ and $j' \in S_{a,b}$, then $f(S_{a,b}) \cap S_0 = (S_{a,b} \cup \{j\}) \cap S_0 = (S_{a,b} \cap S_0) \cup \{j\}$, so $S_{a,b} \cap S_0$ is a proper subset of $f(S_{a,b}) \cap S_0$; on the other hand, if either $j \in S_{a,b}$ or $j, j' \notin S_{a,b}$, then $f(S_{a,b}) = S_{a,b}$. It follows by (14) that on $\Omega$ we have $\operatorname{tr}(\hat Q_{S_{a,b}}) \le \operatorname{tr}(\hat Q_{f(S_{a,b})})$.