Minimum regularized covariance trace estimator and outlier detection for functional data

In this paper, we propose the Minimum Regularized Covariance Trace (MRCT) estimator, a novel method for robust covariance estimation and functional outlier detection. The MRCT estimator employs a subset-based approach that prioritizes subsets exhibiting greater centrality based on a generalization of the Mahalanobis distance, resulting in a fast-MCD type algorithm. Notably, the MRCT estimator handles high-dimensional data sets without the need for preprocessing or dimension reduction techniques, due to the internal smoothing, whose amount is determined by the regularization parameter $\alpha>0$. The selection of the regularization parameter $\alpha$ is automated. The proposed method adapts seamlessly to sparsely observed data by working directly with the finite matrix of basis coefficients. An extensive simulation study demonstrates the efficacy of the MRCT estimator in terms of robust covariance estimation and automated outlier detection, emphasizing the balance between noise exclusion and signal preservation achieved through appropriate selection of $\alpha$. The method converges fast in practice and performs favorably when compared to other functional outlier detection methods.


Introduction
The analysis of functional data is becoming increasingly important in many fields, including medicine, biology, finance, and engineering, among others. Functional data are characterized by measurements that are taken over a continuum, such as time or space, and are often modeled as curves or surfaces. One important aspect of functional data analysis is the estimation of the covariance structure, which describes the dependence between different points on the continuum. The covariance structure is essential for various tasks, such as smoothing, regression, and classification, and is often used to identify functional outliers, i.e., observations that deviate from the expected behavior.

Motivation
In the multivariate setting, given a random sample $x_1, \dots, x_n \in \mathbb{R}^p$ from a distribution with mean $\mu$ and covariance $\Sigma$, one approach to robustify the estimation of the mean and the covariance is to use weighted sample estimators, which assign different weights $w_1, \dots, w_n \geq 0$ to different data points based on their relative importance. In this approach, outliers or noisy data points ought to be given smaller weights, thus reducing their impact on the estimated mean and covariance. In particular, choosing weights $w_i \in \{0, 1\}$, $i = 1, \dots, n$, with the constraint $\sum_{i=1}^n w_i = h$ for some fixed $n/2 \leq h \leq n$, yields trimmed mean and covariance estimators. One of the most widely used members of this class of estimators is the minimum covariance determinant (MCD) estimator [Rousseeuw and Driessen, 1999]. For a given fixed number $h$, which is roughly half of the sample size $n$, its objective is to select $h$ (out of $n$) observations for which the calculated sample covariance matrix has the lowest determinant. In a parametric family of distributions parameterized by mean and covariance, we employ a similar strategy, however with the objective of minimizing the Kullback-Leibler (KL) divergence between the theoretical distribution (the underlying data distribution) and the one parameterized by the estimates of the mean and the covariance, using the corresponding sample estimates calculated on a subset $H \subset \{1, \dots, n\}$, $|H| = h$. The KL divergence (also known as relative entropy) between continuous distributions $P$ and $Q$ with densities $f_P$ and $f_Q$, respectively, denoted $\mathrm{KL}(P \| Q) := \mathrm{E}_P(\log(f_P(x)/f_Q(x)))$, is a measure of how far the probability distribution $P$ is from a second, reference probability distribution $Q$ [Kullback and Leibler, 1951, Kullback, 1959]. The intuitive and straightforward interpretation of the KL divergence is the expected excess surprise from using $Q$ to model the data when the actual distribution is $P$. As it involves (the estimation of) the density function, direct use of the KL divergence is usually mitigated by the use of appropriate approximations; see e.g. Blei et al. [2017] for more details. Assume now that the data $\{x_1, \dots, x_n\}$ originate from the normal $N(0, \Sigma)$ distribution, and denote by $\bar{x}_H$ and $\hat{\Sigma}_H(x)$ the mean and the covariance of $\{x_i : i \in H\}$. Denote further by $\hat{\Sigma}^{NC}_H(x) = \frac{1}{h} \sum_{i \in H} x_i x_i^\top$ the "non-centered" covariance of an $h$-subset $H$. The KL divergence between $N(0, \Sigma)$ and $N(\bar{x}_H, \hat{\Sigma}_H(x))$ is

$$\mathrm{KL}(N(0, \Sigma) \,\|\, N(\bar{x}_H, \hat{\Sigma}_H(x))) = \mathrm{tr}(\Sigma^{-1} \hat{\Sigma}^{NC}_H(x)) + \log(\det(\Sigma^{-1} \hat{\Sigma}_H(x))) - p; \quad (1)$$

see Zhang et al. [2023] for more details. As the second term $\log(\det(\Sigma^{-1} \hat{\Sigma}_H(x)))$ in (1) is the determinant of a regular matrix on a logarithmic scale, i.e. the sum of the logarithms of the corresponding eigenvalues, the first term $\mathrm{tr}(\Sigma^{-1} \hat{\Sigma}^{NC}_H(x))$ dominates (1). To support the claim that the minimizer of $\mathrm{tr}(\Sigma^{-1} \hat{\Sigma}^{NC}_H(x))$ approximately minimizes the KL divergence (1), we can examine the second term $\log(\det(\Sigma^{-1} \hat{\Sigma}_H(x)))$ in a different form. First, observe that $\log(\det(\Sigma^{-1/2} \hat{\Sigma}_H(x) \Sigma^{-1/2})) = \log(\det(\hat{\Sigma}_H(y)))$, where $\hat{\Sigma}_H(y)$ represents the covariance of the standardized observations $y_i = \Sigma^{-1/2} x_i$, $i \in H$. For $y_i$ we have $\mathrm{Cov}(y) = I_p$, indicating that any strongly consistent estimator $\hat{\Sigma}(y)$ of $\mathrm{Cov}(y)$ admits the representation $\hat{\Sigma}(y) = I_p + \varepsilon_n M_n$, where $M_n$ is a unit-norm $p \times p$ matrix and $\varepsilon_n \to_{a.s.} 0$.
This holds in particular for the consistent estimator $\hat{\Sigma}_H(y) = I_p + \varepsilon_n M_n$, for some unit-norm $p \times p$ matrix $M_n$. A slight modification of Jacobi's formula implies that the directional derivative of the determinant in the direction of $M_n$, evaluated at the identity, equals the trace of $M_n$, i.e. $\lim_{\varepsilon_n \to 0} (\det(I_p + \varepsilon_n M_n) - \det(I_p))/\varepsilon_n = \mathrm{tr}(M_n)$. The continuity of the determinant, together with this identity applied to the first-order Taylor expansion of $\det(\hat{\Sigma}^{NC}_H(y))$ around the identity, yields that for some $\delta_n \to_{a.s.} 0$

$$\det(\hat{\Sigma}_H(y)) = \det(\hat{\Sigma}^{NC}_H(y)) + \delta_n = \det(I_p) + \varepsilon_n \mathrm{tr}(M_n) + o(\varepsilon_n) + \delta_n.$$

This result implies that for sufficiently large $n$, the determinant of the consistent estimator of the covariance of the standardized observations behaves similarly to its trace, thereby reinforcing the claim that minimizing the trace of the non-centered covariance of the standardized observations approximately minimizes (1). Thus, instead of directly minimizing $\mathrm{KL}(N(0, \Sigma) \,\|\, N(\bar{x}_H, \hat{\Sigma}_H(x)))$ over all subsets $|H| = h$, we consider the following optimization problem:

$$H_0 \in \operatorname*{arg\,min}_{H : |H| = h} \mathrm{tr}(\hat{\Sigma}^{NC}_H(y)), \quad (2)$$

where $\hat{\Sigma}^{NC}_H(y) = \frac{1}{h} \sum_{i \in H} y_i y_i^\top$ is the non-centered covariance of the standardized observations $\{y_i = \Sigma^{-1/2} x_i : i \in H\}$. The corresponding estimators $\hat{\Sigma}_{H_0}(x)$ and $\bar{x}_{H_0}$ of the covariance and the mean are then referred to as the multivariate Minimum Covariance Trace (MCT) estimators of the covariance and the mean, respectively.
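The trace objective above can be made concrete with a small numerical sketch (our own illustration, not the paper's code; all variable names are ours). Because the trace of the non-centered covariance of a subset is just the average squared norm of its standardized points, the minimizing $h$-subset consists exactly of the $h$ points with smallest squared (Mahalanobis-type) norm:

```python
import numpy as np

# Illustrative sketch of the multivariate MCT objective: since
# tr(Sigma_H^NC(y)) = (1/h) * sum_{i in H} ||y_i||^2, the minimizing
# h-subset consists of the h standardized points of smallest norm.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
X = np.vstack([X, rng.normal(loc=5.0, size=(10, 3))])   # 10 shifted outliers
n, p = X.shape
h = 35

Sigma = np.cov(X, rowvar=False)              # reference covariance (classical here)
L = np.linalg.cholesky(np.linalg.inv(Sigma))
Y = (X - X.mean(axis=0)) @ L                 # standardized observations y_i

def mct_objective(idx):
    """Trace of the non-centered covariance of the standardized subset."""
    Yh = Y[idx]
    return float(np.trace(Yh.T @ Yh) / len(idx))

order = np.argsort((Y ** 2).sum(axis=1))
H0 = order[:h]                               # exact minimizer over h-subsets
```

In a robust procedure the reference covariance itself would be estimated iteratively from the current subset, which is precisely what the fast-MCD-type algorithm described later does.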
The optimization problem (2) highlights the need for an appropriate standardization of the data when extending this approach to more general spaces. The standardization ensures that the norm of the standardized object reflects a specific dissimilarity measure between the object and its corresponding mean. In the multivariate setting, where $\Sigma$ is a positive definite matrix, the norm of the standardized observation $y_i$ corresponds to the Mahalanobis distance between the original observation $x_i$ and the mean. Thus, we aim to find the $H$-subset, $|H| = h$, as a solution to a generalized form of problem (2). However, before delving into the functional setting, it is important to address the limitations of using the trace of raw observations as the objective function.

Remark 2.1. If the data were not standardized, the method would prioritize observations with the smallest Euclidean distance from the center of the data cloud. This would introduce a significant bias towards selecting spherical data subsets for covariance estimation. Conversely, if the data already originated from a spherical distribution, or at least exhibited "approximately" spherical behavior, this drawback would be mitigated.

Notation and preliminaries
Let $X$ be a random function on $L^2(I)$, i.e. the Hilbert space of square-integrable functions over the interval $I$. The corresponding covariance function is defined as $v(s,t) = \mathrm{E}((X(s) - \mu(s))(X(t) - \mu(t)))$ and is continuous, symmetric and positive semi-definite. Here, $\mu(t) = \mathrm{E}(X(t))$, $t \in I$, denotes the mean of $X$. Let further $C : L^2(I) \to L^2(I)$ be the corresponding covariance operator, defined by $Cx(t) = \mathrm{E}(\langle x, X - \mu \rangle (X(t) - \mu(t)))$, for $x \in L^2(I)$, $t \in I$. It can be shown that $C$ is a compact, symmetric, positive definite integral operator with kernel $v$, and thus allows the representation $Cx(t) = \int_I v(s,t) x(s)\, ds$; see Ramsay and Silverman [1997] for more insight. Let us for now assume that $\mu = 0$. Mercer's theorem [Mercer, 1909] then guarantees the existence of an orthonormal basis $\psi_i \in L^2(I)$, $i \in \mathbb{N}$, and a non-negative sequence $\lambda_1 \geq \lambda_2 \geq \cdots$, such that $v(s,t) = \sum_{i=1}^{\infty} \lambda_i \psi_i(s) \psi_i(t)$. In continuation, we will drop the argument $t$, as the summation and multiplication of functions are naturally defined element-wise. Furthermore, since $C$ is a compact, symmetric operator on $L^2(I)$, it can be represented by the spectral decomposition $C = \sum_{i=1}^{\infty} \lambda_i \psi_i \otimes \psi_i$, where $\lambda_i \to 0$ and the convergence of the partial sums is in the operator norm. Here $\psi_i$ is the $i$-th eigenfunction of $C$ corresponding to the eigenvalue $\lambda_i$, i.e., $C\psi_i = \lambda_i \psi_i$ for $i = 1, 2, \dots$. This further implies that $Cx = \sum_{i=1}^{\infty} \lambda_i \langle x, \psi_i \rangle \psi_i$ for $x \in L^2(I)$. As $C$ is an integral operator on $L^2(I)$, it is also a trace-class operator, further implying that $\sum_i \lambda_i < \infty$ [Ramsay and Silverman, 1997].
Let further $X_1, \dots, X_n$ be a functional random sample from $L^2(I)$. The sample mean $\bar{X}$ and the sample covariance operator $\hat{C}$ are defined as $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$ and $\hat{C} = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}) \otimes (X_i - \bar{X})$, where for $f, g \in L^2(I)$ the outer product $f \otimes g : L^2(I) \to L^2(I)$ is given by $(f \otimes g)x = \langle g, x \rangle f$. For a subset $H \subset \{1, 2, \dots, n\}$, $|H| = h$, we define $\bar{X}_H$, $\hat{C}_H$ and $\hat{C}^{NC}_H$ to be the trimmed sample mean, covariance, and non-centered covariance operators, respectively, calculated on the subset $H$. We denote further by $\hat{\psi}_{i,H}$ the $i$-th eigenfunction of $\hat{C}_H$.

Functional Regularized Minimum Covariance Trace Estimator

Tikhonov regularization
As indicated in Section 2, the extension of the optimization problem (2) involves defining a standardization of a random function $X \in L^2(I)$. Thus, we proceed by finding a standardized $Y \in L^2(I)$ that approximately solves $C^{1/2} Y = X$. In general, the problem has no solution in $L^2$, as $C$ is not invertible. We thus apply regularization to the corresponding ill-posed linear problem, particularly classical Tikhonov regularization with a squared $L^2$-norm penalty. In more formal terms, given the regularization parameter $\alpha > 0$, the $\alpha$-standardization of the zero-mean random function $X \in L^2(I)$ with covariance operator $C$ is defined as a solution to the optimization problem

$$X^{\alpha}_{st} \in \operatorname*{arg\,min}_{Y \in L^2(I)} \|C^{1/2} Y - X\|^2 + \alpha \|Y\|^2. \quad (3)$$

The Tikhonov regularization problem (3) admits the closed-form solution $X^{\alpha}_{st} = (C + \alpha I)^{-1} C^{1/2} X$; see Cucker and Zhou [2007] for more details. By the spectral theorem for compact and self-adjoint operators, we have $(C + \alpha I)^{-1} C^{1/2} = \sum_{j=1}^{\infty} \frac{\sqrt{\lambda_j}}{\lambda_j + \alpha} \psi_j \otimes \psi_j$, giving

$$X^{\alpha}_{st} = \sum_{j=1}^{\infty} \frac{\sqrt{\lambda_j}}{\lambda_j + \alpha} \langle X, \psi_j \rangle \psi_j,$$

where $\lambda_j$ and $\psi_j$, $j = 1, 2, \dots$, are the eigenvalues and eigenfunctions of $C$, respectively. The covariance operator of $X^{\alpha}_{st}$ is $C(X^{\alpha}_{st}) = C^2(C + \alpha I)^{-2}$, and due to the considerations above it allows the representation $C(X^{\alpha}_{st}) = \sum_{i=1}^{\infty} \lambda^{\alpha}_{i,st}\, \psi_i \otimes \psi_i$ with $\lambda^{\alpha}_{i,st} = \lambda_i^2/(\lambda_i + \alpha)^2$; for an illustration of the relation between $\alpha$ and $\lambda^{\alpha}_{i,st}$ see Figure 8. Furthermore, Parseval's identity implies

$$\|X^{\alpha}_{st}\|^2 = \sum_{j=1}^{\infty} \frac{\lambda_j}{(\lambda_j + \alpha)^2} \langle X, \psi_j \rangle^2.$$

Connection to the Reproducing Kernel Hilbert Space
In an attempt to generalize the concept of the Mahalanobis distance to functional data in $L^2(I)$, Berrendero et al. [2020] proposed to first approximate the random function $X \in L^2(I)$ by one in the RKHS $\mathcal{H}(C)$, where the kernel in question is in fact the covariance kernel, such that the approximation is "smooth enough". More precisely, for $X \in L^2(I)$ and a regularization parameter $\alpha > 0$, define $X^{\alpha} \in \operatorname*{arg\,min}_{x \in \mathcal{H}(C)} \|x - X\|^2 + \alpha \|x\|^2_C$. Proposition 2 in Berrendero et al. [2020] shows that $X^{\alpha} \in \mathcal{H}(C)$ admits the closed form $X^{\alpha} = C^{1/2} X^{\alpha}_{st}$. Berrendero et al. [2020] then define the $\alpha$-Mahalanobis distance between $X, Y \in L^2(I)$ as in Definition 4.1.

Definition 4.1. Given a constant $\alpha > 0$, the $\alpha$-Mahalanobis distance (w.r.t. $C$) between $X, Y \in L^2(I)$ is defined as $M_{\alpha}(X, Y) = \|X^{\alpha} - Y^{\alpha}\|_C$, where $X^{\alpha}$ and $\|\cdot\|_C$ are as defined above.
The $C$-norm of the regularized RKHS approximation $X^{\alpha}$ of the process $X$ has the form $\|X^{\alpha}\|^2_C = \|X^{\alpha}_{st}\|^2 = \sum_{j=1}^{\infty} \frac{\lambda_j}{(\lambda_j + \alpha)^2} \langle X, \psi_j \rangle^2$, where, without loss of generality, we assume $\mathrm{E}[X] = 0$.
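The $\alpha$-standardization and its norm identity can be checked numerically on a grid. The following sketch (our own setup with a discretized toy covariance; all names hypothetical) scales the $j$-th eigenscore of a sample path by $\sqrt{\lambda_j}/(\lambda_j + \alpha)$ and verifies the Parseval-type formula for the squared norm:

```python
import numpy as np

# Sketch of the alpha-standardization on a grid: the j-th eigenscore of x is
# scaled by sqrt(lam_j)/(lam_j + alpha); Parseval's identity then gives
# ||x_st||^2 = sum_j lam_j/(lam_j + alpha)^2 * <x, psi_j>^2.
rng = np.random.default_rng(1)
p = 200
t = np.linspace(0.0, 1.0, p)
C = np.minimum.outer(t, t) / p                   # discretized toy covariance
lam, psi = np.linalg.eigh(C)                     # eigenpairs of the covariance
lam = np.clip(lam, 0.0, None)

x = psi @ (np.sqrt(lam) * rng.normal(size=p))    # sample path with covariance C
alpha = 0.1
scores = psi.T @ x                               # <x, psi_j>
x_st = psi @ (np.sqrt(lam) / (lam + alpha) * scores)

norm_sq = float(np.sum(lam / (lam + alpha) ** 2 * scores ** 2))
```

On a real grid one would additionally weight inner products by the grid spacing; for the pure norm identity checked here, the unweighted version suffices.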
Adopting the notion of the $\alpha$-Mahalanobis distance as a meaningful dissimilarity measure in this context, the $\alpha$-standardization process produces a function in $L^2(I)$ whose $L^2$-norm is indeed equal to such a dissimilarity between $X$ and its mean. Thus, for the solution $X^{\alpha}_{st}$ of (3), the generalization of the optimization problem (2) is given by

$$H_0 \in \operatorname*{arg\,min}_{H : |H| = h} \mathrm{tr}(\hat{C}^{NC}_H(X^{\alpha}_{st})),$$

where $\hat{C}^{NC}_H(X^{\alpha}_{st})$ is the non-centered, trimmed sample covariance of the $\alpha$-standardized observations. The resulting estimators $\bar{X}_{H_0}$ and $k_{H_0} \hat{C}_{H_0}(X)$ are referred to as the Minimum Regularized Covariance Trace (MRCT) estimators of the mean and the covariance, respectively; $k_{H_0}$ is the corresponding consistency factor, calculated under the assumption of Gaussianity of the underlying process. In continuation, for simplicity of notation, we drop the argument $X$ if the (non-centered) covariance is calculated on the original sample.
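The objective above can be written as the mean squared $\alpha$-Mahalanobis distance of the subset curves from the subset mean. A minimal sketch, with hypothetical helper names and the eigendecomposition of the covariance assumed given:

```python
import numpy as np

# Minimal sketch of the MRCT objective: the trace of the non-centered
# covariance of the alpha-standardized subset equals the mean squared
# alpha-Mahalanobis distance of the subset curves from the subset mean.
def alpha_mahalanobis_sq(x, mean, lam, psi, alpha):
    scores = psi.T @ (x - mean)                  # eigenscores of x - mean
    return float(np.sum(lam / (lam + alpha) ** 2 * scores ** 2))

def mrct_objective(X, H, lam, psi, alpha):
    mean = X[H].mean(axis=0)
    return float(np.mean([alpha_mahalanobis_sq(x, mean, lam, psi, alpha)
                          for x in X[H]]))
```

A subset containing an outlying curve yields a larger objective value than an all-regular subset, which is what drives the subset search.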

Algorithm
In practice, in the $\alpha$-standardization process, the true mean and covariance operator are unknown. Therefore, we tackle the problem by iteratively replacing them with their current robust estimates, thus yielding an implicit equation for the optimal subset: $H_0$ consists of the indices of the $h$ smallest squared norms $\|X^{\alpha, k_{H_0}}_{i,st,H_0}\|^2$, $i \in \{1, \dots, n\}$, where the standardization is based on $\bar{X}_{H_0}$ and $k_{H_0} \hat{C}_{H_0}$, and $k_{H_0}$ is the consistency factor of $\hat{C}_{H_0}$ determined under the assumption of Gaussianity. In continuation, in order to emphasize the dependency of $X^{\alpha}_{i,st,H}$ on $k$, we write $X^{\alpha,k}_{i,st,H}$. Equation (6) thus implies that we search for the $h$-subset $H_0$, used in the estimation of the covariance operator, in the set of fixed points of a function $f$ mapping each $h$-subset $H$ to the set of indices of the $h$ smallest squared $\alpha$-Mahalanobis distances computed from the estimates based on $H$. Then $f$ has at least one fixed point.
To estimate the scaling factor $k_{H_0}$ of $\hat{C}_{H_0}$, we employ the strategy used in Rousseeuw [1984], matching the median of the obtained squared $\alpha$-Mahalanobis distances with the median of the corresponding limiting distribution under the assumption of Gaussianity. For that purpose, we use the following two results.
Corollary 5.1 then implies that $k_{H_0}$ can be estimated by matching the sample median of the squared $\alpha$-Mahalanobis distances (i.e. $\|X^{k_{H_0}, \alpha}_{i,st,H_0}\|^2$, $i = 1, \dots, n$) with the estimate of the theoretical median of the weighted sum of independent $\chi^2(1)$ random variables, as described in the result. However, as the eigenvalues $\lambda_i$, $i \geq 1$, of the true covariance are unknown, we estimate them with the eigenvalues $k_{H_0} \hat{\lambda}_{i,H_0}$, $i \geq 1$, of the current robust estimate $k_{H_0} \hat{C}_{H_0}$ of the covariance operator, and choose the scaling parameter $k$ such that the sample median $\mathrm{med}_i \|X^{\alpha,k}_{i,st,H_0}\|^2$ matches the resulting median estimate. Thus, for $H_0 \subset \{1, \dots, n\}$ we determine $k_{H_0}$ by solving the implicit Equation (7).
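One way to picture the median-matching step is the following heavily simplified sketch (the damped fixed-point update and all names are our own; the paper solves its Equation (7) exactly). Under Gaussianity the squared $\alpha$-distance behaves like $\sum_j w_j \chi^2(1)$ with $w_j = \lambda_j^2/(\lambda_j + \alpha)^2$, and the theoretical median can be estimated by Monte Carlo:

```python
import numpy as np

# Hedged sketch of the median-matching step: the scaling k is chosen so that
# the sample median of the squared distances matches a Monte Carlo estimate of
# the median of sum_j w_j * chi2(1), with w_j = lam_j^2/(lam_j + alpha)^2.
rng = np.random.default_rng(2)

def chi2_mix_median(weights, n_mc=20000):
    z = rng.normal(size=(n_mc, len(weights)))
    return float(np.median((z ** 2) @ weights))

def consistency_factor(sq_dists, lam_hat, alpha, iters=25):
    k = 1.0
    for _ in range(iters):
        w = (k * lam_hat) ** 2 / (k * lam_hat + alpha) ** 2
        target = np.median(sq_dists) / chi2_mix_median(w)
        k = 0.5 * (k + target)                   # damped update for stability
    return k
```

If the sample distances really do follow the reference mixture, the returned factor should be close to one.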
The strategy described above for finding, given $\alpha > 0$, the solution to (6) is presented in Algorithm 1, which we refer to as the Minimum Regularized Covariance Trace (MRCT) algorithm.
Algorithm 1: MRCT algorithm
Input: Sample $\{X_1, \dots, X_n\}$, an initial subset $H_1 \subset \{1, \dots, n\}$ with $|H_1| = h$, regularization parameter $\alpha > 0$, and tolerance level $\varepsilon_k > 0$.

The output of the MRCT algorithm is a (possibly non-unique) solution to the implicit Equation (6). Therefore, the MRCT algorithm potentially yields multiple outcomes based on different initial subsets. These then provide a range of options for selecting the optimal robust covariance and $\alpha$-Mahalanobis distance estimators. In the following, we explore methods for effectively choosing among the candidates generated by the algorithm.
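The fixed-point iteration at the core of the algorithm can be sketched compactly for curves observed on a common grid. This is our own simplified rendering (the consistency factor and the refinements of Algorithm 1 are omitted, and all names are hypothetical): estimate mean and covariance on the current subset, compute all squared $\alpha$-distances, and keep the $h$ smallest.

```python
import numpy as np

# Compact sketch of the MRCT fixed-point (C-step-like) iteration on gridded
# curves; consistency scaling omitted for brevity.
def mrct(X, h, alpha, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    H = np.sort(rng.choice(n, size=h, replace=False))   # initial h-subset
    d2 = np.zeros(n)
    for _ in range(n_iter):
        mu = X[H].mean(axis=0)
        Xc = X[H] - mu
        Cov = Xc.T @ Xc / h                             # subset covariance (grid level)
        lam, psi = np.linalg.eigh(Cov)
        lam = np.clip(lam, 0.0, None)
        scores = (X - mu) @ psi                         # eigenscores of all curves
        d2 = scores ** 2 @ (lam / (lam + alpha) ** 2)   # squared alpha-distances
        H_new = np.sort(np.argsort(d2)[:h])             # h smallest distances
        if np.array_equal(H_new, H):                    # fixed point reached
            break
        H = H_new
    return H, d2
```

On data with clearly shifted outlying curves, the iteration typically settles on a clean subset within a few steps, mirroring the fast convergence reported for the full algorithm.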
i) A natural criterion is to select the candidate subset that yields the smallest total $\alpha$-Mahalanobis distance. The idea is intuitive, since the $\alpha$-Mahalanobis distances should be small for regular observations. This criterion can be modified by taking the sum of only the first $q\%$ of the ordered $\alpha$-Mahalanobis distances, for some $q$, where the considered percentage must not be too high, in order not to include outlying observations (e.g. $q = 75\%$).
ii) An alternative (data-driven) strategy for choosing the optimal solution to (6) focuses on searching for a subset $H \in \mathcal{X}$ for which the obtained $\alpha$-Mahalanobis distances group into two clusters, corresponding to non-outlying and outlying observations. If the outliers were known, one would prefer a data subset for which the $\alpha$-Mahalanobis distances $M_{\alpha, k_{H_0} \hat{C}_{H_0}}(X_i, \bar{X}_{H_0})$, $i = 1, \dots, n$, when grouped based on outlyingness status, have a maximal ratio of between- to within-cluster variability. As the class labels are unknown, an approach is to consider unsupervised classification instead, where the objective is the performance measure of the chosen clustering algorithm, e.g. the ratio of between- to within-cluster variability (see Example 7.2), or the Davies-Bouldin index [Davies and Bouldin, 1979]. For this purpose, one could also use the squared excess kurtosis, i.e. the squared standardized fourth moment, of the $\alpha$-Mahalanobis distances. The kurtosis of univariate projections is widely used as a projection index in independent component analysis (ICA) and clustering; see e.g. Radojičić et al. [2021], while the use of kurtosis as a projection index in outlier detection is discussed in more detail e.g. in Peña and Prieto [2001]. If outlying curves are expected, a higher squared excess kurtosis could indicate a better separation between regular and irregular observations. We emphasize that selecting an appropriate value of $\alpha$ can significantly reduce the sensitivity of the method to the initial approximation. Consequently, the choice of $H_{opt} \in \mathcal{X}$ is observed to have minimal influence on the final outcomes, including the covariance estimate and outlier detection. For more details see Appendix A.
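The two candidate-ranking scores just described can be sketched as simple one-dimensional functions of the distance vector (our own simplified versions; a full clustering algorithm such as 2-means would serve the same purpose):

```python
import numpy as np

# Two data-driven scores for ranking candidate solutions by how well their
# alpha-Mahalanobis distances split into a "regular" and an "outlying" group.
def excess_kurtosis_sq(d):
    z = (np.asarray(d, float) - np.mean(d)) / np.std(d)
    return float((np.mean(z ** 4) - 3.0) ** 2)

def between_within_ratio(d):
    """Best two-cluster split of sorted distances: between/within variability."""
    d = np.sort(np.asarray(d, float))
    total = ((d - d.mean()) ** 2).sum()
    best = 0.0
    for cut in range(1, len(d)):
        a, b = d[:cut], d[cut:]
        within = ((a - a.mean()) ** 2).sum() + ((b - b.mean()) ** 2).sum()
        if within > 0:
            best = max(best, (total - within) / within)
    return best
```

A distance vector with a clear gap between regular and outlying observations scores much higher on the between-to-within ratio than one without structure.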

Automated selection of the regularization parameter α and the subset size h
In the following, we propose an automated approach for selecting the most suitable regularization parameter $\alpha$. We commence by demonstrating the degree of continuity exhibited by Algorithm 1 with respect to $\alpha$. Berrendero et al. [2020, Proposition 9] show that for a sequence $\alpha_n > 0$ with $\alpha_n \to \alpha$ as $n \to \infty$, $X^{\alpha_n}_{st} \to X^{\alpha}_{st}$. The subsequent proposition establishes the validity of the statement even when the standardization of $X$ is performed using the sample covariance operator, computed on a specific subset $H_0 \subset \{1, \dots, n\}$.
Proposition 5.2. Let $X \in L^2(I)$, and let $k_{H_0} \hat{C}_{H_0}$ and $\bar{X}_{H_0}$, with $|H_0(n)| \to \infty$ as $n \to \infty$, be strongly consistent estimators of the covariance operator $C$ and the mean $\mu = 0$, respectively. Then, for a sequence $\alpha_n > 0$ with $\alpha_n \to \alpha$, the corresponding sample-based standardizations converge, i.e. $X^{\alpha_n, k_{H_0}}_{st, H_0} \to X^{\alpha, k_{H_0}}_{st, H_0}$.

Proposition 5.3 implies that the crucial step of Algorithm 1 is continuous w.r.t. $\alpha$. More precisely, if $\alpha_1$ and $\alpha_2$ are close enough (so that the ordering of the Mahalanobis distances in STEP 8 of Algorithm 1 is preserved; see Proposition 5.3), then, given the same initial estimate $H_0$, using either $\alpha_1$ or $\alpha_2$ produces the same updated subset $H_1$.
Proposition 5.3. Let $X \in L^2(I)$, and let $k_{H_0} \hat{C}_{H_0}$ and $\bar{X}_{H_0}$ be the $H_0$-subset sample estimators of the covariance operator $C$ and the mean $\mu = 0$, respectively. Then, for every $\varepsilon > 0$ there exists $\delta > 0$ such that for all $\alpha_1, \alpha_2 > 0$ with $|\alpha_1 - \alpha_2| < \delta$, the corresponding standardizations differ by at most $\varepsilon$ in norm. The proof of Proposition 5.3 follows the reasoning in the proof of Proposition 5.2 and is thus omitted.
Remark 2.1 argues that using the trace of the covariance of non-standardized observations as an objective results in spherically-biased subsets; this obstacle is mitigated by first standardizing the multivariate data so that the resulting covariance is proportional to the identity. As no random function in $L^2(I)$ has an identity covariance, the approach is to choose $\alpha$ such that the eigenvalues of $C(X^{\alpha}_{st})$ are either as close to 0 as possible (for the lower part of the spectrum), or as close as possible to some common non-zero value, making them also mutually as close together as possible (for the upper part of the spectrum). More precisely, the aim is to select $\alpha$ so that the eigenvalues $\lambda^{\alpha}_{i,st}$ of $C(X^{\alpha}_{st})$ cluster into two groups, where the center of one group is 0 and the joint within-cluster sum of squares is minimal. The approach amounts to minimizing the noise in the data while at the same time making all signal components "equally important"; see Figure 8 in the Appendix for an illustration. On the sample level, for fixed $\alpha > 0$ and the robust estimate of the covariance $k_{H_0} \hat{C}_{H_0}$, we first calculate $\hat{\lambda}^{\alpha}_{i,st,H_0} = (k_{H_0} \hat{\lambda}_{i,H_0})^2 / (k_{H_0} \hat{\lambda}_{i,H_0} + \alpha)^2$, where $\hat{\lambda}_{i,H_0}$ is the $i$-th eigenvalue of $\hat{C}_{H_0}$. Observe first that, as $\hat{\lambda}_{1,H_0} \geq \hat{\lambda}_{2,H_0} \geq \cdots$, and due to the monotonicity of $x \mapsto x^2/(x + \alpha)^2$, the ordering is preserved for the standardized eigenvalues. Thus, for any cluster center $c_{A_1} > 0$, there exists $i_0 \in \mathbb{N}$ such that $(\hat{\lambda}^{\alpha}_{i,st,H_0} - c_{A_1})^2 > (\hat{\lambda}^{\alpha}_{i,st,H_0})^2$ for all $i \geq i_0$, implying that the cluster $S_1$ around $c_{A_1}$ is finite, and the problem of finding an optimal partition amounts to finding a split index $m_{\alpha} > 0$ minimizing the joint within-cluster sum of squares. The value of the objective function used for $\alpha$-selection, $g(\alpha)$, is this minimal joint within-cluster sum of squares, and we search for $\alpha \in \arg\min g(\alpha)$; see the left plot of Figure 1 for an illustration. The approach is to optimize $g$ iteratively, starting from an initial value $\alpha_0$. For an initial $\alpha$, robust estimates of the eigenvalues of $C$ are calculated, and the value of the objective function $g$ is approximated for each $\alpha$ considered in the optimization. The optimal $\alpha$ found this way is then used for re-estimation of the robust eigenvalues of $C$. The process is repeated until convergence; see Algorithm 2. In the simulation study (Section 6) we use $\alpha_0 = 0.01$ for $p < n$ (see the simulation study in Berrendero et al. [2020] for more insight), and $\alpha_0 = 1$ for $p > n$. However, we noticed that the initial value $\alpha_0$ has little to no effect on the output of Algorithm 2. We emphasize that, due to Proposition 5.3, there is no need to run Algorithm 2 on a very dense grid of $\alpha$'s, which additionally lowers the computational burden.
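Our reading of the $\alpha$-selection criterion can be condensed into a short sketch (names and the toy spectrum are hypothetical): standardize a robust spectrum by $\lambda^2/(\lambda + \alpha)^2$, split it into a top "signal" cluster around its own mean and a "noise" cluster around 0, and score $\alpha$ by the minimal joint within-cluster sum of squares.

```python
import numpy as np

# Sketch of the alpha-selection objective g(alpha): cluster the standardized
# eigenvalues into a "signal" group (around its own mean) and a "noise" group
# (around 0), and take the minimal joint within-cluster sum of squares.
def g(alpha, lam):
    lv = np.sort(lam ** 2 / (lam + alpha) ** 2)[::-1]   # decreasing order
    best = np.inf
    for m in range(1, len(lv)):                         # split: top-m vs rest
        top, rest = lv[:m], lv[m:]
        ss = ((top - top.mean()) ** 2).sum() + (rest ** 2).sum()
        best = min(best, float(ss))
    return best

lam = np.array([5.0, 4.0, 3.0, 0.01, 0.005, 0.001])     # toy robust spectrum
alphas = np.linspace(0.01, 5.0, 200)
vals = [g(a, lam) for a in alphas]
alpha_opt = float(alphas[int(np.argmin(vals))])
```

In the full procedure this grid search alternates with re-estimation of the robust eigenvalues, as described above.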
Algorithm 2: Algorithm for choosing the regularization parameter $\alpha$
Input: Sample $\{X_1, \dots, X_n\}$; initial approximation $\alpha_0$. For each candidate $\alpha_i > 0$, run the MRCT algorithm (Algorithm 1) and obtain robust estimates $\{k_{H_0} \hat{\lambda}^{\alpha_i}_{j,H_0} : j \geq 1\}$ of the eigenvalues of $C$.

Finally, we provide some thoughts on the choice of the subset size $h$. It is clear that the number of contaminated samples should be smaller than $n - h$, and that $h$ should be at least $n/2$ to obtain a fit to the data majority. Therefore, the choice of $h$ is a trade-off between the robustness and efficiency of the estimator; see the discussion in Rousseeuw [1984] for more details. In practice, the proportion of outliers is in general not known in advance, and thus we consider a data-driven approach for the choice of $h$, as suggested by Boudt et al. [2020]: one considers a range of possible values for $h$ and searches for significant changes in the computed value of the objective function or the estimate. Big shifts in the norm of the difference between subsequent covariance estimates, as well as in the value of the trace of the corresponding covariance for consecutive subset sizes, can indicate the presence of outliers; see the right plot of Figure 1 for an illustration.

Figure 1: The left plot displays a typical trajectory of the objective for selecting $\alpha$ for Model 1 ($n = 200$, $p = 100$, $c = 0.2$) in the simulation study in Section 6. The right plot shows the objective in (8) with respect to the subset size $n/2 \leq h \leq n$, corresponding to the setting on the left and $\alpha \approx 0.6$. A sudden increase is noticeable around $h = 160$, corresponding to the moment where the first outliers would be included in the subset.

Sparsely observed data
In practice, one does not observe smooth functions, but rather discrete sets of underlying function values $X_i(t_{i,j})$, $t_{i,j} \in \{t_{i,1}, \dots, t_{i,p_i}\}$, $i = 1, \dots, n$, where $t_{i,j} \in I$ for $i = 1, \dots, n$, $j = 1, \dots, p_i$. If $p_1 = \cdots = p_n = p$ and $t_{i,k} = t_{j,k}$ for all $i, j = 1, \dots, n$ and $k = 1, \dots, p$, we say that the data are observed on a regular grid. Although there is no precise definition of "dense" functional data, it is conventionally understood that functional data are densely sampled when the quantities $p_i$, $i = 1, \dots, n$, tend to infinity rapidly enough. In that case, the functional inner products in Algorithm 1 can be approximated by the corresponding integral sums, where the accuracy of the approximation depends on $p$; see e.g. Ostebee and Zorn [2002] for more details on integral approximations. Given that the data are densely observed, robust estimates of the mean and covariance can be approximated empirically at the observed times $t_i$, $i = 1, \dots, p$. Robust estimates of the mean and covariance operator on the entire interval $I$ can then be obtained, e.g., by smooth interpolation of the corresponding robust sample estimates [Wang et al., 2015]. Smooth interpolation of the obtained robust estimates is beyond the scope of this paper; see further e.g. Berkey et al. [1991], Castro et al. [1986] for more insight on FDA for regularly sampled data on a dense grid.
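For densely observed curves, the replacement of $L^2$ inner products by quadrature sums mentioned above is straightforward; a small trapezoidal-rule helper (our own sketch, not a function from the paper) illustrates the idea:

```python
import numpy as np

# On a dense regular grid the L2 inner products used by the algorithm can be
# approximated by quadrature; here a simple trapezoidal-rule helper.
def l2_inner(f, g, t):
    fg = f * g
    return float(np.sum((fg[:-1] + fg[1:]) * np.diff(t)) / 2.0)

t = np.linspace(0.0, 1.0, 1001)
f = np.sin(2 * np.pi * t)
ip = l2_inner(f, f, t)   # approximates the integral of sin^2 over [0, 1], i.e. 1/2
```

The approximation error shrinks as the grid becomes denser, which is the sense in which the accuracy "depends on $p$".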
An arguably more common and difficult case is referred to as sparse data. Sparse functional data represent situations where only a small number of irregularly-spaced measurements are available for each observed function. Sparse data may arise due to limitations in data collection or due to the nature of the phenomenon being observed. Analyzing sparse functional data poses unique challenges, as there are fewer data points available to capture the underlying patterns and structures. Therefore, specialized techniques are often employed to effectively handle and extract meaningful information from sparse functional data; see e.g. Ramsay and Silverman [1997], Wahba [1990] for a general overview. A common approach in FDA when dealing with irregularly observed and sparse data is to assume that the functional data can be represented using a finite set of basis functions. We therefore express each observed datum $X_i(t)$, $i = 1, \dots, n$, in a fixed, common basis $\Phi(t) = (\phi_1(t), \dots, \phi_M(t))^\top$, $t \in I$, as

$$X(t) = \mathbf{C} \Phi(t), \quad (9)$$

where $\mathbf{C}$ is the $n \times M$ matrix of coefficients of the expansion. Then, $\bar{X} = \frac{1}{n} \sum_{i=1}^n e_i^\top \mathbf{C} \Phi = \frac{1}{n} \mathbf{1}^\top \mathbf{C} \Phi$, where $\mathbf{1}$ is a vector of ones. The number of basis functions $M$ is chosen individually for a given data set, usually in such a way that the functional approximations are close to the original observations, with some smoothing that eliminates most of the noise. The choice of the basis is important, and common choices include polynomial functions, wavelets, Fourier bases, and spline bases, among others; for an overview, see e.g. Ramsay and Silverman [1997], Kokoszka and Reimherr [2017]. Note that $\Phi$ is not necessarily an orthonormal basis; e.g. B-splines [see e.g. Ferraty and Vieu, 2007, Eilers and Marx, 1996] are a common choice. Denote now $\mathbf{M} = \langle \Phi, \Phi \rangle$, with $\mathbf{M}_{i,j} = \langle \phi_i, \phi_j \rangle$, $i, j = 1, \dots, M$. Since the basis functions $\phi_i$ are linearly independent, $\mathbf{M}$ is a regular symmetric matrix. Moreover, for $u \in \mathbb{R}^M$, $u^\top \mathbf{M} u = \|u^\top \Phi\|^2 \geq 0$, where $\|u^\top \Phi\|^2$ is the squared $L^2$-norm of $u^\top \Phi \in L^2(I)$. Thus, $\mathbf{M}^{-1/2}$ is well defined, making $\tilde{\Phi} := \mathbf{M}^{-1/2} \Phi$ orthonormal. Equation (9) can then be rewritten as

$$X(t) = \tilde{\mathbf{C}} \tilde{\Phi}(t), \quad (10)$$

where $\tilde{\mathbf{C}} = \mathbf{C} \mathbf{M}^{1/2}$ is the matrix of coefficients in the expansion with the new, orthonormal basis $\tilde{\Phi}$. This indicates that we can, without loss of generality, assume that the basis functions are orthonormal, i.e. $\mathbf{M} = I$, and in the continuation we proceed under this assumption. Once the data are represented as in (10), one might sample the smoothed functions on an arbitrarily dense grid and proceed with the algorithm as in the case where the underlying functions are observed on a regular, dense grid. However, the following proposition shows that this is not necessary, and that we can work solely with the matrix of coefficients $\mathbf{C}$, thus lowering the computational cost and minimizing the approximation error.
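The change of coefficients $\tilde{\mathbf{C}} = \mathbf{C}\mathbf{M}^{1/2}$ can be sketched directly (the toy Gram matrix and names below are our own). The key invariant is that the Gram matrix of the curves is preserved: $\tilde{\mathbf{C}}\tilde{\mathbf{C}}^\top = \mathbf{C}\mathbf{M}\mathbf{C}^\top$, so all $L^2$ inner products between curves are unchanged.

```python
import numpy as np

# Sketch of the change of coefficients C_tilde = C M^{1/2} that replaces a
# non-orthonormal basis by an orthonormal one (M is the Gram matrix of the basis).
def orthonormalize_coefs(coefs, M):
    lam, U = np.linalg.eigh(M)                  # M is symmetric positive definite
    M_half = U @ np.diag(np.sqrt(lam)) @ U.T    # symmetric square root of M
    return coefs @ M_half

M = np.array([[1.0, 0.4],                       # Gram matrix of two correlated
              [0.4, 1.0]])                      # basis functions
coefs = np.array([[1.0, 2.0],
                  [0.5, -1.0]])
coefs_t = orthonormalize_coefs(coefs, M)
```

After this transformation one can work with the coefficient matrix alone, as the subsequent algorithm does.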
Using Proposition 5.4, we give the modification of Algorithm 1 for the case where the data are represented in a finite orthonormal basis $\tilde{\Phi}$.
Algorithm 3: Fixed-point algorithm for solving (6) when $X = \mathbf{C} \tilde{\Phi}$, for an orthonormal basis $\tilde{\Phi}$
Input: Orthonormal basis $\tilde{\Phi} = (\tilde{\phi}_1, \dots, \tilde{\phi}_M)^\top$, data matrix $X = (X_1, \dots, X_n)^\top$, an initial subset $H_1 \subset \{1, \dots, n\}$, regularization parameter $\alpha > 0$, and tolerance level $\varepsilon_k > 0$. Express $X$ in the basis $\tilde{\Phi}$ as $X = \mathbf{C} \tilde{\Phi}$, where $\mathbf{C} \in \mathbb{R}^{n \times M}$ is the matrix of coefficients.

The adaptation of Algorithm 2 is straightforward and is thus omitted. Since the rank of the coefficient-based covariance is at most $h \leq n$, the matrix becomes singular when the number of basis functions exceeds the number of observations. However, by adding $\alpha I$, for $\alpha > 0$, the resulting matrix remains regular. This property allows for the use of an arbitrarily large number of basis functions when expressing the data, thereby reducing the impact of the specific choice of basis. It is important to note that if a small number of basis functions is employed in practice, careful selection of the basis functions becomes crucial to capture most of the data's variability.

Simulation study
This section aims to assess the accuracy of the proposed MRCT algorithm (Algorithm 1) through a simulation study similar to the one conducted in Berrendero et al. [2020]. In their work, the authors extended the framework from Arribas-Gil and Romo [2014] to compare various functional outlier detection methods with the newly introduced regularized Mahalanobis distance. The selected methods can be classified based on their underlying principles. The first group of outlier detection methods relies on different functional depths, which quantify the outlyingness of curves based on their centrality within a data cloud [Fraiman and Muniz, 2001]. In this study, we consider the following methods: functional boxplots [Sun and Genton, 2011], which depend on the center-outward ordering induced by the band depth introduced in López-Pintado and Romo [2009]; depth-based trimming and weighting [Febrero et al., 2008], iterative approaches that remove outliers from the data until no atypical curves remain; and the adjusted outliergram [Arribas-Gil and Romo, 2014], which combines two depth-based measures into a single index for assessing outlyingness.
The second category of outlier detection methods utilizes functional principal components. In this category, we consider the integrated square error [Hyndman and Shahid Ullah, 2007], which calculates the integral of the squared difference between a function and its projection onto the first $K$ robust principal components.
The final group of methods is based on the Mahalanobis distance and includes: the robust Mahalanobis distance, which treats the curves as multivariate data and is calculated from a robust version of the sample covariance; additionally, not considered in the previous analysis, the MRCD estimator, which also treats functional data as multivariate; and the Mahalanobis distance based on the reproducing kernel Hilbert space (RKHS) [Berrendero et al., 2020], which, due to its reliance on the MCD estimator, can only be applied in the $p < n$ setting. The MRCT algorithm proposed in this paper also falls within this category of methods.
These methods from the different categories collectively form the basis for evaluating the performance of the MRCT algorithm in outlier detection.
In the simulation, we consider three different models that correspond to various outlier settings.These models are designed to capture different scenarios and provide a comprehensive evaluation of the performance of the methods under varying outlier conditions.
Additionally, u follows a Bernoulli distribution with parameter 0.5, and µ is uniformly distributed on the interval [0.25, 0.75].
We conduct an analysis for each model by generating a random sample of $n = 200$ observations. These observations are recorded at equidistant time points on the interval $[0, 1]$, with two different settings for the number of variables, namely $p = 100$ and $p = 500$. In the context of multivariate statistics, we refer to the $p = 100$ and $p = 500$ settings as low- and high-dimensional, respectively.
The number of outliers in each setting is determined as $n \cdot c$, where $c$ represents the contamination rate. In our study, we consider contamination rates of 0, 0.05, 0.1, and 0.2, allowing us to examine scenarios with no outliers, a small percentage of outliers, and a higher percentage of outliers. Figure 2 provides a visual representation of the different main and contaminated processes in the simulation. To identify the outliers, we compare the obtained robust $\alpha$-Mahalanobis distances from Algorithm 1 to the cutoff value corresponding to the estimate of the 99%-quantile of the limiting distribution of the robust $\alpha$-Mahalanobis distances under the assumption of Gaussianity; see Corollary 5.1. To estimate the cutoff value, we conduct a Monte Carlo simulation with 2000 repetitions.
The performance of all methods is evaluated using three metrics: the true positive rate (TPR), which represents the ratio of correctly classified outliers; the false positive rate (FPR), which represents the ratio of misclassified regular observations; and the F-score, calculated as TPR/(TPR + 0.5(FPR + FNR)), where FNR corresponds to the ratio of unidentified outliers. Each setting is replicated 50 times to ensure the robustness of the results. For the MRCT method, the subset size h is consistently set to 75% of the data, while the regularization parameter α is determined following the procedure described in Section 5.1.
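The three metrics can be computed directly from the flagged and true outlier labels; the helper below is illustrative and not taken from the paper:

```python
def detection_metrics(flagged, is_outlier):
    """TPR, FPR, and the F-score TPR / (TPR + 0.5 * (FPR + FNR))."""
    tp = sum(f and o for f, o in zip(flagged, is_outlier))
    fp = sum(f and not o for f, o in zip(flagged, is_outlier))
    n_out = sum(is_outlier)
    n_reg = len(is_outlier) - n_out
    tpr = tp / n_out            # share of outliers correctly flagged
    fpr = fp / n_reg            # share of regular observations misflagged
    fnr = 1.0 - tpr             # share of outliers missed
    f_score = tpr / (tpr + 0.5 * (fpr + fnr))
    return tpr, fpr, f_score

tpr, fpr, f = detection_metrics([True, True, False, False, True],
                                [True, True, True, False, False])
print(tpr, fpr, f)
```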
Outlier detection rates for all three models are given in Figure 3, where, for the sake of conciseness, we present the results for the contamination rate of c = 0.2. The figure includes true positive and true negative rates, as well as the F-score, over 50 replications. Boxplots visualize the rate distributions.
The MRCT algorithm demonstrates good overall performance, often outperforming the other methods in both low- and high-dimensional settings. It achieves high true positive rates, correctly classifying almost all outliers, and maintains desirable true negative rates. Some methods yield higher true negative rates but are rather conservative in the sense that they overlook outliers. The MRCT strikes a good balance, correctly classifying outliers while retaining regular observations.
In addition to outlier detection, accurate estimation of the data covariance is very important. Therefore, for each replication, we assess the accuracy of the sample estimate $\hat{\gamma}_{\mathrm{reg}}$ of the covariance function $\gamma$ of the non-contaminated process using an approximation of the integrated square error (ISE) calculated at the observed time points: $\mathrm{ISE} = p^{-2}\sum_{i,j=1}^{p} \big(\gamma(t_i, t_j) - \hat{\gamma}_{\mathrm{reg}}(t_i, t_j)\big)^2$. For each method, $\hat{\gamma}_{\mathrm{reg}}$ is a sample covariance estimator calculated on the identified non-outliers. The results are shown in Figure 4. The computational effort is another important aspect of an algorithm. For a fixed iteration of Algorithm 1, in STEP 3, the cost of computing $\hat{C}_{H_0}$ is $O(p^2 h)$, while the cost of computing the corresponding eigendecomposition is $O(p^3)$. Further, for a fixed iteration of the inner loop, in STEP 7, the cost of calculating the standardized observations is that of a matrix-vector multiplication, i.e. $O(p^2)$, as the eigendecomposition of $\hat{C}_{H_0}$ is precalculated, thus simplifying the calculation of the corresponding matrix square root and inverse. The simulation study indicated that both the inner and the outer loops converge in just a few iterations, suggesting that the cost of Algorithm 1, depending on the number of observed time points p and the subset size h, is $O(\max\{h p^2, p^3\})$. The algorithm is applied to two real-world datasets, starting with an example focusing on the El Niño-Southern Oscillation (ENSO) phenomenon in the equatorial Pacific Ocean. ENSO is a recurring climate pattern characterized by two well-known phases called El Niño and La Niña. These events involve unusual warming of the ocean surface, which can have global impacts on weather conditions [Trenberth, 1997]. One classification method for identifying these events is the Oceanic El Niño Index (ONI), which utilizes 3-month running means of sea surface temperature (SST) in specific regions of the Pacific Ocean. An indicator of an ENSO event is when five consecutive means are above or
below a threshold. The relevant region for the analysis is Niño 1+2, located near the west coast of South America. Weekly temperature data, available at https://www.cpc.ncep.noaa.gov/data/indices/, from March 1982 to February 2023 is considered, with each yearly episode consisting of the initial observation in March and the following 52 weeks, resulting in a 41 × 53 data matrix.
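The construction of the episodes-by-weeks data matrix can be sketched as follows; the actual file layout at cpc.ncep.noaa.gov is not reproduced here, so the weekly series and season start indices are stand-ins:

```python
import numpy as np

def yearly_episodes(weekly, starts, length=53):
    """Stack a weekly SST series into an episodes-by-weeks matrix.

    `weekly` is a 1-D array of weekly temperatures and `starts` holds
    the index of the first March observation of each season; both are
    assumptions about the preprocessed input, not the raw data format.
    """
    return np.vstack([weekly[s:s + length] for s in starts])

weekly = np.arange(41 * 53, dtype=float)   # stand-in weekly series
starts = np.arange(0, 41 * 53, 53)         # one season every 53 weeks
M = yearly_episodes(weekly, starts)
print(M.shape)  # (41, 53)
```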
The objective of this analysis is to identify atypical behavior in the yearly sea surface temperature, indicating the presence of an El Niño event. The regularization parameter α is automatically selected using Algorithm 2 and set to α = 2. The subset size h is fixed at ⌊0.75n⌋. The identified outliers are depicted in Figure 6 and correspond to the seasons 1982/83, 1983/84, 1997/98, 1998/99, and 2015/16. These seasons are associated with some of the strongest El Niño events, in the years 1982/83, 1997/98, and 2015/16, and the outliers have also been detected using other methods, as referenced e.g. in Huang and Sun [2019] or Suhaila [2021].
The prediction of such events holds significant relevance in the agricultural sector. Therefore, it is of interest to investigate whether these events can be forecast using information from preceding years, either at the beginning of or a few months into the episodes. In this study, we explore this notion by initially considering only the temperatures of the first 8 weeks, approximately corresponding to March and April, and subsequently extending the temperature values by incorporating data from the following weeks, resulting in a total of 46 data sets in which outlier detection is performed. As no information was available prior to 1982, we specifically focus on the more recent outlier episodes of 1997/98, 1998/99, and 2015/16. In cases where the data is sparse due to a limited number of available weeks, we apply data smoothing techniques as described in Section 5.2 using Algorithm 3. We utilize a B-spline basis, where the number of basis functions depends on the number of weeks but is capped at a maximum of 15. The findings are presented in Figure 6, showcasing the results obtained from analyzing the smoothed data. For the episode spanning from 1997 to 1998, the outlying behavior emerges at the beginning of June, as denoted by the black dot. Subsequently, the observations consistently exhibit outlying characteristics, reflected in the different linetype of the corresponding curve from that point onward. When initially considering the data comprising the temperatures of the first 8 weeks, the observations from the following year, 1998/99, are promptly identified as outliers due to their continuation of the trend of above-average sea surface temperatures. As the year progresses, the El Niño event weakens, evident from the declining SST values. The subsequent El Niño event in 2015/16 exhibits distinct patterns, suggesting differences in its underlying dynamics. Consequently, this observation is only identified as an outlier starting from mid-July.
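The basis representation step can be illustrated with a least-squares B-spline fit whose number of basis functions is capped at 15; this is a generic SciPy sketch, not the exact procedure of Algorithm 3:

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

def bspline_coefs(x, y, n_basis=15, k=3):
    """Least-squares cubic B-spline fit; returns the basis coefficients.

    The cap of 15 basis functions mirrors the text; the knot placement
    (uniform interior knots) is an assumption.
    """
    n_basis = min(n_basis, len(x) - 1)       # cap by the data size
    inner = np.linspace(x[0], x[-1], n_basis - k + 1)[1:-1]
    t = np.r_[[x[0]] * (k + 1), inner, [x[-1]] * (k + 1)]
    return make_lsq_spline(x, y, t, k).c

x = np.linspace(0, 1, 40)
c = bspline_coefs(x, np.sin(2 * np.pi * x))
print(len(c))  # 15
```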
Furthermore, an investigation of the current season, 2023/24, was also conducted. At the time of analysis, only 11 weeks of information, until mid-May, were available. By performing a similar forecasting study using the first 8 to 11 weeks of data from all available years, indications suggest the onset of the next El Niño event. This is represented by the purple curve in Figure 6, displaying abnormally high sea surface temperatures around March and April. Similar to the 1998 El Niño event, it is already detected from the initial stages. The black dots indicate the specific points in time when El Niño events were detected solely using past information.

EPXMA spectra
The following datasets comprise electron probe X-ray microanalysis (EPXMA) spectra, specifically spectra of various glass vessel types. These data were obtained during an analysis conducted in the late 1990s at the University of Antwerp, with the objective of studying the production and trade relationships among different glass manufacturers [Janssens et al., 1998]. The glass vessels, originating from the 16th and 17th centuries, were subjected to archaeological analysis. A total of 180 observations were recorded at 1920 wavelengths. Our focus is on the first 750 time points, as they encapsulate the primary variability within the data. The data comprise four distinct types of glass vessels. The primary group, referred to as the "sodic" group, encompasses 145 observations. The remaining three groups are denoted as "potassic," "potasso-calcic," and "calcic," consisting of 15, 10, and 10 measurements, respectively. Further details regarding the different glass types can be found in Lemberge et al. [2000]. The corresponding curves are depicted in the left plot of Figure 7. Notably, the potassic and calcic observations exhibit deviant shapes from the majority at approximately wavelength 200, in addition to having higher amplitudes within the 300 to 350 wavelength range. On the other hand, the potasso-calcic group does not exhibit apparent outlying characteristics.
For this analysis, we consider the sodic group as the reference or regular observations, with the aim of identifying the remaining curves as outliers. The optimization parameters follow the same selection process as before. The subset size h is set to ⌊0.5n⌋, and the regularization parameter α is chosen through automated selection, resulting in α = 2.
In the right plot of Figure 7, the resulting α-Mahalanobis distances are displayed, with colors representing the grouping structure of the data. The vertical black line indicates the theoretical cutoff from Corollary 5.1. Based on this analysis, the second, third, and fourth groups are identified as outliers. Additionally, a few curves from the majority group are also detected as outlying. Further investigation revealed that some of the spectra within the sodic glass vessels were recorded under different experimental conditions. This discrepancy may account for the atypical curves observed within the sodic group.

Summary and conclusions
In this paper, we presented the minimum regularized covariance trace (MRCT) estimator, a novel method for robust covariance estimation and functional outlier detection. To handle outlying observations, we adopted a subset-based approach that favors the subsets that are the most central w.r.t. the corresponding covariance, where the centrality of a point is measured by the α-Mahalanobis distance (4.1). The approach results in a fast-MCD [Rousseeuw and Driessen, 1999] type algorithm, in which the notion of the standard Mahalanobis distance is replaced by the α-Mahalanobis distance. Consequently, a robust estimator of the α-Mahalanobis distance is obtained, further allowing for robust outlier detection. The method automates outlier detection by providing theoretical cutoff values based on additional distributional assumptions. An additional advantage of our approach is its ability to handle data sets with a high number of observed time points (p ≫ n) without requiring preprocessing or dimension reduction techniques such as smoothing. This is a consequence of the fact that $X^{\alpha}_{st}$ is a smooth approximation (obtained via Tikhonov regularization) of the solution Y of the ill-posed linear problem $C^{1/2} Y = X$, where the amount of smoothing is determined by a (univariate) regularization parameter α > 0.
Therefore, a certain amount of smoothing is in fact done within the procedure itself. Additionally, the typically challenging task of choosing the Tikhonov regularization parameter α is automated. We demonstrated the importance of selecting an appropriate value for α, which determines the optimal subset, striking a balance between noise exclusion and preservation of signal components. Moreover, we provided a straightforward adaptation of the method to sparsely observed data, where the approach is to represent the data in a fixed basis and directly work with the finite matrix of basis coefficients, as opposed to the smoothed functions, further minimizing the approximation error. Overall, the MRCT estimator offers distinct advantages in terms of robust covariance estimation and automated outlier detection in functional data analysis, as demonstrated in the simulation study in Section 6.
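To make the regularized standardization concrete, the sketch below computes distances of the common Tikhonov form $\langle x - m, (\hat{C} + \alpha I)^{-1}(x - m)\rangle$ via a precomputed eigendecomposition; the exact scaling of the paper's α-Mahalanobis distance (4.1) may differ. Precomputing the eigendecomposition makes each subsequent distance an $O(p^2)$ matrix-vector operation, matching the complexity discussion in Section 6:

```python
import numpy as np

def alpha_mahalanobis(X, alpha=1.0):
    """Regularized Mahalanobis distances via an eigendecomposition.

    Assumed form: d(x)^2 = <x - m, (C + alpha*I)^{-1} (x - m)>, a
    generic Tikhonov-regularized distance, not necessarily the paper's
    exact definition.
    """
    m = X.mean(axis=0)
    Xc = X - m
    C = Xc.T @ Xc / (len(X) - 1)
    lam, U = np.linalg.eigh(C)        # C = U diag(lam) U^T
    Z = Xc @ U                        # coordinates in the eigenbasis
    return np.sum(Z**2 / (lam + alpha), axis=1)

rng = np.random.default_rng(0)
d = alpha_mahalanobis(rng.standard_normal((50, 20)), alpha=0.5)
print(d.shape)  # (50,)
```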
As part of future research, we will study the behavior and efficacy of different variants of Tikhonov regularization. While our paper focused on the "standard form" with a simple smoothness penalty of $\alpha\|Y\|^2$, future investigations could encompass a broader scope by incorporating a derivative operator, i.e. $\alpha\|LY\|^2$, where L represents a suitable operator; see e.g. Golub et al. [1999]. Another avenue for exploration lies in extending the approach to also accommodate the case of multivariate functional data. Figure 8 shows the impact of α on the eigenvalues of $C(X^{\alpha}_{st})$, for a range of eigenvalues of C. In particular, it illustrates that for small α, $\lambda^{\alpha}_{i,st} \approx 1$ if $\lambda_i$ is reasonably large. On the other hand, if α is too large, in the case of large and diverse $\lambda_i$, the corresponding values of $\lambda^{\alpha}_{i,st}$ might be too small and not "close enough to each other", thus diminishing some of the signal components. Similarly, in the case of very small eigenvalues, too small values of α might make small, "irrelevant" eigenvalues large and diverse, thus magnifying noise.
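Under the assumption that the standardized eigenvalues take the explicit Tikhonov form $\lambda^2/(\lambda + \alpha)^2$ (an inference consistent with the limiting behavior described above, not a formula quoted from the paper), the shrinkage effect can be illustrated numerically:

```python
import numpy as np

def lam_st(lam, alpha):
    # Assumed standardized eigenvalue under the Tikhonov solution
    # Y_a = (C + a*I)^{-1} C^{1/2} X: lam_st = lam^2 / (lam + a)^2.
    return lam**2 / (lam + alpha) ** 2

lam = np.array([1e-4, 0.1, 1.0, 10.0])
print(np.round(lam_st(lam, 0.01), 3))   # small alpha: large lam -> near 1
print(np.round(lam_st(lam, 10.0), 4))   # large alpha shrinks everything
```

For small α, all reasonably large eigenvalues are pushed close to 1; for large α, even dominant signal components are shrunk toward 0, matching the trade-off described in the text.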

A Additional simulation results
In Section 6, a simulation study is performed to evaluate the sensitivity of Algorithm 1 to the initial approximation. The simulation study utilized the regularization parameter determined by Algorithm 2. The results reveal that selecting an appropriate value of α can significantly reduce this sensitivity. Consequently, it is observed that the choice of $H_{opt} \in \mathcal{X}$ has minimal influence on the final outcomes, including the covariance estimate and outlier detection. To quantify this, the first row of Table 1 presents the average overlap between the subsets obtained using 100 different initial approximations and the chosen optimal subset $H_{opt}$. The overlap $O_1$ is calculated as …

Let us first show that $U_h$ is a convex set. Let $w_1, w_2 \in U_h$ and $\lambda \in (0, 1)$, and define $w = \lambda w_1 + (1 - \lambda) w_2$. Then …, and order the obtained quantities as …. Then observe that if we were to optimize g w.r.t. w, keeping $w_0$ as a fixed parameter, the result of the given optimization problem is given explicitly by $w_i = 1$ if $i \in \{i_1, \ldots, i_h\}$, and 0 otherwise. Such a w uniquely defines a corresponding set of observations $H = \{i_1, \ldots, i_h\}$. Define further $f_w : U_h \to U_h$ with $f_w(w_0) = \arg\min_{w \in U_h} g(w; w_0)$. Then any fixed point $w_0$ of $f_w$, by the same reasoning as above, uniquely defines a corresponding set of observations $H_0$. Hence, if we show that $f_w$ has a fixed point, then that fixed point uniquely defines a fixed point of f. Now, since g is a continuous function (as a composition of continuous functions) on its domain, Berge's maximum theorem [Berge, 1963] implies that $f_w$ is continuous as well. Finally, as $f_w$ is a continuous mapping from the compact and convex set $U_h$ into itself (not necessarily surjective), Brouwer's fixed-point theorem [Brouwer, 1911] implies that $f_w$ has at least one fixed point $w_0 \in U_h$ that, for the reasons stated above, defines a fixed point $H_0$ of f.

Proof of Proposition 5.1. … Berrendero et al.
[2020]). The rest of the proof follows the one of Proposition 5.1.

Figure 2: Samples from the three models considered in the simulation study. Grey observations represent the main processes, while the black-colored functions indicate the outliers. The contamination rate is c = 0.1, the sample size n = 200, and p = 100 time points.
Figure 5 depicts the average runtime of Algorithm 1 (in seconds) as a function of the number of observed time points p = 50, 100, 150, 200, 400, 750, for sample sizes n = 100, 200, 500. Dotted lines correspond to the cubic function fitted to the average runtime as a function of p.

Figure 3: Left to right: boxplots of the true positive rate, true negative rate, and F-score for the different methods over the three models; n = 200, p = 100, 500, and c = 0.2.

Figure 6: SST of the Niño 1+2 region in the Pacific Ocean. The dataset spans from 1982 to the present. In the left plot, regular observations are depicted in grey, while outliers are represented by colored markers. The right plot showcases the outliers of the smoothed SST data, determined through forecasting based on the available historical information. The black dots indicate the specific points in time when El Niño events were detected solely using past information.

Figure 7: Left: spectra of different glass vessel types over several wavelengths. Right: α-Mahalanobis distance of non-outliers on a logarithmic scale. The horizontal solid line indicates the theoretical cut-off value.

Figure 8: Contour plot of the eigenvalues $\lambda^{\alpha}_{st}$ of $C(X^{\alpha}_{st})$, for different combinations of the smoothing parameter α > 0 and eigenvalues λ > 0 of C.

where H represents the obtained output $H = H(H_0) \in \mathcal{X}$ of Algorithm 1 for a randomly initiated $H_0$. The results demonstrate that the overlap is close to 1, indicating a high level of consistency between the obtained subsets and $H_{opt}$. Additionally, the second row of Table 1 gives the overall overlap between all obtained subsets, where for 100 randomly initiated $H^i_0$, $H^i = H^i(H^i_0) \in \mathcal{X}$, $i = 1, \ldots, 100$, the overall overlap is calculated as $O_2 := $ …

Table 1: Overlap ratio between the optimal subset $H_{opt}$ and all obtained subsets ($O_1$), as well as the ratio of intersection between all subsets ($O_2$). Mean values were calculated over 100 iterations on 100 random initializations.

Proof of Lemma 5.1. Let us first define the set $U_h = \{(w_1, \ldots, w_n) \in [0, 1]^n : \sum_{i=1}^{n} w_i = h\}$ and the function g