Spatially smoothed robust covariance estimation for local outlier detection

Most multivariate outlier detection procedures ignore the spatial dependency of observations, which is present in many real data sets from various application areas. This paper introduces a new outlier detection method that accounts for a (continuously) varying covariance structure, depending on the spatial neighborhood of the observations. The underlying estimator thus constitutes a compromise between a unified global covariance estimation, and local covariances estimated for individual neighborhoods. Theoretical properties of the estimator are presented, in particular related to robustness properties, and an efficient algorithm for its computation is introduced. The performance of the method is evaluated and compared based on simulated data and for a data set recorded from Austrian weather stations.


Introduction
The identification of multivariate outliers is probably one of the most important tasks in multivariate data analysis.A need to find outliers in order to make further analyses more reliable, or the direct interest in the outliers themselves motivate the numerous approaches available for multivariate outlier detection.The identified outliers are supposed to deviate to a certain extent from the main trend or structure of the data majority, and thus they are also called "global outliers" (Filzmoser et al., 2013).In contrast, the term "local outliers" refers to a setting where additional information regarding some kind of neighborhood is available, for example provided by spatial coordinates of the observations.Then, local outliers are observations which clearly differ from the multivariate measurements of their spatial neighbors indicating local anomalies that spark interest and make further analysis essential.Nevertheless, the values themselves might still be in an ordinary range of the data set, and thus the observation would not be outlying in a global sense.
Existing statistical approaches for multivariate local outlier detection are often based on a distance measure and neighborhood structure.A neighborhood a is defined as a subset of the set of observation indexes, say {1, . . ., n}.A p-variate observation x i , for i ∈ {1, . . ., n}, is defined to be in neighborhood a if and only if i ∈ a.The decision if some observation x is in a neighborhood is typically based on its spatial coordinates s(x).One way to construct the spatial neighborhood is to take a spatial k-nearest-neighborhood of each point x, where k ∈ N.For a fixed x, the spatial distance to another point y is defined as the Euclidean distance d x (y) = s(x) − s(y) = (s(x) − s(y)) (s(x) − s(y)) 1/2 .
A spatial neighborhood a k (x) can then be defined as the set of the k many spatial nearest observations, a k (x) = {x j : d x (x j ) ≤ d x(k) }, (1) where d x(1) ≤ d x(2) ≤ . . .≤ d x(k) ≤ d x(n) are the sorted distances to all observations x i , i = 1, . . ., n. Regarding local outlier detection, a distance measure often used to evaluate outlyingness is the pairwise Mahalanobis distance (MD).For a neighborhood a with covariance Σ a and an observation x i in neighborhood a, the pairwise MD is defined as for all j ∈ a.

arXiv:2305.05371v1 [stat.ME] 9 May 2023
In general, the MD describes the distance between two observations, where the Euclidean distance in the feature space is adapted according to local distribution properties.
The estimation method used to determine Σ a for all neighborhoods is key to good and reliable results.It is essential that outlying observations themselves are not affecting the estimation, since this could possibly mask outliers, leaving them undetected.Thus, a robust covariance estimation on the neighborhood level is necessary.One of the most widely used robust estimators for the covariance is the Minimum Covariance Determinant (MCD) estimator (Rousseeuw, 1984(Rousseeuw, , 1985)), where one has to identify the h sub-sample of observations (where h is fixed e.g. to half of the observations) that minimize the determinant of its sample covariance.The MCD covariance estimator is then given by the sample covariance of the h subset, multiplied by a consistency factor (Croux and Haesbroeck, 1999).For its computation, the fastMCD algorithm developed by Rousseeuw and Driessen (1999) introduces an iterative concentration step, so-called C-step, that guarantees a decrease of the objective function until convergence to a (local) minimum, making the MCD estimator faster and even more popular.The global minimum is then approximated by iterating for a number of random starting values and choosing the smallest local minimum.By selecting a small number of good deterministic starting values for the fastMCD, the detMCD algorithm from Hubert et al. (2012) improves run-time even more.In spite of these excellent features of the MCD estimator, as well as affine equivariance and high robustness, one drawback is that the concept is not applicable in case of singularity of the sample covariance matrix of the h subset, which can easily occur.As for many methods, regularity of the estimated covariance is also needed to compute Mahalanobis distances.Especially in a setting where we are restricted to local neighborhoods consisting of a possibly small subset of observations, we might have a situation where regularity cannot be achieved and thus an inversion of the local covariance matrix is not possible.One solution is to base the local estimation on regularized robust covariance estimators, such as the recently developed Minimum Regularized Covariance Determinant (MRCD) estimator from Boudt et al. (2020) (or also on the Fritsch estimator, Fritsch et al. (2012)).One of the many attractive properties of the MRCD is that a slightly adapted fastMCD algorithm based on C-steps is also applicable.
Existing methods for local outlier detection have different ways to define the covariance matrix in order to ensure regularity.The method of Filzmoser et al. (2013) is dealing with regularity issues by using the MCD estimator calculated on the whole data set, i.e., Σ a = Σ, thus imposing a global covariance structure.They have shown that for i.i.d.Gaussian random vectors x 1 , . . ., x n with mean µ and covariance matrix Σ, the conditional distribution of the pairwise squared MD Σ (x i , x j ), for j = 1, . . ., n, given x i , is a non-central chi-square distribution with p degrees of freedom and non-centrality parameter MD 2 Σ (x i , µ).Instead of a fixed cut-off value for the pairwise MD, different sophisticated visual approaches are used.By plotting a degree of isolation based on the pairwise MD and quantiles of the non-central chi-square distribution, suspicious and highly isolated observations can be discovered and analysed in more detail.Quite contrary to Filzmoser et al. (2013), the method of Ernst and Haesbroeck (2016) uses a very local covariance estimation by taking individual k-nearest-neighbor (kNN) neighborhoods for each point separately into account.To tackle the regularity issues, they use a regularized covariance estimation (originally the estimator from Fritsch et al. (2012), for the MRCD see also the adaptation made in Bellino et al. (2019)) for each individual kNN neighborhood.Additionally, they introduce the concept of the "next distance", which is also MD based, and use the upper fence of the adjusted boxplot of Hubert and Vandervieren (2008) of all next distances as a non-parametric cut-off value for detecting outliers.
Since it is not necessary to use a MD concept to find local outliers, a short detour to machine learning techniques might be interesting.One of the most prominent approaches for detecting multivariate local outliers in machine learning is the local outlier factor (LOF) developed by Breunig et al. (2000).Initially, locality refers to multivariate values and Euclidean distances in the feature space but this method can also be canonically adapted to spatial local outlier detection.In Schubert et al. (2012) this adaptation and further LOF-based approaches are discussed.Interestingly, also LOF and its variants are based on a concept of distance and neighborhoods.
The existing methods have shortcomings in various ways that have not yet been properly addressed.The rather global nature of the method of Filzmoser et al. (2013) leads to a reliable and robust estimation of the covariance.Nevertheless, it is somewhat questionable if the estimated covariance is applicable and representative for the covariance structure on the local level.Trying to solve this issue of missing locality in the estimation, Ernst and Haesbroeck (2016) resort to a very local approach by recalculating the covariance matrix for each observation separately.Although more locality is achieved, the method is not taking into account that covariance matrices are not likely to change abruptly from one neighborhood to a next one.Also, the number of estimated parameters is extremely high and based on rather few observations, even if the local covariance structure is abruptly changing.A more global estimation of the local covariance matrices might be more stable and reliable and might also avoid repetitive calculations.It seems that until now there are only two extremes regarding locality of the covariance estimation available.
We bridge the gap between the fully global and the fully local approach by providing a covariance estimator based on the MRCD that addresses the missing locality on the one hand and the missing spatial smoothness on the other.By providing the possibility to set the amount of spatial smoothing and the size of the neighborhoods we get a generalization of the two detection methods, with the goal that good outlier detection properties based on the new local covariance estimations are achieved.Moreover, the covariance estimate can be seen as a generalization of the MRCD when the data set has additional sub-structures.
The paper is organized as follows.In Section 2 we introduce the new covariance estimator, derive its properties as well as properties of the original MRCD and establish the methodology to detect local outliers.An algorithm and the derivation of a generalized C-step are discussed in Section 3. Section 4 provides simulation results regarding run time, convergence and outlier detection, while in section 5, a real data set is analyzed using the newly developed local outlier detection method.Finally, the main results are presented and summarized in the conclusions.

Spatially smoothed MRCD estimators
Assume that the p-dimensional observations x i = (x i1 , . . ., x ip ) , for i = 1, . . ., n, are arranged as rows in the n × p matrix X with n > 2p.Furthermore, each observation has spatial information available, e.g.spatial coordinates, and is assigned to one of N many neighborhoods, defined by the index sets a 1 , . . ., a N of size n 1 , . . ., n N .
The goal of the proposed method is to obtain local covariance estimates for each neighborhood that are suitable for calculating the pairwise MDs and to some extent smooth among nearby neighborhoods.Since the MCD estimator requires at least n > 2p to provide a regular covariance matrix, which is a severe limitation especially for small neighborhoods, we focus on its regularized extension, the MRCD estimator.Instead of minimizing the determinant of the sample covariance matrix as in the MCD case, the minimization objective is the determinant of a convex combination of the sample covariance and a symmetric and positive definite matrix, the so-called target matrix T .Boudt et al. (2020) suggest a data-driven approach based on the condition number of the covariance matrix to set the degree of regularization ρ which is used in the convex combination.Regarding the target matrix T , it is sensible to choose a robust and regular covariance, e.g. a diagonal matrix based on univariate robust scale estimates.
In the following, we adapt the idea of the MRCD estimator to our setting of local and smooth covariance estimation.Let H = (H 1 , . . ., H N ) be subsets of the index sets a 1 , . . ., a N defining the neighborhoods.The size of each subset H i is h i = |H i | = αn i , for i = 1, . . ., N , where α is selected in the interval [0.5, 1], and • is the ceiling function, rounding up to the next integer.A smaller value of α will result in more robustness against outliers, and it would also be possible to adjust this value to each neighborhood individually.The observations of subset H i are written as matrix X Hi , with dimensionality h i × p.Let the neighborhood specific MRCD-based covariance matrix K i (H), for i = 1, . . ., N , be defined as with Cov(Y ) being the sample covariance matrix of Y , and c α a consistency factor for the proportion α (see Croux and Haesbroeck, 1999).The regularization parameter ρ i is set individually for each neighborhood, and it could also be chosen as zero if the estimated covariance matrix is already invertible.Finally, since we want to smooth the covariance matrices, it seems counter intuitive to choose neighborhood specific target matrices, which would also require more parameter estimations.Therefore, we assume a global target matrix T , taken as a robust and regular covariance matrix estimated based on the full data set X. Since we assume n > 2p we propose to use the MCD estimator for X as target matrix.
We want to find the combination of subsets in H that minimizes the objective function The tuning parameter λ ∈ [0, 1] is used to balance the influence of an individual local neighborhood and the remaining neighborhoods in the covariance estimations.In case of λ = 0, there is no spatial influence at all which is equivalent to the estimation of the MRCD for each neighborhood separately while using a global target matrix.For the other extreme λ = 1, the covariance matrix in a specific neighborhood is an average over the surrounding covariance estimates without adding local information from the neighborhood itself.Moreover, it is possible to interpret the second part in the determinant as a penalization term.Due to the minimization of the determinant, observations from a i that match well with the main trend of observations in neighborhoods with positive weights ω ij are more likely to be in the optimal H-set if λ is increased.The weights ω ij are supposed to be non-negative, and we set ω ii = 0.The elements of the weight vector ω i = (ω i1 , . . ., ω iN ) indicate the relative influence that the estimated covariances of other neighborhoods have on the covariance estimation of the i-th neighborhood.Also, each weight vector has to sum up to one, N j=1 ω ij = 1 for all i = 1, . . ., N .All these weights need to be pre-specified, for example based on inverse geographical distances, and are collected as rows in the weighting matrix W ∈ R N ×N .
Note that for the objective function (3) a global minimum H * = (H * i ) i=1,...,N exists since its domain consists of a finite number of subset combinations.For this global minimum, the estimated covariance matrix for each neighborhood a i is and the location estimate μSSM,i is the sample mean of the selected observations X H * i .We call these estimators the spatially smoothed MRCD (ssMRCD) location and covariance estimators.
Although the neighborhood structure and the value of λ will often depend on the data at hand, there are some sensible and natural choices for W .If we have a neighborhood structure that has no further meaning and might just be used to divide the spatial space into subsets, an inverse-distance based weight matrix, also used in Sections 4 and 5, might be a good choice.Other possibilities include binary matrices with ones if neighborhoods share a border and zero otherwise, with rows scaled appropriately.Moreover, the regularity parameters can be set by default.For a neighborhood a i we suggest to set the regularization parameter ρ i as the data driven value that is proposed by the MRCD algorithm in Boudt et al. (2020), when interpreting the neighborhood as its own data set.

Theoretical properties
In the following we will show that the spatially smoothed MRCD estimators proposed here are -in contrast to the original MRCD estimator -affine equivariant, and we derive its breakdown point.For a data set X ∈ R n×p , a location and covariance estimator are called affine equivariant for all neighborhoods if for any non-singular matrix A ∈ R p×p , any vector b ∈ R p , and all i = 1, 2, . . ., N , it holds that μSSM,i (XA + Theorem 1 (Affine equivariance).Let T be any robust, regular and affine equivariant estimate of the covariance for the data set X, here denoted as T (X).Then, the spatially smoothed MRCD estimators of location and covariance with target matrix T (X) are affine equivariant.
Proof.The proof is given in the appendix.
The theorem assumes that the estimator T (X) is regular, robust and affine equivariant which are properties that can be achieved by the MCD for the full data set X for n > 2p.However, this is a non-trivial problem in cases where we do not have enough observations or where additional regularization would be needed.Since this is the situation where the MRCD should provide a remedy, the target matrix for the original MRCD will most likely not fulfill the assumptions stated in the theorem.Thus, the MRCD is in general not affine equivariant.
Another important property of robust estimators is the finite sample breakdown point, which is defined as the minimal fraction of points that need to be exchanged in order to make the estimators useless.Before considering the spatially smoothed MRCD we have to derive the breakdown point of the original MRCD without prior scaling of the observations, from now on called raw MRCD.The breakdown point of a location estimator μn is formally defined as where X n,m is the data matrix X n with m-many observations exchanged with arbitrary values (Maronna et al., 2006).
For the covariance estimate Σn the finite sample breakdown point is defined as * n ( Σn ; with λ 1 (Σ), . . ., λ p (Σ) denoting the eigenvalues of a matrix Σ in decreasing order.Since the eigenvalues are sorted, we only have to consider the biggest eigenvalue λ 1 ( Σn (X n,m ))) which might explode when exchanging observations with arbitrary values (explosion breakdown point) and the smallest eigenvalue λ p ( Σn (X n,m ))) which might become zero (implosion breakdown point) and thus implies singularity (Maronna et al., 2006).
Theorem 2. Consider the raw MRCD estimator with fixed ρ > 0, regular and fixed T = QΛQ and the data matrix X n .Then, the following statements hold: a.The MRCD location estimator μn has the finite sample breakdown point min(h, n − h + 1)/n.
b.The MRCD covariance estimator Σn has the finite sample explosion breakdown point (n − h + 1)/n.
c.The MRCD covariance estimator Σn has the finite sample implosion breakdown point 1.
Proof.The proof is given in the appendix.
Regarding the finite sample breakdown point of location and covariance estimators with multiple estimators let us define the finite sample breakdown point * n as the minimal fraction of points of the same arbitrary neighborhood that need to be exchanged in order to make at least one of the estimators useless.For the location estimators μSSM,n,i , i = 1, . . ., N , the formal definition is * n ( μSSM,n,i where X i n,m is the matrix X n with m-many observations of neighborhood a i exchanged with arbitrary values.
Theorem 3. The location estimators μSSM,n,i N i=1 of the spatially smoothed MRCD have the finite sample breakdown point * n ( μSSM,n,i Proof.The proof is given in the appendix.
For the covariance estimates ΣSSM,n,i , i = 1, . . ., N , the finite sample breakdown point is defined accordingly, * n ( ΣSSM,n,i Again, we can differentiate between explosion and implosion breakdown point.Theorem 4. Given a fixed and regular target matrix T , the finite sample implosion breakdown point of the spatially smoothed MRCD covariance estimators ΣSSM,n,i

The finite sample explosion breakdown point is
Proof.The proof is given in the appendix.
Note that for all the breakdown properties of the original and the spatially smoothed MRCD, the target matrix T is assumed to be regular and fixed.In applications the target matrix would be some covariance estimator T (X).Thus, it should be kept in mind that the implosion and explosion breakdown point of T (.) might also be relevant.

Local outlier detection
The final step for detecting outliers is to decide how the spatially smoothed covariances available for neighborhoods a i , i = 1, . . ., N , will be linked to the pairwise MD.
The method is based on Ernst and Haesbroeck (2016).In order to compare each observation x with its local neighbors we need a second neighborhood structure that provides spatially close neighbors in contrast to the structural neighborhoods a i that are used for the smoothed covariance estimation.Thus, we select some k ∈ N and calculate the spatial k nearest neighbors, b k (x), see Equation ( 1), where a typical value might be k = 10.However, when applying the method, the spatial structure of the data set should also be considered.
For each observation x ∈ a i , the next distance is defined as , which is the minimum of all pairwise MDs based on the covariance matrix ΣSSM,i with all observations in the spatial k nearest neighborhood b k (x).The neighborhood b k (x) is not necessarily a subset of a i .However, due to the spatial smoothing of the covariance matrices and the spatial correlation of regular observations, an observation y spatially close to x should be similar enough to not be classified as outlier even if the covariance matrix of another but close neighborhood a i is used.In the case of a strong difference between x and y, the observation y would still be classified as outlying.
The next distance d(x) is used as a measure of outlyingness.If the next distance is high, none of the observations in the spatial k nearest neighborhood is similar to the observation x, which means that x would be flagged as a local outlier.As a notion what values for the next distance are considered as high, a non-parametric cut-off value can be used based on the upper fence of the adjusted boxplot (Hubert and Vandervieren, 2008) using all next distances from all neighborhoods a i , i = 1, . . ., N .Observations above the cut-off value are considered to be local outliers.
Possible further extensions like the restriction to homogeneous neighborhoods as suggested in Ernst and Haesbroeck (2016) are not included but are interest of future research.

Algorithm and C-step
For computing the spatially smoothed MRCD location and covariance estimators we need to minimize the objective function (3).However, since the number of possible combinations of subsets is comparable with the MCD or the MRCD, it is in general not feasible to just calculate the value of the objective function for all these combinations and select the best one.Instead, the strategy of C-steps (concentration steps) introduced for the MCD estimator by Rousseeuw and Driessen (1999) will be adapted to this problem setting.Given an index set H 1 corresponding to h observations of the data matrix X, the C-step chooses the subsequent subset H 2 where the h observations with the smallest Mahalanobis distances, based on the arithmetic mean and sample covariance matrix of the observations from H 1 , are taken.Rousseeuw and Driessen (1999) have shown that this procedure converges to a local minimum.The C-step idea has also been extended for the MRCD (Boudt et al., 2020), and we will now adapt the generalized C-step to our setting.Theorem 5.For each j = 1, . . ., N , let ρ j ∈ (0, 1) and H 0 j be any starting subset of a j of respective size h j ∈ (n j /2, n j ).Let α j be the corresponding fraction of the observations used, α j = h j /n j .Let H 0 = (H 0 1 , . . ., H 0 N ) be the combination of the subsets and λ ∈ [0, 1) fixed.The target matrix T (X) is assumed to be positive definite, symmetric and fixed, and K j (H 0 ) is defined as in Equation ( 2) with T = T (X).Calculate the sample mean for each neighborhood a j , j = 1, . . ., N , based on the respective subset, m j (H Fix neighborhood a i .For x ∈ a i , let the MD-based measure with the subset given by H 0 be defined as For a new subset H 1 i of a i of size h i with Proof.The proof is given in the appendix.
The C-step theorem states that the objective function will decrease with every step as long as the other covariance estimators stay fixed.In the implemented algorithm described below, this will in general not be the case.However, the theorem and its proof should be sufficient to motivate and provide a reason for the algorithm proposed in the following.
The algorithm makes use of the C-step to solve the minimization problem based on Boudt et al. (2020).Suppose that we can estimate the target matrix T = T (X) by the affine equivariant MCD estimator, then we also obtain affine equivariance for the spatially smoothed MRCD.Thus, we can ignore the scaling step in Boudt et al. (2020) and reduce the number of parameter estimations.Using an eigen-decomposition of T = QΛQ , with Q containing the eigenvectors as columns, and Λ being the diagonal matrix of positive eigenvalues, we transform the observations, for i = 1, . . ., n.Thus, in the next steps we can use Z = (z 1 , . . ., z n ) as data matrix and T = I p as target matrix, which numerically simplifies the data-driven selection procedure for ρ j .
In order to start the iteration process by making use of the C-steps, we need starting values, which should be robust and regular covariance estimates for each neighborhood.As proposed for the original MRCD estimator, we will also use the deterministic MCD algorithm of Hubert et al. (2012) for each neighborhood separately.This approach results in six estimates for location and scatter for each neighborhood, which show especially good convergence properties.Furthermore, for each neighborhood a j , the value of ρ j is calculated using the data-driven selection procedure based on a maximal condition number according to steps 3.2 to 3.4 from Boudt et al. (2020).The set of six deterministic starting values for each neighborhood is restricted to those with a sufficiently small condition number (for more details see step 3.5.in Boudt et al., 2020).
One new issue that arises is the number of possible combinations of initial subsets: for N neighborhoods we end up with up to 6 N subset combinations as possible starting values.Since a high number of starting values might not be essential for a good approximation, we will consider an upper limit of 6N starting values in the following, and they will be selected at random out of the possible combinations that are left after the ρ-selection step.
Suppose now that we start the procedure with an initial subset H 0 = (H 0 1 , . . ., H 0 N ), then we apply the C-step for each neighborhood a i and obtain a new subset H i 1 .The combination of these subsets H 1 = (H 1 1 , . . ., H 1 N ) is used as starting point for the next iteration step, etc.After there is no change in the subsets, the iteration process stops (see Figure 1).After applying the C-step iterations for all starting values, we choose the subset combination with the smallest objective function value as the final subset combination H * = (H * 1 , . . ., H * N ).Although Theorem 5 is not proving that the objective function decreases with every step due to the additional covariance matrices being adapted separately for each neighborhood after each iteration step, simulation results show that the algorithm provides stable results and good monotonic behavior in most cases (see Section 4).
After receiving the final combination of subsets H * for each neighborhood, the matrices K Z i are back-transformed to The covariance estimate for neighborhood a i is then and the location estimate is the arithmetic mean of the optimal subset X H * i .The detailed numerical procedure is summarized as pseudo-code in Algorithm 1.
Regarding the tuning parameter λ there is no standard procedure to get a good value that is similarly automated like the calculation procedure for the regularization parameters ρ i .However, there are multiple possibilities demonstrated in Section 5 that can be used to choose λ in applications.
Starting values H 0 i ∀i = 1, . . ., N given Step 0 : Step 1 : Figure 1: Matrices used in the C-step after each iteration step.H j i are the selected subsets of neighborhood a j in step i, and K j i the corresponding regular covariance matrices.
Algorithm 1: Algorithm for the spatially smoothed MRCD estimator.
Step 1.1: Calculation of target matrix T using the MCD estimator on X; Step 1.2: Eigen-decomposition of T = QΛQ and transform observations according to Equation (6); Step 2: Initialization step according to steps 3.1 to 3.5 (without C-step) from Boudt et al. (2020) for each neighborhood; Get 6 initial deterministic sets of h i observations from a i according to Hubert et al. (2012); Calculate 6 initial covariance matrices and mean estimates; Select neighborhood specific ρ i via data driven approach; Filter subset of initial starting estimates according to condition number (step 3.5 from Boudt et al., 2020); end Select set of initial h-set combinations as starting values at random; Step 3.1: C-step: For each initial combination of subsets iterate until convergence; Step 3.2: Select best combination of subset based on objective function value; Step 4: Calculate K i (H * ) ∀i and use Equations ( 7) and ( 8) to get final estimates;

Numerical simulations
In order to test the new method, two simulation scenarios are set up which also incorporate the neighborhood structures necessary for the covariance estimation and outlier detection.For both setups we simulate covariance matrices which depend on a parameter δ, denoted by Σ (δ), where the entries are defined as (Σ (δ)) jk = δ (j−k) , for j, k ∈ {1, . . ., p}, leading to positive definiteness and symmetry.
Setup 1: The first setup is inspired by the original idea of covariance matrices smoothly transforming over space.We start by setting up the (two-dimensional) coordinates (s 1 i , s 2 i ) of the observations x i with n sim = 41 observations per coordinate axis, evenly spread between 0 and 20, resulting in n = n 2 sim = 1681 data points in total.For N sim many areas a lm , l, m ∈ {1, . .

√
N sim , leading the an evenly spaced grid.Thus, the area a lm consists of observations {x i : For either l or m equal to 1, the left edge of the interval (which would be zero) is included.The coordinate centers of the area a lm are defined as The observed values of observations in area a lm are then randomly drawn from a p-dimensional normal distribution N (µ lm , Σ (δ lm )), where µ lm := ((c , thus having entry values between 0 and 20.For the covariance matrix Σ (δ lm ) for areas a lm we use the structure described above with parameter δ lm defined as increasing smoothly from the left bottom to the right upper corner.A simulated data set with p = 2 is presented in Figure 2 (left) where the grid structure and the change of the mean are clearly visible.The resulting δ lm for each area is shown as well.Setup 2: To get a more flexible simulation setup, a random field, specifically the parsimonious multivariate Matérn model (see Gneiting et al., 2010;Ernst and Haesbroeck, 2016), is used as suggested in Ernst and Haesbroeck (2016) and Harris et al. (2013).Instead of the constructed matrices used in Ernst and Haesbroeck (2016) and Harris et al. (2013) we again choose a matrix structure Σ (δ) as described above with δ = 0.7, since it can be extended for higher dimensions.
For δ being set to 0.7 we get a range of high to low correlations reflecting different relationships between variables.
The spatial smoothness is assumed to be the same for all variables, and it is regulated by one smoothness parameter ν, which is taking values in {0.5, 1.5, 3}.A higher ν leads to more spatial smoothness and in general more distinct outliers after contamination.The spatial scale parameter a of the Matérn model is set to one.A grid structure for the coordinates of the observations with values ranging from 0 to 20 with grid size 0.5 is imposed, leading to 41 2 = 1681 observations overall, similar to the standard setting in setup 1.
Lastly, contamination with outliers is achieved by swapping coordinates of observations with each other.Filzmoser et al. (2013) exchange observations that are completely randomly chosen, whereas Ernst and Haesbroeck (2016) propose to swap the most extreme observations regarding the first score of the global robust principal components.In order to avoid the problem of exchanging whole areas of observations with each other due to high spatial correlation, once observations are swapped, their 15 closest neighbors are removed from the swapping process.This leads to a clear distinction of outlying observations without the possibility of other outliers being close (see also Figure 2).Thus, swapping according to Ernst and Haesbroeck (2016) should in general result in a better performance for all considered methods.Both swapping approaches in both setups will be analyzed with a varying contamination level β between 1% and 10%.
For the ssMRCD covariance estimation we impose a grid based neighborhood structure.Similar to the description of setup 1 we use evenly spaced borders and assign the observations to neighborhoods n i , for i = 1, . . ., N .Thus, the case N = N sim in setup 1 depicts a perfect match of the neighborhoods selected for the ssMRCD and the real underlying covariance structure.For N sim > N the ssMRCD uses less neighborhoods leading to more smoothness of the covariance estimation.The weighting matrix W is based on the inverse (Euclidean) distance of the centers, that are defined equivalently to setup 1.

Covariance estimation
First, we want to analyse the algorithm described and motivated in Section 3 in more detail.Since Theorem 5 is only valid for one varying covariance matrix and not for multiple varying ones, the convergence properties should be further examined.Figure 3 shows the objective function values along the iteration procedure for all starting H-set combinations for different parameter settings for both simulation setups.We choose p = 5 and a 5% contamination rate achieved by completely random swapping.The weighting matrix is based on inverse distances as mentioned in Section 2. Each panel reflects the convergence behavior of the objective function for one simulated data set.
Moving Matrix (λ = 0.Although the behavior differs in general, it is evident that the algorithm has overall very good convergence properties.
A high percentage of monotonically decreasing objective functions can often be achieved and the non-monotonically decreasing paths increase only marginally.The reliability of the convergence results in the simulation might possibly be due to the spatially correlated values which lead to rather small changes in the covariance matrices during the algorithm.Thus, for our simulated data sets, the assumption of fixed covariance matrices seems to be sufficiently met.Since in reality local outlier detection mostly makes sense only for spatially correlated data, the theoretical results from Theorem 5 proof to be even more valuable.
Moreover, the convergence takes place fast which might be caused by the choice of the good starting estimates of the detMCD algorithm (Hubert et al., 2012).Another reason might be that the number of observations that can be used is restricted by the neighborhood assignments to a smaller number than in a covariance estimation for the full data set.Very promising is also the rapid improvement at the very beginning, independent of the simulated data sets.
A second issue that should be discussed is computational efficiency.The iteration process and the increasing number of starting values with increasing N can have quite an impact on runtime.Nevertheless, as long as the number of neighborhoods N is not too big, the outlier detection method based on the ssMRCD is competitive with other local methods (Ernst and Haesbroeck, 2016), especially if p is large (see also Figure 10.For simulation setup 1 with N sim = 25, a 5% contamination rate through completely random switching and 100 repetitions are used and the parameters p, λ, N and n are each varied univariately.The default values for parameters not being varied are p = 5, λ = 0.5, N = 25, and n = 1681.
As depicted in Figure 4, the number of neighborhoods N and the number of observations n have the most influence on runtime with more than a linear increase.The nearly quadratic increase for N is partly due to the number of starting values increasing linearly with N , which could also be reduced to enhance efficiency if necessary.Interestingly, the dimension p has an approximately linear effect on runtime which is overall moderate.The smoothing parameter λ does not significantly change the runtime.

Outlier detection
Before comparing the performance in local outlier detection with other methods, the parameter sensitivity of the ssMRCD is analyzed in more detail.This simulation study should simplify the choices of λ and N in particular for real world data and focus on possible issues connected to suboptimal parameter settings.
For this purpose, setup 2 (random fields) is considered as simulation setting, with parameter ν = 3, and β = 5% completely randomly swapped observations.Special focus is put on the choice of λ and N , but also the effect of dimension p is analyzed.The other parameters are chosen in accordance to possible default settings.The weighting matrix is based on the inverse Euclidean distances of the centers of the neighborhoods a i .Since all considered methods propose k = 10 as a default value, we adhere to this setting for now.Each parameter combination was simulated for 100 different realizations.While Ernst and Haesbroeck (2016) suggest to use Cohen's Kappa as summary statistic of the confusion matrix, we will use the F1-score due to its good interpretability and suitability also for imbalanced classification data.Figure 5 shows the false negative rate, the false positive rate and the resulting F1-score plotted against varying values of λ, N and p, with default values of p = 5, N = 25 (implying n i ≈ 67 on average) and λ = 0.5.These values should reflect a quite general and unspecific parameter setting.For illustration purposes it is more informative to plot the average neighborhood size n i ≈ 1681/N instead of the number of neighborhoods.
The simulation results show that a higher λ decreases the false positive rate, it has marginal reduction effects on the false negative rate until too much smoothing masks real outliers.The overall performance increases moderately in λ, but for λ higher than 0.5 the increase is marginal.Thus, we propose a default value of 0.5 for λ to get the advantage of the decrease in the false positive rate while avoiding the masking effect for higher values.Compared to the influence of λ, the effect of the dimension p is more pronounced.Very small dimensions seem to cover outliers more effectively, probably due to less available information.Interestingly, the size of the neighborhoods seems to be relatively irrelevant, at least in this simulation setting.Only a small masking effect occurs similar to the effects of λ.Too big neighborhoods lead to too much smoothing.Thus, this might imply to choose a strategy of medium sized neighborhoods to increase efficiency in computation and reduce unnecessary regularization.This guidance for the parameter choices might be biased towards this specific simulation setting and not optimal in other settings, but fixing the parameters with at least a sensible value simplifies the overall procedure.
Regarding the suitability of the ssMRCD covariances for local outlier detection we compare its performance of outlier detection to the local outlier detection methods of Filzmoser et al. (2013) (F), Ernst and Haesbroeck (2016)(EH) and the local outlier factor methodology of Breunig et al. (2000), canonically adapted to spatial neighborhoods as described in Schubert et al. (2012) (LOF).Both simulation setups vary in the parameters p, ν and N sim , respectively, and both swapping processes are used, each combination simulated 100 times.All methods considered compare each observation to k many of its neighboring observations which we will assume to be equal to 10 for all methods.For the ssMRCD we will assume the default values (λ = 0.5, N = 25, and W based on inverse-distances) from the prior setup for parameter sensitivity analysis.
The outlier classification method of Filzmoser et al. ( 2013) has a parameter β F which is the percentage of neighboring observations that a local outlier is allowed to be similar to.Here we use the value β F = 0.1, as proposed in Ernst and Haesbroeck (2016), meaning that 0.1k = 1 observation is allowed to be similar within the 10 nearest neighbors.For inliers the expected value of the isolation degree is β F .If the actual degree of isolation is higher than the expected value, this signals local outlyingness.As cutoff for classifying an observation as local outlier we use twice the value of β F , so 20%.This cutoff is less strict than in the simulation setup from Ernst and Haesbroeck (2016) who take a cutoff of three times the expected value, i.e. 30%.Note that the methodology of Filzmoser et al. (2013) mostly focuses on visual outlier detection tools, so the cutoff value chosen here might not be optimal.
For the regularized spatial detection technique by Ernst and Haesbroeck (2016), the parameter β EH , which gives the fraction of the most homogeneous neighborhoods included in the outlier detection procedure, is set to one.In the simulations we are interested in all outliers, and the heterogeneity in the simulated data should be comparable for all of the observations.Thus, only considering a fraction leads to non comparable results.Moreover, the simulation results of Ernst and Haesbroeck (2016) show that over all considered setups, β EH = 1 is also optimal.As regular covariance matrix estimator we use the MRCD with the default target matrix (equi-correlated target matrix) and α = 75%.
Last but not least, the non-parametric LOF, which calculates a local outlier factor for each observation based on a comparison of the so-called "local reachability density" with its k-nearest neighbors, needs a cutoff value.Since there is no fixed rule on how to choose a cutoff value, a local outlier factor above 1.5 determines an outlier.This value is also used in the original paper of Breunig et al. (2000).
The results for both simulation setups and the completely random switching are shown in Figures 6 and 7.The results for the swapping method of Ernst and Haesbroeck (2016) are shown in Figures 8 and 9.
Starting with the false positive rate (FPR), we see that the method of Ernst and Haesbroeck (2016) has some issues with classifying too many normal observations as outliers in nearly all settings.This is likely due to the very local covariance estimation which might be too strict in general leading to a strong swamping effect, especially in settings where there is no strong spatial correlation of the observed values.Interestingly, the behaviour of the FPR for the method of Filzmoser et al. (2013) depends on the data simulation setup.For the moving matrix setup, the FPR is rather high, for random fields it is very low.The LOF and the ssMRCD-based outlier detection method have reliable low FPR for all setups.
The outcome for the false negative rate (FNR) is quite different.While for Ernst and Haesbroeck (2016) the FNR is in many settings below all other methods, this might just be due to the high FPR.The method of Filzmoser et al. (2013) has a very high FNR in most scenarios even in those with a high FPR.Regarding LOF, the simulation setup has a strong effect on its performance.For the moving matrix setup we see a rather good performance compared to the other methods, while for random fields the FNR can hardly keep up with the other methods except the method of Filzmoser et al. (2013).The ssMRCD method is somewhere in between.Although the corresponding FNRs for the moving matrix setup with completely random swapping are not overwhelming, they are still in a reasonable range for the switching method of Ernst and Haesbroeck (2016).Moreover, the FNRs for the ssMRCD outlier detection technique in the other scenarios are compellingly low.
Comparing the F1-scores of the different methods, the best method to use in general depends on the scenario.While the three selected methods seem to have pitfalls in at least one simulation setup, the ssMRCD-based method is consistently showing reliable results and is mostly among the two best methods.Moreover, less extreme behavior occurs when it comes to the FNR and the FPR.Thus, the ssMRCD-based local outlier detection method could be the method of choice for standard outlier detection tasks.
Moreover, with many dimensions and observations efficiency in computing becomes important.The results of a short simulation study regarding the runtime of the four outlier detection techniques with parameter settings according to the prior analysis are shown in Figure 10 for the moving matrix scenario with N sim = 100.All methods need more computation time for increasing dimension p, especially the methods based on covariance estimation and inversion (EH, ssMRCD, F).Also, the number of observations seems to have a big effect on runtime, especially for the method of Filzmoser et al. (2013), possibly due to an inefficient implementation of finding the k-nearest neighbors.Although the local outlier factor (LOF) method is reliably fast, the ssMRCD seems to be comparably efficient, even though it involves a complex covariance estimation procedure.

Example
In this section we consider a data set provided by GeoSphere Austria (2022) The coordinates used for all methods are given in latitude and longitude.
We set k = 10 for all methods, i.e. we want to compare one observation with its ten closest neighbors, independent of the methodology used.Although for the method of Ernst and Haesbroeck (2016) it is possible to remove observations with comparably high levels of heterogeneity among the neighbors, we want to include all observations, thus setting β = 1.Even if there is increased heterogeneity among the neighbors, an observation might still be an interesting outlier clearly visible with the naked eye.Moreover, as mentioned in the prior section, the simulation results in Ernst and Haesbroeck (2016) show the best performance for high β.For the methodology of Filzmoser et al. (2013) in accordance with the simulation setup we allow for one of the ten neighbors to be similar to the local outlier.The cutoff value is again set to 0.2.We use the same cutoff value of 1.5 for the LOF as in the simulations.
For the ssMRCD local outlier detection method we use a grid based neighborhood structure for the covariance estimation.Due to the Alpine landscape especially in the Western parts of Austria we aim at a rather local covariance structure, thus choosing a rather fine grid with N = 21 neighborhoods and n i ≈ 8.7 observations per neighborhood on average.Other possible options could be based on underlying structures, e.g.due to historical or political reasons, or on other classifying methods like clustering of the spatial coordinates.Furthermore, we use inverse-distance weights for the weighting matrix W between neighborhoods based on their center and select the default smoothing degree of λ = 0.5 to gain enough smoothing but still keep the locality of the fine grid structure.
As an alternative to using the default value of λ = 0.5 we can set up a simulation procedure.Assuming that the real data is uncontaminated, we can swap observations similar to the simulation studies in Section 4 and define them as local outliers.We can apply the outlier detection technique with the ssMRCD and different choices of λ (this can also be applied to other parameter settings of the ssMRCD), and then analyse the fraction of found outliers and the total number of outliers.Since only focusing on the known FNR for the found outliers leads to an increased false positive rate, it is sensible to also take the total number of found outliers into account.A good value of λ is a trade-off between a low FNR and a comparatively low number of found outliers overall.Interestingly, this procedure endorses the choice of λ = 0.5 in this data set.
The resulting ssMRCD correlation matrices for each neighborhood can be seen in Figure 11.The observations and the tolerance ellipses of the ssMRCD correlation matrices are colorized according to their neighborhoods.Since we have dimension p = 6, we reduce the dimensionality to the first two eigenvectors v 1 and v 2 of the global MCD correlation matrix T , which is displayed at the upper left corner, hoping to depict most of the relevant variance.Moreover, a biplot of T is added at the right hand side to link the correlation matrices to our weather data.Here, singular points are assigned to a neighboring neighborhood.For each neighborhood a i the index i is placed at the center, and the tolerance ellipses of corresponding correlation matrices are plotted along the first and second eigenvector coordinate of T , which can be seen in the upper left corner as reference.The biplot of T is shown on the right hand side.The observations and ellipses are colorized according to their neighborhoods.Applying all four outlier detection methods leads to 24 observations in total classified as outliers.The most outliers (21) are classified by the method of Ernst and Haesbroeck (2016), the least (3) by Filzmoser et al. (2013), which is consistent with the simulation results regarding FPR and FNR, especially for the random fields setup.The distances which are used for outlier detection for each method and observation are shown in Figure 12.For further comparison of the results, the upper part of Figure 13 shows all 24 classified outliers with the corresponding ratio of distance value to cut-off value.Ratios above one are outliers.We can see that there are multiple weather stations that are classified as outliers only by the method of Ernst and Haesbroeck (2016) which lends itself to a notion consistent with the simulation results that there are some false positives among these weather stations.One example for a false positive could be panel b) in in the lower part of Figure 13.The station Feuerkogel (panel a)) was not detected by the method of Filzmoser et al. (2013), also consistent to the simulation results for the random fields setup and the generally high FNR.Interestingly, also LOF seems to have drawbacks and fails for example for the weather station Patscherkofel (panel c)), which was not detected as outlier.Nevertheless, the weather station Schoeckl (panel d)) was detected by all of them.When looking at Figure 14 we can find some explanation for the local outlyingness for two of the three local outlier stations and why panel b) might not be outlying.While the stations Schoeckl and Feuerkogel are rather exposed on higher altitudes than most of the surrounding k-nearest stations which can easily lead to different patterns regarding weather, the station Linz-Stadt (panel b)) is in a rather flat area similar to its neighbors.The station Patscherkofel is already deep in the Alpine area and is surrounded by other stations in valleys but also on mountains.Although from panel c) in Figure 13 it is evident that Patscherkofel differs significantly in wind velocity, it is not clear why it differs so much also from stations with similar altitude and exposure.

Conclusions
In this paper we enhance the limited toolbox for multivariate local outlier detection by extending the approaches of Filzmoser et al. (2013) and Ernst and Haesbroeck (2016).The developed ssMRCD based on the MRCD (Boudt et al., 2020) bridges the gap between fully local and fully global covariance matrices used in the pairwise MD by exchanging the extremely local covariance matrices used in Ernst and Haesbroeck (2016) with spatially smooth estimates.
We define the ssMRCD by means of a minimization problem and prove theoretical properties of the estimator, such as equivariance and breakdown point.A heuristic is provided for the stable convergence property of the proposed algorithm under reasonable spatial changes in underlying covariance matrices.Moreover, the methods of Filzmoser et al. (2013), Ernst and Haesbroeck (2016) and the ssMRCD outlier detection method are compared with the local outlier factor adapted for local outliers (Schubert et al., 2012) regarding outlier detection performance and computational efficiency for simulated data and real world data from Austrian weather stations.
While we support the conclusion of Ernst and Haesbroeck (2016) that it is difficult to select the "best" method for outlier detection techniques, the ssMRCD-based outlier detection technique seems to be the only method providing reliable (but still improvable) results over all simulation scenarios.Additionally, it is able to compete with the other methods regarding runtime even though the computation is quite complex.However, for a thorough real data analysis it is still preferable to use different outlier detection methods and compare the results in order to exploit all possible advantages of the available methods.Comparing results of multiple methodologies provides more insight in the data and significant local outliers can be classified with more reliability overall.
The ssMRCD covariance structure can be exploited also beyond local outlier detection.All covariance based methods that are sensible to adapt to spatial data can be extended by using the ssMRCD instead, e.g.spatial principal component analysis.A special case for the application of the ssMRCD might also be spatial data with structural breaks that need to be considered in the analysis.Finally, the presented ideas could also be transferred to a time series context, where the spatial dependency is replaced by the temporal dependency of multivariate time series, and the dependence structure could change over time.Such settings are usually quite challenging for outlier detection.Weather stations (AUT) A Proofs for Section 2 (Methodology) Theorem 1 (Affine equivariance).Let T be any robust, regular and affine equivariant estimate of the covariance for the data set X, here denoted as T (X).Then, the spatially smoothed MRCD estimators of location and covariance with target matrix T (X) are affine equivariant.
Proof.Let i be fixed, T (X) be an estimator of covariance as described above and Y := XA +1 n b be the transformed data matrix.Then, for any subset combination H and for all j = 1, . . ., N it holds that By using the multiplicative property of the determinant and det(A) = 0 we see that A is only a constant in the minimization problem and is not affecting the choice of the optimal combination of subsets, Together with Equation ( 9), affine equivariance is proven for the covariance estimator.Since the location estimator is defined as the arithmetic mean, which is affine equivariant, the property stated in Equation ( 5) is also fulfilled.
Theorem 2. Consider the raw MRCD estimator with fixed ρ > 0, regular and fixed T = QΛQ and the data matrix X n .Then, the following statements hold: Proof.c. Regarding notation see also Boudt et al. (2020).The eigenvalues for the transformed covariance matrix H), where S W (H) is the covariance matrix of a subset H of X, are bounded below by ρ > 0. Thus, λ 2 ≤ 1/ρ where . 2 denotes the spectral matrix norm.For covariance matrices the spectral norm is equal to its biggest eigenvalue.It follows that is also regular and fixed.This implies that for all i = 1, . . ., p .It follows that for all subsets H, specifically for the optimal subset H * .Thus, the eigenvalues of Σn = QΛ 1/2 K W * Λ 1/2 Q are also bounded away from zero, and it follows that the implosion breakdown point is 1.
b.It is clear that * n ( Σn ; X n ) ≤ (n − h + 1)/n since in this case there would always be at least one observation in the selected subset independent of its value spoiling the estimation.We need to show that * n ( Σn ; Since ln(λ 1 ( Σn (X n )) is constant and ln is monotonously increasing and unrestricted we can w.l.o.g.assume that the value inside of the absolute value is non-negative.Additionally, moving the constant ln(λ 1 ( Σn (X n ))) to the right hand side, Equation ( 10) is equivalent to the unboundedness of the biggest eigenvalue λ 1 ( Σn (X n,m )), Note that det(X) = p i=1 λ i (X) for any p-dimensional matrix X and that Σn (X n,m ) = (1 − ρ)c α Cov(X n,m;H * ) + ρT , where H * is the subset of observations of X n,m that minimize det((1 − ρ)c α Cov(X n,m;H ) + ρT ) over all subsets H.As shown above, the eigenvalues of (1 − ρ)c α Cov(X) + ρT are bounded away from zero for all X and ρ > 0, For the matrix X n,m;H * , it follows that Let the constant C be defined as where H = m + 1, . . ., m + h denote h indices of fixed and unchanged observations of X n,m , which exist due to m ≤ n − h.Then, due to condition (11) for C there exists This contradicts the minimization of the determinant property of the selected subset H * .Thus, * n ( Σn ; X n ) > (n − h)/n.a.Using the same argument as before, it is clear that * We have to find m = h many exchanged data points in a way that the norm of the location estimator is unbounded but the determinant of the covariance matrix is still minimal.For the fixed data set X n , we obtain the optimal subset H * of observations and add a fixed but arbitrarily large number L > 0 to the first coordinate of these m = h observations, ∀i ∈ H * : xi1 = x i1 + L and xij = x ij ∀j = 2, . . ., p.
Thus, the sample mean of the first coordinate of the selected subset X n,m;H * is equal to the original mean of the first coordinate of X n;H * plus L. Similarly, the sample covariance is the same as before given that we take the same subset H * , since it is independent of constant shifts applied to all used observations.This implies that also the regularized covariance and its determinant are the same which was minimal for all other subsets of X n .In order to show minimality of the subset H * for the new data matrix X n,m it follows that we only have to consider the subsets that have both original and exchanged (arbitrarily large) observations.
Regarding the sample mean of the first coordinate of one of these subsets H, it is where M 1 is the (fixed) mean of the first coordinate of the subset X n; H and both sums are not empty.Regarding the variance of the first coordinate, which is the first diagonal entry of the sample covariance matrix, we see Thus, the Frobenius norm of Cov(X n,m; H ) and also its regularization are O(L 2 ) and it follows for some constant β > 0 that due to equivalence of matrix norms in finite dimensional space and c being the constant from above.Choosing L arbitrarily large, we see that the determinant corresponding to a mixed subset is larger than the determinant of the optimal subset H * of only exchanged observations.Now suppose * n ( μn ; X n ) = m/n < min(h, n − h + 1)/n and start from Equation ( 12), ∀C > 0 ∃ x * 1 , . . ., x * m : μn (X n,m ) > C.This implies that ∀C > 0 ∃ x * 1 , . . ., x * m : where i j ∈ H * , j = 1, . . ., h.Thus, for all C > 0 there exists some x * ij whose norm is bigger than C. W.l.o.g.assume it is x * 1 and that the first coordinate is responsible, For m < h < n − h + 1 there would not be the possibility to only include exchanged points in the subset and it would always be possible to have a subset of h many original observations.This is also the case for m < n − h + 1 < h.Thus, there are at least one exchanged point x * 1 and one original point in H * .But with the same argument as before, the determinant of the mixed subset of original points and arbitrarily large points would eventually contradict optimality, because at one point the determinant would be so large that there would be an h-sized subset of original observations available to get a smaller determinant.Proof.For i = 1, . . ., N , the location estimate is the standard sample mean of h i many observations from neighborhood a i selected in a way to minimize the objective function (3).By exchanging observations in one neighborhood a i with arbitrarily large values and keeping the other neighborhoods the same (keeping the matrices K j bounded), we can apply the results of Theorem 2 for the MRCD structured covariance matrix K i .The breakdown point for K i is min(n i − h i + 1, h i )/n i .Thus, in order to make at least one of the location estimators useless we need to exchange a fraction min i=1,...,N min(n i − h i + 1, h i )/n i of observations of one neighborhood.Proof.Since the spatially smoothed MRCD covariance estimators K i are regularized on each neighborhood according to the MRCD approach, all eigenvalues are positive and bounded away from zero as long as the target matrix T is regular (see Theorem 2).Hence, none of the covariance estimators will ever be singular and the implosion breakdown point is 1.
For the second part let us fix neighborhood a i .Note, that the covariance estimator is defined as ΣSSM,n,i = (1 − λ)K i (H * ) + λ N j=1,j =i ω ij K j (H * ) for the optimal subset H * .The matrix K i = K i (H * ) is structured in an MRCD manner based on the sample covariance matrix of the subset H * i and the target matrix.Since we assume T to be fixed, for K i we can get arbitrarily large eigenvalues only under the same circumstances as for the MRCD (see Theorem 2).Thus, exchanging a fraction of (n i − h i + 1)/n i by arbitrary values can lead to arbitrarily large eigenvalues of K i .For the explosion breakdown point for one neighborhood covariance estimator ΣSSM,n,i (under general settings for W and λ) it is sufficient that at least one K j has reached its breakdown point.It follows, that the finite sample explosion breakdown point of ΣSSM,n,i is * n ( ΣSSM,n,i ; X n ) = min i=1,...,N {(n i − h i + 1)/n i }. (13) Since * n ( ΣSSM,n,i ; X n ) is already independent of i, the overall explosion breakdown point for the spatially smoothed MRCD covariance estimators is equal to * n ( ΣSSM,n,i ; X n ).
B Proofs for Section 3 (Algorithm and C-step) Theorem 5.For each j = 1, . . ., N , let ρ j ∈ (0, 1) and H 0 j be any starting subset of a j of respective size h j ∈ (n j /2, n j ).Let α j be the corresponding fraction of the observations used, α j = h j /n j .Let H 0 = (H 0 1 , . . ., H 0 N ) be the combination of the subsets and λ ∈ [0, 1) fixed.The target matrix T (X) is assumed to be positive definite, symmetric and fixed, and K j (H 0 ) is defined as in Equation ( 2) with T = T (X).Calculate the sample mean for each neighborhood a j , j = 1, . . ., N , based on the respective subset, m j (H 0 ) = 1 hj k∈H 0 j x k .
Fix neighborhood a i .For x ∈ a i , let the MD-based measure with the subset given by H 0 be defined as For a new subset H 1 i of a i of size h i with Proof.This proof is very much along the lines of Boudt et al. (2020).The neighborhood a i is fixed.Thus, the matrix which determinant should be minimized regarding i is where Ω is a fixed positive definite covariance matrix.
Since the original proof is not restricted to convex linear combinations, we can use the same proof with the matrices A 1 , A 2 and Ω in place of K 1 , K 2 and Λ and adapted factors w j = k ρ/h i , j = 1, . . ., h i = k/(p + 1), j = h i + 1, . . ., k.
with k = h i + p + 1.1 . , √ N sim }, of similar observations we construct a second spatial grid with each cell consisting of n sim /N sim observations on average.The borders of the areas for the first coordinate are defined as b 1 l = l 20 √ Nsim for l = 0, . . ., √ N sim .Analogously, the borders for the second coordinate are defined as b 2 m = m 20 √ Nsim for m = 0, . . .,

Figure 2 :
Figure2: Simulation scenarios with p = 2 and a 5% contamination rate.On the left hand side the simulation setup 1 is presented with contamination achieved through the swapping process described inErnst and Haesbroeck (2016), N sim = 25 and n sim = 41.The values printed on the left-most panel are corresponding to the parameter δ lm .On the right hand side, setup 2 with ν = 3 is shown with completely random swapping.

Figure 3 :
Figure 3: Different convergence behavior of objective function.A p = 5 dimensional setting with 5% contamination achieved with completely random swapping.Each line is the path of one initial set of H-sets along the C-step iteration in the algorithm in Section 3.

Figure 4 :
Figure 4: Analysis of runtime of the ssMRCD for setup 1 with 5% contamination rate, N sim = 25 and varying parameter values with default values λ = 0.5, p = 5, N = 25 and n = 1681 if not varied.The solid line is representing the mean of 100 repetitions, the edges of the gray band around the mean the 5% and 95% quantile.

Figure 5 :
Figure 5: Outlier detection performance based on the false negative, false positive rate and the F1-score of the ssMRCD outlier detection method for different parameter settings.Each point represents the arithmetic mean and the corresponding bars the 5th and 95th quantile of 100 simulations.For non-varying parameters the default settings are p = 5, N = 25 (comparable to n i ≈ 67) and λ = 0.5.

Figure 6 :Figure 7 :
Figure 6: False positive and false negative rate for all four outlier detection methods with contamination achieved through completely random switching for different scenarios.Each point represents the mean of 100 repetitions.

Figure 9 :Figure 10 :
Figure 9: F1-Score for all four outlier detection methods with contamination achieved through the switching method of Ernst and Haesbroeck (2016) for different scenarios.Each point represents the mean of 100 repetitions.

Figure 11 :
Figure11: On the left hand side, the countours of Austria and its districts are shown together with the imposed grid structure for the ssMRCD neighborhoods.Here, singular points are assigned to a neighboring neighborhood.For each neighborhood a i the index i is placed at the center, and the tolerance ellipses of corresponding correlation matrices are plotted along the first and second eigenvector coordinate of T , which can be seen in the upper left corner as reference.The biplot of T is shown on the right hand side.The observations and ellipses are colorized according to their neighborhoods.

Figure 12 :
Figure 12: Distance-distance plots with the outlyingness scores of EH (next distance), of LOF (local outlier factor) and of F (isolation degree) against the next distance of ssMRCD.Observations colored in orange are global outliers based on the robust MD with the MCD as covariance estimator.At the margins the distribution of the different outlyingness scores are depicted by histograms.

Figure 13 :
Figure 13: Upper part: Weather stations classified as outliers colorized according to their outlyingness score in relation to the cut-off value (OC) for all four methods and their global outlyingness.Lower part: in panels a)-d) four exemplary weather stations are selected to show differences on methodologies.Each variable is scaled to range [0, 1].The values of the selected outliers are colourized in orange, the corresponding 10 nearest weather stations in grey.

Figure 14 :
Figure 14: Altitude map of Austria with all weather stations and four selected outliers in red.Each red-dashed ellipse indicates the k nearest neighbors with whom the outlier is compared.
a.The MRCD location estimator μn has the finite sample breakdown point min(h, n − h + 1)/n.b.The MRCD covariance estimator Σn has the finite sample explosion breakdown point (n − h + 1)/n.c.The MRCD covariance estimator Σn has the finite sample implosion breakdown point 1.

Theorem 3 .
The location estimators μSSM,n,i N i=1 of the spatially smoothed MRCD have the finite sample breakdown point * n ( μSSM,n,iN i=1 ; X n ) = min i=1,...,N min(n i − h i + 1, h i )/n i .

Theorem 4 .
Given a fixed and regular target matrix T , the finite sample implosion breakdown point of the spatially smoothed MRCD covariance estimators ΣSSM,n,i − h i + 1)/n i .
We can change m ≤ n − h observations arbitrarily and denote the resulting matrix as X n,m = (x * 1 , . . ., x * m , x m+1 , . . ., x n ) , where x * 1 , . . ., x * m are the exchanged observations (w.l.o.g.placed in the first m rows).The supremum being infinite is equivalent to