Subset Multivariate Collective And Point Anomaly Detection

In recent years, there has been a growing interest in identifying anomalous structure within multivariate data streams. We consider the problem of detecting collective anomalies, corresponding to intervals where one or more of the data streams behaves anomalously. We first develop a test for a single collective anomaly that has power to simultaneously detect anomalies that are either rare, that is affecting few data streams, or common. We then show how to detect multiple anomalies in a way that is computationally efficient but avoids the approximations inherent in binary segmentation-like approaches. This approach, which we call MVCAPA, is shown to consistently estimate the number and location of the collective anomalies, a property that has not previously been shown for competing methods. MVCAPA can be made robust to point anomalies and can allow for the anomalies to be imperfectly aligned. We show the practical usefulness of allowing for imperfect alignments through a resulting increase in power to detect regions of copy number variation.


Introduction
In this article, we consider the challenge of estimating the location of both collective and point anomalies within a multivariate data sequence. The field of anomaly detection has attracted considerable attention in recent years, in part due to an increasing need to automatically process large volumes of data gathered without human intervention. See [6] and [21] for comprehensive reviews of the area. [6] categorises anomalies into one of three categories: global, contextual, or collective. The first two of these categories are point anomalies, i.e. single observations which are anomalous with respect to the global, or local, data context respectively. Conversely, a collective anomaly is defined as a sequence of observations which together form an anomalous pattern.
In this article, we focus on the following setting: we observe a multivariate time series x_1, ..., x_n ∈ R^p corresponding to n observations observed across p different components. Each component of the series has a typical behaviour, interspersed by time windows where it behaves anomalously. In line with the definition in [6], we call the behaviour within such a time window a collective anomaly. Often the underlying cause of such a collective anomaly will affect more than one, but not necessarily all, of the components. Our aim is to accurately estimate the location of these collective anomalies within the multivariate series, potentially in the presence of point anomalies. Examples of applications in which this class of anomalies is of interest include, but are not limited to, genetic data [2,15] and brain imaging data [22,1]. In genetics, it is of interest to detect regions of the genome containing an unusual copy number. Such genetic variations have been linked to a range of diseases including cancer [8]. To detect these variations, a copy number log-ratio statistic is obtained for all locations along the genome. Segments in which the mean significantly deviates from the typical mean, 0, are deemed variations [2]. Jointly analysing the data of multiple individuals can allow for the detection of shared, and even weaker, variations [15]. In brain data analysis, sudden shocks can lead to certain parts of the brain exhibiting anomalous activity [1] before returning to the baseline level.
Whilst it may be mathematically convenient to assume that anomalous structure occurs contemporaneously within the multivariate sequence, in practice one might expect some time delays (i.e. offsets or lags), as illustrated by Figure 1. In this article we will consider two different scenarios for the alignment of related collective anomalies across different components. The first, idealised setting is that concurrent collective anomalies perfectly align. That is, we can segment our time series into windows of typical and anomalous behaviour, and for each anomalous window the data from a subset of components will be collective anomalies. For some applications, however, it is more realistic to assume that concurrent collective anomalies start and end at similar but not necessarily identical time points; this is the second setting considered in this paper.
Current approaches aimed at detecting collective anomalies can broadly be divided into state space approaches and (epidemic) changepoint methods. State space models assume that a hidden state, which evolves following a  Markov chain, determines whether the time series' behaviour is typical or anomalous. Examples of state space models for anomaly detection can be found in [2] and [24]. These models have the advantage of providing very interpretable output in the form of probabilities of certain segments being anomalous. However, they are often slow to fit and typically require prior parameters which can be difficult to specify. The epidemic changepoint model provides an alternative detection framework, built on an assumption that there is a typical behaviour from which the model deviates during certain windows. Early contributions in this area include the work of [18]. Each epidemic changepoint can be modelled as two classical changepoints, one away from and one returning back to the typical distribution. Epidemic changepoints can therefore be inferred by using classical changepoint methods such as PELT [16], Binary Segmentation [23], or Wild Binary Segmentation [13] if the data is univariate and Inspect [26] or SMOP [20] if the data is multivariate. We note, however, that this approach can lead to sub-optimal power, as it is unable to exploit the fact that the typical parameter is shared across the non-anomalous segments [12].
Many epidemic changepoint methods are based on the popular circular binary segmentation (CBS) by [19], an epidemic version of binary segmentation. A cost-function based approach for univariate data, CAPA, was introduced by [12]. However, within a multivariate setting, the theoretically detectable changes can potentially be very sparse, with a few components exhibiting strongly anomalous behaviour. Alternatively they might also be dense, i.e. with a large proportion of components exhibiting potentially very weak, anomalous behaviour [5,15]. A range of CBS-derived methods for detecting epidemic changes has therefore been proposed such as Multisample Circular Binary Segmentation [27] for dense changes, LRS [14] for sparse changes, and higher criticism [9] based methods like PASS [15] for both types of changes.
This article makes three main contributions, the first of which is the derivation of a moment function based test for the presence of subset multivariate collective anomalies with finite sample guarantees on false positives and power against both dense and sparse changes. The second contribution is the introduction of a new algorithm Multi-Variate Collective And Point Anomalies (MVCAPA), capable of detecting both point anomalies and potentially lagged collective anomalies in a computationally efficient manner. Finally, the article provides finite sample consistency results for MVCAPA under certain settings which are independent of the number of collective anomalies K.
The article is organised as follows: we begin by introducing our modelling framework for (potentially lagged) multivariate collective anomalies in Section 2.1, and define the penalised saving statistic for the restricted case of at most one anomalous segment without lags. Section 2.2 introduces penalty regimes, prior to examining their power in Section 2.3. We then turn to the problem of inferring multiple collective anomalies as well as point anomalies in Section 3, before discussing computational aspects in Section 4 and establishing finite sample consistency results in Section 5. We extend the penalised saving statistic to lagged anomalies in Section 6 and compare MVCAPA to other approaches in Section 7. We conclude the article by applying MVCAPA to CNV data in Section 8. All proofs can be found in the supplementary material at the end of this paper. MVCAPA has been implemented in the R package anomaly, which is available from https://github.com/Fisch-Alex/anomaly.

Penalised Cost Approach
We begin by focussing on the case where collective anomalies are perfectly aligned. We consider a p-dimensional data source for which n time-indexed observations are available. A general model for this situation is to assume that the observation x_t^{(i)} ∈ R, where 1 ≤ t ≤ n and 1 ≤ i ≤ p index time and components respectively, has probability density function f_i(x, θ^{(i)}(t)), and that the parameter, θ^{(i)}(t), depends on whether the observation is associated with a period of typical behaviour or an anomalous window. We let θ_0^{(i)} denote the parameter associated with component i during its typical behaviour. Let K be the number of anomalous windows, with the k-th window starting at s_k + 1 and ending at e_k and affecting the subset of components denoted by the set J_k. We assume that anomalous windows do not overlap, so e_k ≤ s_{k+1} for k = 1, ..., K − 1. If collective anomalies of interest are assumed to be of length at least l ≥ 1, we also impose e_k − s_k ≥ l. Our model then assumes that the parameter associated with observation x_t^{(i)} is

θ^{(i)}(t) = θ_k^{(i)}, if s_k < t ≤ e_k and i ∈ J_k for some k = 1, ..., K, and θ^{(i)}(t) = θ_0^{(i)} otherwise. (1)
We start by considering the case where there is at most one collective anomaly, i.e. where K ≤ 1, and introduce a test statistic to determine whether a collective anomaly is present and, if so, when it occurred and which components were affected. The methodology will be generalised to multiple collective anomalies in Section 3. Throughout we assume that the parameter, θ 0 , representing the sequence's baseline structure, is known. If this is not the case, an approximation can be obtained by estimating θ 0 robustly over the whole data, as in [12].
Given the start and end of a window, (s, e), and the set of components involved, J, we can calculate the log-likelihood ratio statistic for the collective anomaly. To simplify notation, we introduce the cost C(x_{(s+1):e}^{(i)}, θ) = −2 ∑_{t=s+1}^{e} log f_i(x_t^{(i)}, θ), and the resulting saving S_i(s, e) = C(x_{(s+1):e}^{(i)}, θ_0^{(i)}) − min_θ C(x_{(s+1):e}^{(i)}, θ). The log-likelihood ratio statistic is then ∑_{i∈J} S_i(s, e). As the start and end of the window, and the set of components affected, are unknown a priori, it is natural to maximise the log-likelihood ratio statistic over the range of possible values for these quantities. However, in doing so we need to take account of the fact that different J will allow different numbers of components to be anomalous, and hence will allow maximising the log-likelihood, or equivalently minimising the cost, over differing numbers of parameters. This suggests penalising the log-likelihood ratio statistic differently depending on the size of J. That is, we calculate

max_{(s,e): e−s ≥ l} max_{J} [ ∑_{i∈J} S_i(s, e) − P(|J|) ], (2)

where P(k) = α + ∑_{i=1}^{k} β_i is a suitable positive penalty function of the number of components that change, and l is the minimum segment length. We will detect a change if (2) is positive, and estimate its location and the set of anomalous components from the values of s, e, and J that maximise (2). Whilst we have motivated this procedure based on a log-likelihood ratio statistic, the idea can be applied with more general cost functions. For example, if we believe anomalies result in a change in mean, then we could use cost functions based on quadratic loss. Suitable choices for the penalty function, P(·), will be discussed in Section 2.2.
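As an illustration, the single-anomaly search above can be written out directly for the change-in-mean case with known typical mean 0 and unit variance. The function names and the brute-force search over (s, e) are our own sketch, not the package's implementation; the key point is that, for a fixed window, the optimal subset J of any size consists of the components with the largest savings, so one sort suffices.

```python
import numpy as np

def component_savings(x, s, e):
    """Per-component saving for a change in mean over window (s, e]:
    S_i(s, e) = (e - s) * mean(x_i)^2, assuming typical mean 0, variance 1."""
    seg = x[s:e]                      # rows: time, cols: components
    return (e - s) * seg.mean(axis=0) ** 2

def penalised_saving(x, alpha, beta, min_len=2):
    """Maximise sum_{i in J} S_i(s, e) - P(|J|) over (s, e, J), where
    P(k) = alpha + beta_1 + ... + beta_k.  Returns (value, (s, e, |J|))."""
    n, p = x.shape
    best, argmax = -np.inf, None
    for s in range(n):
        for e in range(s + min_len, n + 1):
            sav = np.sort(component_savings(x, s, e))[::-1]
            # the best subset of size k is always the top-k savings
            vals = np.cumsum(sav - beta) - alpha
            k = int(np.argmax(vals))
            if vals[k] > best:
                best, argmax = vals[k], (s, e, k + 1)
    return best, argmax
```

Because the candidate subsets are nested once the savings are sorted, the maximisation over J costs only O(p log p) per window, in line with the computational discussion in Section 4.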
The choice of β_1, ..., β_p and α has a significant impact on the performance of the statistic. For example, if we set β_i = 0 for i = 2, ..., p, then our approach will assume all components are anomalous in any anomalous segment. Clearly, α and β_1 are only well specified up to their sum, and α can be absorbed into β_1 without altering the properties of our statistic. However, keeping them separate can have computational advantages. Indeed, it removes the need for sorting when all the β_i are identical, which can be appropriate in certain settings discussed in the next section.

Choosing Appropriate Penalties
We now turn to the problem of choosing appropriate penalties for the methodology introduced in the previous section. Penalties are typically chosen in a way which controls false positives under the null hypothesis that no anomalous segments are present. However, the setting in which the penalty is to be used can also play an important role. In particular, it is typically required that under the null hypothesis

P(K̂ = 0) ≥ 1 − A e^{−ψ}, (3)

where K̂ is the number of inferred collective anomalies, A is a constant, and ψ := ψ(p, n) increases with n and/or p, depending on the setting. For example, in panel data [3] or brain imaging data [1], the number of time points n may be small but we may have data from a large number of variates, p. Setting ψ(p, n) ∝ log(p) is therefore a natural choice, so that the false positive probability tends to 0 as p increases. In a streaming data context, the number of sampled components p is typically fixed, while the number of observations n increases. Setting ψ(p, n) ∝ log(n) is then a natural choice.
In an application where both n and p are large, or can increase, setting ψ(p, n) ∝ log(n) + log(p) would be a natural choice. Our approach to constructing appropriate penalty regimes is to construct a bound on the probability that ∑_{i=1}^{k} (S_{(i)}(x_{(s+1):e}) − β_i) − α > 0 under the null, separately for all k = 1, ..., p and all windows (s, e), where S_{(1)} ≥ ... ≥ S_{(p)} denote the ordered savings. We will derive explicit results for the case in which collective anomalies are characterised by an anomalous mean, arguably one of the most common settings in applications, and point to generalisations where appropriate. As mentioned in the previous section, we assume throughout this section that the parameter θ_0 of the typical distribution is known.
Assume without loss of generality that the typical mean is equal to 0 and the typical standard deviation is equal to 1. Under the null model, which we denote M_0, we therefore have that x_t^{(c)} = ε_t^{(c)} for 1 ≤ t ≤ n and 1 ≤ c ≤ p, where ε_t^{(c)} is Gaussian white noise. Denote the mean of component m within a window (s, e) by x̄^{(m)}_{(s+1):e}. The cost function obtained if we model the data in this way is the residual sum of squares. Under this cost function, for any 0 ≤ s < e ≤ n, the saving is

S_m(s, e) = (e − s) (x̄^{(m)}_{(s+1):e})².

Furthermore, under M_0, this saving is χ²₁-distributed. We present three distinct penalty regimes that give valid choices for the penalties in this setting, each with different properties. All penalty regimes are indexed by a parameter ψ, which corresponds to the exponent of the probability bound, as in (3). A first strategy for defining penalties consists of using a single global penalty. Penalty Regime 1: α = p + 2√(pψ) + 2ψ and β_m = 0, for 1 ≤ m ≤ p.
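The claim that the saving is χ²₁-distributed under M_0 can be checked with a quick Monte Carlo sketch, assuming (as above) a known typical mean of 0 and variance 1:

```python
import numpy as np

# Under the null model M0 each observation is N(0, 1), so the saving
# S_m(s, e) = (e - s) * xbar^2 is the square of a standard normal
# variable, i.e. chi-squared with one degree of freedom.
rng = np.random.default_rng(7)
n_rep, length = 50000, 25

xbar = rng.standard_normal((n_rep, length)).mean(axis=1)
savings = length * xbar ** 2      # one simulated saving per replication

# A chi^2_1 variable has mean 1 and variance 2; the simulated savings
# should match these theoretical moments closely.
```

The same simulation device underlies the practical tuning discussed at the end of this section: penalties can be rescaled until the false positive rate on such null data matches a user-specified level.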
The following result, based on standard tail bounds for the χ²_p distribution, shows that this penalty regime controls the overall type 1 error rate at the desired level. Proposition 1. Let x follow M_0 and let K̂ denote the number of inferred collective anomalies under penalty regime 1. Then, there exists a constant A_1 such that P{K̂ = 0} ≥ 1 − A_1 n² e^{−ψ}. Under this penalty, we would infer that any detected anomalous region affects all components. This is inappropriate, and is likely to lead to a lack of power, if we have anomalous regions that only affect a small number of components. For this type of anomaly, the following regime offers a good alternative, as it typically penalises fitting collective anomalies with few components substantially less. Penalty Regime 2: α = 2(1 + ε)ψ and β_m = 2 log(p), for 1 ≤ m ≤ p, where ε > 0 is a small constant.
The following result, based on tail bounds for the χ²₁ distribution, shows that this penalty controls false positives: Proposition 2. Let x follow M_0 and let K̂ denote the number of inferred collective anomalies under penalty regime 2. Then, there exists a constant A_2 such that P{K̂ = 0} ≥ 1 − A_2 n² e^{−ψ}.
Comparing penalty regime 2 with penalty regime 1, we see that it has a lower penalty for small m, but a much higher penalty for m ≫ p/log(p). As such, it has higher power against collective anomalies affecting few components, but low power if the collective anomalies affect most components. Taking the point-wise minimum of the two penalty functions therefore provides a balanced approach, with power against both types of changes (see Figure 2). Moreover, this approach can be generalised to settings other than the change in mean case, provided the distribution of the savings is sub-exponential under the null distribution. Indeed, the first penalty regime is derived from a Bernstein bound and the second from an exponential Chernoff bound on the tail.
For the special case considered here, in which the savings follow a χ²₁ distribution under the null hypothesis, a third penalty regime, based on the quantiles of this null distribution, can be derived. As can be seen from Figure 2, this penalty regime provides a good alternative to the other two, especially in intermediate cases. It can be generalised to other distributions provided quantiles of S(s, e) can be estimated and exponential bounds for (S(s, e) − a)₊ can be derived under the null hypothesis.
In practice, to maximise power, we choose α, β_1, ..., β_p so that the resulting penalty function P(k) is the point-wise minimum of the penalty functions P_1(k), P_2(k), and P_3(k) resulting from penalty regimes 1, 2, and 3 respectively. We call this the composite regime. It is a corollary of Propositions 1, 2, and 3 that this composite penalty regime achieves P{K̂ = 0} ≥ 1 − (A_1 + A_2 + A_3) n² e^{−ψ} when x follows M_0. It should be noted that the n² term comes from a Bonferroni correction over all possible start and end points. Fixing a maximum segment length M therefore reduces the n² to nM.
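The composite regime can be sketched as the point-wise minimum of the regime-specific penalty functions. Regime 3 is omitted here since its exact form depends on the quantiles of the null distribution; the functions below sketch regimes 1 and 2 only, with a hypothetical default ε = 0.1:

```python
import numpy as np

def regime1(k, p, psi):
    # Global penalty: alpha = p + 2*sqrt(p*psi) + 2*psi, beta_i = 0,
    # so the total penalty is constant in the subset size k.
    return p + 2 * np.sqrt(p * psi) + 2 * psi + 0 * k

def regime2(k, p, psi, eps=0.1):
    # alpha = 2*(1+eps)*psi, beta_i = 2*log(p) per affected component.
    return 2 * (1 + eps) * psi + 2 * np.log(p) * k

def composite(k, p, psi):
    # Point-wise minimum: cheap for sparse anomalies (regime 2 branch)
    # and for dense anomalies (regime 1 branch).
    return np.minimum(regime1(k, p, psi), regime2(k, p, psi))
```

For small k the linear regime-2 penalty is the cheaper branch, while for k close to p the flat regime-1 penalty takes over, mirroring the trade-off discussed above.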
Whilst the above propositions give guidance regarding the shape of the penalty function P(·), as well as a finite sample bound on the probability of false positives, they do not give an exact false positive rate. A user-specified rate can nevertheless be obtained by scaling the penalties β_1, ..., β_p and α by a constant. This single constant can then be tuned using simulations on data exhibiting no anomalies.

Results on Power
We now examine the power of the penalised saving statistic at detecting an anomalous segment x_{s+1}, ..., x_e when using the penalties defined in the previous section. In particular, we examine its behaviour under a fixed n and large p regime. It is well known [15,9,5] that different regimes determining the detectability of collective anomalies apply in this setting, depending on the proportion of affected components. We follow the asymptotic parametrisation of [15] and therefore assume that the proportion of affected components is p^{−ξ} for some ξ ≥ 0. Typically [15], changes are characterised as either sparse or dense. In a sparse change, only a few components are affected. Such changes can be detected based on the savings of those few components being larger than expected after accounting for multiple testing. The affected components therefore have to experience strong changes to be reliably detectable. On the other hand, a dense change is a change in which a large proportion of components exhibit anomalous behaviour. The changes affecting the individual components can consequently be weaker and still be detected by averaging over the affected components. A well defined boundary between the two cases exists, with ξ ≤ 1/2 corresponding to dense and ξ > 1/2 corresponding to sparse changes [15,10]. Depending on the setting, the size of the change in mean is parametrised by a rate r_p ∈ R. Both [15] and [5] derive detection boundaries for r_p, separating changes that are too weak to be detected from those strong enough to be detected. For the case in which the standard deviation in the anomalous segment is the same as the typical standard deviation, the detectability boundaries correspond to

ρ⁺(ξ) = ξ − 1/2 for 1/2 < ξ ≤ 3/4, and ρ⁺(ξ) = (1 − √(1 − ξ))² for 3/4 < ξ < 1,

for the sparse case, and ρ⁺ = 1/2 − ξ for the dense case (0 ≤ ξ ≤ 1/2).
The following proposition establishes that the penalised saving statistic has power against all sparse changes within the detection boundary, as well as against most dense changes within the detection boundary. Proposition 4 shows that the number of collective anomalies, K̂, estimated by MVCAPA using the composite penalty regime with ψ = 2 log(n) + 2 log(log(p)) on the data x_1, ..., x_n, satisfies P(K̂ ≥ 1) → 1 as p → ∞. Whilst the above assumed n to be fixed, the result also holds if n = o(p). Moreover, rather than requiring μ_i to be 0 or a common value μ, it is trivial to extend the result to the case where μ_1, ..., μ_p are i.i.d. random variables whose magnitude exceeds μ with probability p^{−ξ}. It is worth noticing that the third penalty regime is required to obtain optimal power against the intermediate sparse setting 1/2 < ξ ≤ 3/4.

Inference for Multiple Anomalies

Inference for K Collective Anomalies and Point Anomalies
We now turn to the problem of generalising the methodology introduced in Section 2.1 to infer multiple collective anomalies. We will then borrow methodology from [12] to incorporate point anomalies within the inference. A natural way of extending the methodology introduced in Section 2.1 to infer multiple collective anomalies is to maximise the penalised saving jointly over the number and location of potentially multiple anomalous windows. That is, we infer K̂, (ŝ_1, ê_1, Ĵ_1), ..., (ŝ_K̂, ê_K̂, Ĵ_K̂) by directly maximising

∑_{k=1}^{K̂} [ ∑_{i∈Ĵ_k} S_i(ŝ_k, ê_k) − P(|Ĵ_k|) ], (5)

subject to ê_k − ŝ_k ≥ l and ê_k ≤ ŝ_{k+1}. It is well known from the literature that many methods designed to detect changes, or collective anomalies, are vulnerable to point anomalies [11,12]. Distinguishing between point and collective anomalies only makes sense if they are different, that is, if collective anomalies are longer than length 1. Under such an assumption, our approach can be extended to model both point and collective anomalies.
Borrowing the approach of [12], a point anomaly can be modelled as an epidemic changepoint of length 1 occurring during a segment of typical behaviour. Joint inference on collective and point anomalies can then be performed by maximising the penalised saving

∑_{k=1}^{K̂} [ ∑_{i∈Ĵ_k} S_i(ŝ_k, ê_k) − P(|Ĵ_k|) ] + ∑_{t∈Ô} ( S′(x_t) − β′ ), (6)

with respect to K̂, (ŝ_1, ê_1, Ĵ_1), ..., (ŝ_K̂, ê_K̂, Ĵ_K̂), and the set of point anomalies Ô, subject to ê_k − ŝ_k ≥ l and ê_k ≤ ŝ_{k+1}. Here, S′(x_t) is the saving of introducing a point anomaly at time t and β′ is its penalty.
For example, when collective anomalies are characterised by changes in mean, S′(x_t) can again be based on the quadratic-loss saving of fitting an anomalous mean at the single time point t. The penalised savings in (5) and (6) charge the penalty β′ for each point anomaly, as we assume point anomalies to be sparse. Suitable choices for β′ will be discussed in the next subsection.

Penalties for Point Anomalies
Penalties for point anomalies can be chosen with the aim of controlling false positives under the null hypothesis that no collective or point anomalies are present. When collective anomalies are characterised by a change in mean, the null hypothesis is identical to that in Section 2.2. The following proposition holds for any penalty β′: Proposition 5. Let M_0 hold and let Ô denote the set of point anomalies inferred by MVCAPA using β′ as the penalty for point anomalies. Then, there exists a constant A′ such that P{Ô = ∅} ≥ 1 − A′ n p e^{−β′/2}. This suggests setting β′ = 2 log(p) + 2ψ, where ψ is as in Section 2.2.

Computation
We now turn to the problem of maximising the penalised saving introduced in the previous section. The standard approach to extend a method for detecting a single anomalous window to detecting multiple anomalous windows is circular binary segmentation (CBS, [19]), which repeatedly applies the method for detecting a single anomalous window or point anomaly. Such an approach is equivalent to using a greedy algorithm to approximately maximise the penalised saving and has a computational cost of O(Mn), where M is the maximal length of collective anomalies and n is the number of observations. Consequently, the runtime of CBS is O(n²) if no restriction is placed on the length of collective anomalies. We will show in this section that we can directly maximise the penalised saving by using a pruned dynamic programme. This enables us to jointly estimate the anomalous windows, at the same or a lower computational cost than CBS. 1. Only collective anomalies. The penalised saving defined in (5) can be maximised exactly using a dynamic programme. Indeed, defining C(m) to be the largest penalised saving for all observations up to, and including, time m, and setting C(0) = 0, the following recursion holds:

C(m) = max( C(m − 1), max_{0 ≤ t ≤ m−l} [ C(t) + max_J { ∑_{i∈J} S_i(t, m) − P(|J|) } ] ).

It should be noted that calculating the inner maximisation over J is, on average, an O(p log(p)) operation, since it requires sorting the savings made from introducing a change in each component. This sorting is not required when all β_i are identical. Setting all β_i to the same value, as in penalty regime 1, therefore reduces this cost to O(p). For a maximum segment length M, the computational cost of this dynamic programming approach scales like O(Mn). If no maximum segment length is specified, it scales quadratically in n. It therefore achieves the same run-time as CBS in both cases.
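A minimal, unpruned implementation of this recursion for the change-in-mean setting (typical mean 0, unit variance) might look as follows; `mvcapa_dp` is a hypothetical name for this sketch, not the package's function:

```python
import numpy as np

def mvcapa_dp(x, alpha, beta, min_len=2, max_len=100):
    """Exact dynamic programme: C(m) = max(C(m-1), max_t [C(t) + PS(t, m)]),
    where PS(t, m) is the best penalised saving of a window (t, m]."""
    n, _ = x.shape
    C = np.zeros(n + 1)          # C[m]: optimal penalised saving up to time m
    anoms = {0: []}              # anomalous windows achieving C[m]
    for m in range(1, n + 1):
        C[m], anoms[m] = C[m - 1], anoms[m - 1]
        for t in range(max(0, m - max_len), m - min_len + 1):
            # change-in-mean savings of window (t, m], sorted decreasingly
            sav = np.sort((m - t) * x[t:m].mean(axis=0) ** 2)[::-1]
            vals = np.cumsum(sav - beta) - alpha   # best subset of each size
            k = int(np.argmax(vals))
            if C[t] + vals[k] > C[m]:
                C[m] = C[t] + vals[k]
                anoms[m] = anoms[t] + [(t, m, k + 1)]
    return C[n], anoms[n]
```

Each recovered window reports its start, end, and the number of affected components; in practice θ_0 would first be estimated robustly, as noted in Section 2.1.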
2. Collective and point anomalies. The saving in (6) can be maximised exactly via a slight modification of the previous dynamic programme. Indeed, writing C(m) for the largest penalised saving of all observations up to and including time m, the recursion

C(m) = max( C(m − 1), C(m − 1) + S′(x_m) − β′, max_{0 ≤ t ≤ m−l} [ C(t) + max_J { ∑_{i∈J} S_i(t, m) − P(|J|) } ] )

holds, with C(0) = 0. The previous observations regarding the computational complexity in M and n remain valid.
3. Pruning the dynamic programme. Solving the whole dynamic programme when no maximum segment length is specified has a computational cost increasing quadratically in n. However, the solution space of the dynamic programme can be pruned in a fashion similar to [16] and [12] to reduce this computational cost. Indeed, suppose the saving satisfies S(a, c) ≤ S(a, b) + S(b, c) for all x and all a, b, c such that b − a ≥ l and c − b ≥ l. A wide range of cost functions, such as the negative log-likelihood and the sum of squares, satisfy this condition. Under it, the following holds: if for some t there exists an m ≥ t + l such that C(t) + max_J { ∑_{i∈J} S_i(t, m) − P(|J|) } ≤ C(m), then t can be dropped as an option from the dynamic programme for all steps after step m + l, thus reducing the cost of the algorithm. As a result of this pruning, we found the runtime of MVCAPA to be close to linear in n when the number of collective anomalies increased linearly with n.
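The pruning step can be sketched as follows; `prune_candidates` is a hypothetical helper, under the same change-in-mean assumptions as before, that drops candidate start points which can no longer improve on the running optimum C(m):

```python
import numpy as np

def prune_candidates(cands, C, x, m, alpha, beta, min_len=2):
    """Keep only candidate start points t that are still competitive with
    the current optimum C[m]; the rest are dropped (pruning condition)."""
    keep = []
    for t in cands:
        if m - t < min_len:          # window too short to evaluate yet
            keep.append(t)
            continue
        # best penalised saving of window (t, m] over subsets of components
        sav = np.sort((m - t) * x[t:m].mean(axis=0) ** 2)[::-1]
        ps = np.max(np.cumsum(sav - beta)) - alpha
        if C[t] + ps >= C[m]:        # still competitive: keep
            keep.append(t)
    return keep
```

Candidates failing the test cannot form part of an optimal segmentation at any later step, which is what makes the observed near-linear runtime possible.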

Accuracy of Detecting and Locating Multiple Collective Anomalies
Whilst we have shown our method has good properties when detecting a single anomalous window, it is natural to ask whether the extension to detecting multiple anomalous windows can consistently infer the number of anomalous windows and accurately estimate their locations. Developing such results for the joint detection of sparse and dense collective anomalies is notoriously challenging, as can be seen from the fact that previous work on this problem [15] has not provided any such results. Another novel feature of our analysis is that the results allow the number of anomalous segments K to increase, whereas most results in the related changepoint literature (e.g. [13]) assume K to be fixed.
Consider a multivariate sequence x_1, ..., x_n ∈ R^p of the form

x_t = μ(t) + η_t, (7)

where the mean μ(t) follows a subset multivariate epidemic changepoint model with K epidemic changepoints in mean. For simplicity, we assume that within an anomalous window all affected components experience the same change in mean, and that the noise process is i.i.d. Gaussian, although the results extend to sub-Gaussian noise.
Consider also the following choice of penalty, which is very similar to the point-wise minimum of penalty regimes 1 and 2, and which we label (8). Here, C is a constant, ψ := ψ(n, p) sets the rate of convergence, and k* denotes the threshold, in terms of the number of affected components, separating sparse changes from dense changes. Anomalous regions can be easier or harder to detect depending on the strength of the change in mean characterising them and the number of components they affect. This intuition can be quantified by the signal strength Δ_k of the kth anomalous region, which combines the length of the region, the magnitude of the mean change, and the number of affected components. The following consistency result then holds. Theorem 1. There exists a global constant C such that the inferred partition τ̂ = {(ŝ_1, ê_1, Ĵ_1), ..., (ŝ_K̂, ê_K̂, Ĵ_K̂)}, obtained by applying MVCAPA with the penalty regime specified in (8) to data x following the model in (7), consistently recovers the number and the locations of the anomalous windows, provided the signal strengths Δ_1, ..., Δ_K are sufficiently large. The result is proved in the appendix. This finite sample result holds for a fixed C, which is independent of n, p, K, and Δ_k. When ψ = log(p), the threshold k* is identical to that in [15]. However, if ψ is chosen to increase with log(n), so will k*. This formalises the intuition that when n ≫ p, all changes are in some sense sparse.

Incorporating Lags
So far, we have assumed that collective anomalies follow the model specified in (1), which assumes all anomalous windows are perfectly aligned. In some applications, such as vibrations recorded by seismographs at different locations, certain components will start exhibiting atypical behaviour later and/or return to typical behaviour earlier. An example can be found in Figure 1b. It is possible to extend the model in (1) to allow for this behaviour by allowing lags in the start and end of each window for each affected component. Here, the start and end lags of the ith component during the kth anomalous window are denoted by d_k^{(i)} and f_k^{(i)} respectively. They satisfy 0 ≤ d_k^{(i)}, f_k^{(i)} ≤ w, for some maximum lag-size, w, and s_k + d_k^{(i)} < e_k − f_k^{(i)}. The remaining notation is as before.
We can extend our penalised likelihood/penalised cost approach to this setting. We begin by extending the test statistic defined in Section 2.1 and the inference procedure in Section 3 to allow for lags of up to w, before discussing modifications to the penalties. We conclude this section by introducing ways of making the method computationally efficient, leading to a computational cost increasing only linearly in w.

Extending the Test Statistic
The statistic introduced in Section 2.1 can easily be extended to incorporate lags. The only modification required is to re-define the saving S_i(s, e) to be the maximum, over start and end lags of up to w observations, of the saving of the lagged window, minus a penalty γ for each lag introduced, where w is the maximal allowed lag. We then infer K̂, (ŝ_1, ê_1, d̂_1, f̂_1, Ĵ_1), ..., (ŝ_K̂, ê_K̂, d̂_K̂, f̂_K̂, Ĵ_K̂), and the set of point anomalies Ô, by directly maximising the resulting penalised saving, given in (11), subject to the same minimum-length and ordering constraints as before.
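For a single component, the lag-extended saving can be sketched as a maximisation over start and end lags. Charging γ once per nonzero lag is one possible convention, assumed here purely for illustration:

```python
import numpy as np

def saving(x_i, s, e):
    # change-in-mean saving for one component over window (s, e]
    return (e - s) * x_i[s:e].mean() ** 2

def lagged_saving(x_i, s, e, w, gamma, min_len=2):
    """Maximise the saving over start lags d and end lags f of up to w,
    charging gamma for each nonzero lag (an illustrative convention)."""
    best = -np.inf
    for d in range(w + 1):
        for f in range(w + 1):
            if (e - f) - (s + d) < min_len:   # lagged window too short
                continue
            val = saving(x_i, s + d, e - f) - gamma * ((d > 0) + (f > 0))
            best = max(best, val)
    return best
```

When the component's anomalous stretch starts later than the shared window, the lagged saving exceeds the unlagged one whenever the gain from trimming the window outweighs the lag penalty γ.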

Modifying the Penalties
As discussed in Section 2.2, the main purpose of the penalties is to account for multiple testing. Introducing lags means searching over more possible start and end points, i.e. testing more hypotheses. Consequently, increased penalties are required. A modified version of penalty regime 2, which we call penalty regime 2′, can be used to account for lags by inflating the penalties by a term of order log(w + 1); here, as before, ε > 0 is a (small) constant. The following proposition shows that penalty regime 2′ controls false positives at the desired level.
An alternative to this penalty regime consists of using penalty regime 2, but setting the lag penalty γ = 2(1 + ε) log(w + 1).
Unlike penalty regime 2, which is based on a tail bound, regimes 1 and 3 are based on Bernstein-type bounds on the mean. However, the moment generating function of the maximum of correlated χ²-distributed random variables is not analytically tractable. Consequently, we limited ourselves to extending regime 2.

Computational Considerations
The dynamic programming approach described in Section 4 can also be used to minimise the penalised negative saving in Equation (11). Solving the dynamic programme requires the computation of S_i(t, m) for 1 ≤ i ≤ p and all permissible t at each step. Computing these savings from scratch every time would lead the computational cost of the dynamic programme to scale quadratically in (w + 1).
However, it is possible to reduce the computational cost of including lags by storing the savings of all candidate lagged windows. These can then be updated in each step of the dynamic programme at a cost of at most O(np). From these, it is possible to calculate all S_i(t, m) required for a step of the dynamic programme in just O(np(w + 1)) comparisons. This reduces the computational cost of each step of the dynamic programme to O(pn(w + 1) + pn log(p)) or O(pn(w + 1)), depending on the type of penalty used. Crucially, only the comparatively cheap operations of allocating memory and finding the maximum of two numbers now scale with w + 1, and even this relationship is only linear.
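The savings for all lag combinations can be computed from cumulative sums rather than by re-summing each window, which is the idea behind the cost reduction described above. The sketch below (our own helper, assuming the change-in-mean saving) computes S_i(s + d, e − f) for every component and every lag pair:

```python
import numpy as np

def all_lagged_savings(x, s, e, w):
    """Compute S_i(s+d, e-f) for all components i and all lags d, f <= w
    from a single pass of cumulative sums."""
    n, p = x.shape
    # csum[t] holds the column sums of x[:t], so window sums are O(1)
    csum = np.vstack([np.zeros(p), np.cumsum(x, axis=0)])
    out = np.empty((w + 1, w + 1, p))
    for d in range(w + 1):
        for f in range(w + 1):
            a, b = s + d, e - f
            length = b - a
            out[d, f] = length * ((csum[b] - csum[a]) / length) ** 2
    return out
```

Each window sum is obtained as a difference of two cumulative sums, so the work per lag pair is O(p) regardless of the window length.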

Pruning the Dynamic Programme
Even when lags are included in the model, the solution space of the dynamic programme can still be pruned in a fashion similar to [16] and [12]. Indeed, the following generalisation of Proposition 6 holds: if the pruning condition is satisfied at step m, then t can be dropped as an option from the dynamic programme for all steps after m + l + w.

Simulation Study
We now compare the performance of MVCAPA to that of other popular methods. In particular, we compare its ROC curves, precision, and runtime with those of PASS [15] and Inspect [26,25]. PASS [15] uses higher criticism in conjunction with circular binary segmentation [19] to detect subset multivariate epidemic changepoints; code is available from the authors' website. Inspect [26] uses projections to find sparse classical changepoints.
The comparison was carried out on simulated multivariate time series with n = 5000 observations for p components with i.i.d. N (0, 1) noise, for a range of values of p. To these, collective anomalies affecting k components, occurring at a geometric rate of 0.001 (leading to an average of about 5 collective anomalies per series), were added. The lengths of these collective anomalies are i.i.d. Poisson-distributed with mean 20. Within a collective anomaly, the start and end lags of each component are drawn uniformly from the set {0, ..., w}, subject to their sum being less than the length of the collective anomaly. Note that w = 0 implies the absence of lags. The means of the components during the collective anomaly are drawn from an N (0, σ²)-distribution. In particular, we considered the following cases, emulating the different detectability regimes introduced in Section 2.3.

3. A regime close to the boundary between sparse and dense changes, i.e. k = 2 when p = 10 and k = 6 when p = 100, with σ = log(p) and w = 0.

4. A regime close to the boundary between sparse and dense changes, but with lagged collective anomalies, i.e. the same as 3 but with w = 10.
This analysis was repeated with 5 point anomalies distributed N (0, 8 log(p)). The log(p)-scaling of the variance ensures that the point anomalies are anomalous even after correcting for multiple testing over the p different components.
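The simulation design above can be sketched as follows (a minimal sketch; the function and parameter names are ours, and the true anomaly locations are returned alongside the data for evaluation):

```python
import numpy as np

def simulate_series(n=5000, p=10, k=2, sigma=1.0, w=0, rate=0.001, rng=None):
    # Sketch of the simulation design described in the text: i.i.d. N(0,1)
    # noise, collective anomalies starting at geometric rate `rate`, i.i.d.
    # Poisson(20) lengths, k affected components with N(0, sigma^2) mean
    # shifts, and start/end lags drawn uniformly from {0, ..., w}.
    rng = np.random.default_rng(rng)
    x = rng.standard_normal((n, p))
    anomalies = []
    t = rng.geometric(rate)
    while t < n:
        length = max(1, rng.poisson(20))
        end = min(t + length, n)
        comps = rng.choice(p, size=k, replace=False)
        for i in comps:
            # redraw lags until their sum fits inside the anomalous window
            while True:
                d, f = rng.integers(0, w + 1, size=2)
                if d + f < end - t:
                    break
            x[t + d:end - f, i] += rng.normal(0.0, sigma)
        anomalies.append((t, end))
        t = end + rng.geometric(rate)
    return x, anomalies
```

With the default rate of 0.001 this produces around five collective anomalies per series of length 5000, matching the design above.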

ROC Curves
We obtained ROC curves by varying the threshold parameters of Inspect and PASS and by rescaling α, β, β 1 , ..., β p for MVCAPA. The curves were obtained over 1000 simulated datasets. For MVCAPA, we typically set w = 0, but also tried w = 10 and w = 20 for the third and fourth settings. We used and rescaled the composite penalty regime (Section 2.2) for w = 0 and penalty regime 2' for w > 0. We also set the maximum segment length for both MVCAPA and PASS to 100 and the minimum segment length of MVCAPA to 2. The α 0 parameter of PASS, which excludes the α 0 − 1 lowest p-values from the higher criticism statistic to obtain better finite sample performance (see [15]), was set to k or 5, whichever was smaller. For MVCAPA and PASS, we considered a detected segment to be a true positive if its start and end points both lie within 20 observations of a true collective anomaly's start and end points respectively. For Inspect, we considered a detected change to be a true positive if it was within 20 observations of a true start or end point. When point anomalies were added to the data, we considered segments of length one returned by PASS to be point anomalies, to make the comparison with MVCAPA fairer.
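The two true-positive matching rules can be written down directly (function names are ours):

```python
def segment_true_positive(detected, truth, tol=20):
    # Rule used for MVCAPA and PASS: a detected segment is a true positive
    # if its start and end both lie within `tol` observations of some true
    # collective anomaly's start and end points respectively.
    s, e = detected
    return any(abs(s - ts) <= tol and abs(e - te) <= tol for ts, te in truth)

def changepoint_true_positive(cp, truth, tol=20):
    # Rule used for Inspect: a detected changepoint is a true positive if it
    # lies within `tol` observations of a true start or end point.
    return any(abs(cp - b) <= tol for ts, te in truth for b in (ts, te))
```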
The results for three of the settings considered can be found in Figures 3 to 5. The qualitatively similar results for the second setting can be found in the supplementary material. We can see that Inspect usually does worst, especially when changes become dense, which is no surprise given that the method was introduced to detect sparse changes. We additionally see that MVCAPA generally outperforms PASS. This advantage is particularly pronounced in the case in which exactly one component changes, a setting which PASS has difficulty with due to the convergence properties of the higher criticism statistic at the lower tail [15]. PASS outperformed MVCAPA in the second setting for p = 10, when it was assisted by a large value of α 0 , which considerably reduced the number of candidate collective anomalies it had to consider. Figures 4 and 5 show that MVCAPA performs best when the correct maximal lag is specified. They also demonstrate that specifying a lag when none is present, and thereby overestimating the maximal lag, adversely affects the performance of MVCAPA. However, when lags are present, overestimating the maximal lag appears preferable to underestimating it. Finally, the comparison between Figures 5c and 5d shows that the performance of MVCAPA is hardly affected by the presence of point anomalies, unlike that of Inspect and, to a lesser extent, PASS.

Precision
We compared the precision of the three methods by measuring the accuracy (in mean absolute distance) of true positives. Only true positives detected by all methods were taken into account, to avoid selection bias. We used the default parameters for MVCAPA and PASS, whilst we set the threshold for Inspect to a value leading to a comparable number of true and false positives. To ensure a suitable number of true positives for Inspect, we doubled σ in the second scenario. The results of this analysis can be found in Figure 6 and show that MVCAPA is usually the most precise approach, exhibiting a significant gain in accuracy over PASS. When comparing the influence of the user-specified maximal lag, we note that specifying the correct maximal lag gives the best performance. We also note that over- and under-estimating the lag have similar effects on the accuracy.
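As a sketch, assuming the accuracy measure averages the absolute start- and end-point errors over matched true positives (our reading of "mean absolute distance"; names are ours):

```python
def endpoint_accuracy(matches):
    # matches: list of (detected, true) segment pairs, restricted to true
    # positives found by all methods. Returns the mean absolute distance
    # between detected and true endpoints, averaged over both endpoints.
    errs = [abs(ds - ts) + abs(de - te) for (ds, de), (ts, te) in matches]
    return sum(errs) / (2 * len(errs))
```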

Runtime
We compared the scaling of the runtime of MVCAPA, PASS, and Inspect in both the number of observations n and the number of components p. To evaluate the scaling in n, we set p = 10 and varied n on data without any anomalies. We repeated this analysis with collective anomalies appearing (on average) every 100 observations. The results of these two analyses can be found in Figures 7a and 7b respectively. We note that the slope for MVCAPA is very close to 2 in the anomaly-free setting and very close to 1 in the setting in which the number of anomalies increases linearly with the number of observations, suggesting quadratic and linear behaviour respectively, whilst the slopes of PASS and Inspect are close to 2 and 1 respectively in both cases.
Turning to the scaling of the three methods in p, we set n = 100 and varied p. The results of this analysis can be found in Figure 7c. We note that the slopes of all methods are close to 1 suggesting linear behaviour. However, Inspect becomes very slow once p exceeds a certain threshold.
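The slopes quoted in this subsection are of the following form: the gradient of a least-squares fit on the log-log scale (a minimal sketch, not the authors' benchmarking code):

```python
import numpy as np

def empirical_order(sizes, runtimes):
    # Fit log(runtime) = a + b * log(size) by least squares; the slope b
    # estimates the growth order: b near 1 suggests linear scaling,
    # b near 2 suggests quadratic scaling.
    slope, _intercept = np.polyfit(np.log(sizes), np.log(runtimes), 1)
    return slope
```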

Application
We now apply MVCAPA to extract copy number variations (CNVs) from genetics data. The data consist of a log-likelihood ratio statistic evaluated along the genome which measures the likelihood of a CNV. A multivariate approach to detecting CNVs is attractive because they are often shared across individuals; by borrowing signal across individuals we should gain power for detecting CNVs which have a weak signal. However, as will become apparent from our results, shared variations do not always align perfectly.
In this section we re-use the design of [2] to compare MVCAPA with PASS. The only difference is that we set the maximum segment length for MVCAPA and PASS to 100, whilst [2] used 200. To investigate the potential benefit of allowing for lags, we repeated the experiment for MVCAPA both with w = 0 (i.e. not allowing for lags) and w = 40. Since n >> p in this application, we used the sparse penalty setting for MVCAPA.
The exact ground truth is unknown. Indeed, it is beyond the scope of this paper to differentiate between a false positive and a currently unknown CNV. We can nevertheless compare different methods by how accurately they detect known copy number variations for a given size. Like [2], we used known CNVs from the HapMap project [7] as true positives and tuned the penalties and thresholds in such a way that 4% of the genome was flagged up as anomalous. For MVCAPA this involved scaling the penalties α, β 1 , ..., β p by a constant, as discussed in the final paragraph of Section 2.2.
The results of this analysis can be found in Figure 8 and show that MVCAPA compares favourably with PASS. We can also see that allowing for lags generally led to better performance of MVCAPA, suggesting imperfect alignment of CNVs across individuals. Moreover, MVCAPA was very fast, taking 5 seconds to analyse the longer genome on a standard laptop when we did not allow for lags, and 10 seconds when we did. The R implementation of PASS, on the other hand, took 17 minutes.

Proof of Proposition 1
The probability that the segment (i, j) is not flagged up as anomalous is given by where the inequality follows from the bounds on the chi-squared distribution in [17]. A Bonferroni correction over all possible pairs i, j then finishes the proof.

Proof of Proposition 2
, noting that they are all independent. The probability that this segment (i, j) will not be considered anomalous is for all λ > 0, where the second inequality corresponds to a Chernoff bound. Next, set λ = 1 where the first inequality exploits well-known tail bounds of the normal distribution. Similarly, for p = 1: A Bonferroni correction over all possible pairs i, j then finishes the proof.
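For reference, the generic Chernoff step used in these proofs can be sketched as follows (a standard calculation, not the paper's exact display): for Y ∼ χ²₁,

```latex
% Chernoff bound for a chi-squared(1) variable, valid for 0 < \lambda < 1/2,
% using the chi-squared MGF \mathbb{E}\,e^{\lambda Y} = (1-2\lambda)^{-1/2}:
\mathbb{P}(Y > t)
  \;\le\; e^{-\lambda t}\,\mathbb{E}\!\left[e^{\lambda Y}\right]
  \;=\; e^{-\lambda t}\,(1 - 2\lambda)^{-1/2},
```

and optimising over λ yields exponential tail bounds of the form invoked above.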
We will now use the following lemma, which shows that (Y c − a m ) + is sub-gamma.
Using Lemma 1 and the bounds on sub-gamma random variables in [4], we have that A Bonferroni correction over all possible pairs i, j then finishes the proof.

Proof of Proposition 4
Proof of Proposition 4: We will show that the penalised saving for the true anomalous segment is positive with probability converging to 1 as p increases. By the definition of signal strength, the distribution of the true anomalous segment's penalised saving does not depend on the length, e − s, of the segment. Thus, we assume, without loss of generality, that e = s + 1 and treat the cases 0 < ξ ≤ 1/2, 1/2 < ξ < 3/4, and 3/4 < ξ < 1 separately. We write X i := x (i) e , for 1 ≤ i ≤ p.
Case 1: 0 < ξ ≤ 1/2. Remember that the composite penalty used is the minimum of regimes 1, 2, and 3. It is therefore sufficient to show that the saving will exceed the penalty specified by one of these three regimes (regime 1 in this case). By definition, Furthermore, for all k such that 1 ≤ k ≤ p. We therefore only have to show that there exists some sequence of integers k p such that the right-hand side converges to 1 as p → ∞. Note that Hoeffding's inequality implies that Setting k p = p^(1−ξ)/2, it is therefore sufficient to show that converges to 1 as p tends to infinity. This is the case if r p < (1/2)(1/2 − ξ), which finishes the proof.
Case 2: 3/4 < ξ < 1. By an argument similar to that made for case 1, it is sufficient to show that the saving will exceed the penalty specified by regime 2. We have that β i ≥ P(X (1) > 2ψ + 2 log(p)) = 1 − (1 − P(X 1 > 2ψ + 2 log(p)))^p. By definition, X 1 = (µv 1 + ε 1 )², where ε 1 ∼ N(0, 1) and v 1 ∼ Ber(p^(−ξ)). We can therefore bound the above by where the second inequality follows from the fact that 1 − x ≤ e^(−x). We consider separately the cases 2ψ + 2 log(p) − 2r p log(p) > 1 and 2ψ + 2 log(p) − 2r p log(p) ≤ 1. In the latter case, the above clearly converges to 1 as p goes to infinity. In the former case, we can use the lower tail bound on P(N(0, 1) > x), for x > 0, to bound the above by Thus, for a fixed r p > (1 − √(1 − ξ))², this converges to 1, as ψ/log(p) converges to 0.
Case 3: 1/2 < ξ < 3/4. By an argument similar to that made for case 1, it is sufficient to show that the saving will exceed the penalty specified by regime 3. We assume, without loss of generality, that µ > 0. If r p ≥ 1/4. Our approach is to define a threshold, b, and a number of excesses, k̃, such that the number of savings in cost that exceed b will be greater than k̃ with probability going to 1 as p increases. We then show that the overall sum of the k̃ largest savings will be greater than the penalty for fitting k̃ components as anomalous.
We introduce the following new random variable: where (x) + denotes the positive part of x. Note that Y i ≤ X i . We also introduce the following four technical lemmata.
Lemma 2. Let a > 0 and let Z ∼ χ² 1 . Then, for all positive x ∈ R, P(Y i ≥ a + x | Y i ≥ a) ≥ P(Z > a + x | Z ≥ a).
Lemma 4. Let a k be defined implicitly by P(χ² 1 > a k ) = k/p and let f(·) denote the probability density function of the χ² 1 distribution. Then Next, write b = 8r p log(p) and let k̃ be an integer such that both pP(χ² 1 > b) ≤ k̃ ≤ p and a k̃ < b. Note that since r p < 1/4, we have b ≤ 2 log(p), and such a k̃ is guaranteed to exist for sufficiently large values of p. For convenience, write µ̃ = E(χ² 1 − a k̃ ) + . The following holds: Y (i) ≥ 2ψ + 2 log(p) + k̃a k̃ + pµ̃ + 2√((k̃a k̃ + pµ̃)(ψ + 2 log(p))), where the first inequality follows from substituting the third penalty regime (using the equality from Lemma 4) and the second inequality follows from conditioning on the number of Y i exceeding b. Next, let Z 1 , ..., Z k̃ be i.i.d. χ² 1 -distributed. Lemma 2 then implies that the above exceeds Using the inequality in Lemma 4 and the fact that ψ < log(p) for sufficiently large values of p, we can further bound the above by , and we can further bound the above by , provided p is large enough. Next, note that since b ≥ a k̃ , Consequently, we can bound (12) by , where the inequality follows from Lemma 3. Given Lemma 5 and the fact that Lemma 4 implies that P(χ² 1 > b + 2bf(b)) < 2P(χ² 1 > b)(1 + log(p)), we can further bound the above by The arithmetic mean-geometric mean inequality can be used to show that ((a − b) + )² > ((a) + ) The above quantity is therefore bounded by Note that if k̃ ≥ pP(χ² 1 > b) + p^(1/2 − 2r p + δ) for some δ > 0, then , with the first inequality following from the fact that the left-hand side is increasing in k̃, the second one following from the fact that P(χ² 1 > b) < P(χ² 2 > b) = p^(−4r p ), and the last one following from the fact that r p < 1/4. Consequently, it is sufficient to show that there exists a δ > 0 such that This can be seen from the fact that Note that by standard tail bounds of the normal distribution and the definition of b. Standard Hoeffding bounds show that converges to 1 as p → ∞.
Note that for all δ such that r p − ξ + 1/2 > δ, provided p is large enough. This follows from the fact that by standard tail bounds on the normal distribution. Since 0 < r p , the above dominates p^(1−ξ−4r p ) as p increases. Similarly, because r p − ξ + 1/2 > 0, it dominates p^(1−4r p ); since r p < 1/4 and ξ < 3/4, 1 − r p − ξ > 0, and the above therefore also dominates p^(1−r p −ξ). Finally, if δ is such that r p − ξ + 1/2 > δ, it must also dominate p

We give the proof of Proposition 8, as Proposition 6 corresponds to a special case. We write The proof of Proposition 8 is then a corollary of the observation that for all d, f < w, Indeed, the above inequality shows that This finishes the proof.

Proof of Proposition 7
The proof of Proposition 7 is similar to that of Proposition 2. Let 1 ≤ i ≤ j ≤ n. For this pair (i, j), define: noting that they are all χ² 1 -distributed. Then define Y c = max Y (d,f) c and note that Y 1 , ..., Y p are independent. The probability that this segment (i, j) will not be considered anomalous is for all λ > 0, with the second inequality being a Chernoff bound. Now set λ = 1/(2(1 + ε)) and note that The following lemma holds: let ∼ N(0, 1) for i ≤ t ≤ j; then there exists a constant Ã such that Using this lemma and setting b = √(1 + ε), we can bound (13) by 1 + 2Ã(w + 1) 1 + log(w + 1) log(1 + ε) where the second inequality follows from the fact that 1 − x ≤ e^(−x). Note that where the first inequality follows from the fact that 1 + x ≤ e^x, the second inequality follows from the fact that the expression is monotone in log(w + 1), and the third inequality follows from the fact that max(a, b) ≤ a + b for all positive a, b. The above result and the fact that p ≥ 1 imply that the probability of false positives can further be bounded by where the inequality follows from the facts that log(1 + ε) ≥ ε/(1 + ε) and 1 ≤ √(1 + ε). A Bonferroni correction over all possible pairs i, j then finishes the proof.

Proof of Theorem 1
We define the penalised cost of a segment x i:j under a partition τ̂ = {τ̂ 1 , ..., τ̂ K̂ }, where τ̂ k = (ŝ k , ê k , Ĵ k ), to be Here the penalised cost of introducing the kth anomalous window is It should be noted that minimising the penalised cost is equivalent to maximising the penalised saving. We call the partition which minimises the penalised cost, C(x 1:n , τ̂), over all feasible partitions τ̂ the optimal partition.
We also define the following event sets over all pairs i, j such that 1 ≤ i ≤ j ≤ n, where the set W i,j of components with non-constant mean in the interval [i, j] is defined as The intuition behind these events is as follows: events E 1 and E 2 bound the saving obtained from fitting an anomalous region on data belonging to the typical distribution, and so ensure that no false positives are fitted. Events E 7 , E 8 , E 9 , and E 11 provide bounds on the additional un-penalised cost of splitting a fitted segment in two or merging two existing segments, ensuring that anomalous regions are fitted by one rather than multiple adjacent segments. They are assisted by events E 3 and E 5 , which bound the additional un-penalised cost incurred by fitting any given segment by a dense change, extending the result to showing the sub-optimality of a collective anomaly being fitted by multiple non-adjacent segments. Events E 4 , E 6 , and E 10 bound the interaction between the signal and the noise, thus ensuring that anomalous regions are detected. For brevity, we denote E = ∩E i and note that it occurs with high probability. Indeed, the following lemma holds: there exists a constant A such that We now define the set of good partitions B C to be It is sufficient to prove the following proposition in order to prove Theorem 1.
Proposition 9. Let the assumptions of Theorem 1 hold. Given that E holds and C exceeds a global constant, the partition τ 0 minimising the penalised cost C(x 1:n , τ) satisfies τ 0 ∈ B C .
The main ideas of the proof of Proposition 9 are that, given E: I Each fitted anomalous segment overlaps with at most one true anomalous segment.
II Each fitted anomalous segment overlaps with at least one true anomalous region.
III Each true anomalous segment overlaps with at most one fitted anomalous region, i.e. there exists a bijection between fitted and true segments.
IV Each fitted anomalous segment is close (in the sense of B C ) to the true segment it fits.
We will prove these properties in the following order: First we will prove property II, then IV, then III, and then I. We will then use these to prove Proposition 9. In the subsequent proofs we will use a certain number of technical Lemmata, all proved in the supplementary material.
Throughout these proofs we will use the following two lemmata. The first one describes the increase in unpenalised cost incurred by splitting a fitted segment into two fitted segments and the second one bounds this increase in penalised cost for splitting fitted dense collective anomalies.
Lemma 9. Let i ≤ j < j + 1 ≤ j′. The following holds given E, provided C exceeds some global constant.
We will also use the following lemma which shows that merging two adjacent fitted collective anomalies which are both contained within a true anomalous segment reduces the penalised cost substantially.
Lemma 10. Let i, j′, and j″ be such that there exists a k such that s k < i ≤ j′ < j′ + 1 ≤ j″ ≤ e k . Then, and C(x i:j″ , 1) ≤ C(x i:j′ , 1) + C(x (j′+1):j″ , 1) − (79/80)C(ψ + √(pψ)) when |J k | ≤ k * and |J k | > k * respectively, provided C exceeds some global constant and the event E holds.
The proof of part IV will mostly rely on the following three lemmata. The first one shows that fitting a true collective anomaly as anomalous reduces the penalised cost. The second and third one show that if a fitted sparse or dense collective anomaly contains a large number of observations both from a true anomalous segment and from a typical segment, then removing the typical data from the fitted anomaly reduces the penalised cost.
Lemma 11. Let i and j be such that there exists a k such that s k < i ≤ j ≤ e k . Moreover, assume that Then, given E, C(x i:j , J k ) < 0 holds if the kth anomalous window is sparse, i.e. if |J k | ≤ k * ; and C(x i:j , 1) < 0 holds if the kth anomalous window is dense, i.e. if |J k | > k * ; provided C exceeds some global constant.
Lemma 12. Let i and j be such that there exists a k such that either of the following holds: Then the corresponding statement holds given E: 1. if the kth anomalous window is sparse, i.e. if |J k | ≤ k * , C(x i:j , J k ) ≥ C(x i:e k , J k ) + 6C(ψ + log(p)); 2. if the kth anomalous window is dense, i.e. if |J k | > k * , C(x i:j , 1) ≥ C(x i:e k , 1) + 6C(ψ + √(pψ))

if the kth anomalous window is sparse
provided C exceeds some global constant and the event E holds.
Lemma 13. Let i and j be such that there exists a k such that the kth anomalous window is dense, |J k | > k * , and either of the following holds: Then the corresponding statement holds for all J, given E
For Part II, we will require the following six lemmata. The first one proves that merging two fitted collective anomalies contained within a truly anomalous segment reduces the overall penalised cost substantially, even if they are non-adjacent. The second one shows that if a fitted collective anomaly contains both typical and atypical data, then the typical data can be removed from the fitted collective anomaly without increasing the penalised cost too much. The remaining lemmata are mostly used to show that if a true anomaly has been fitted using the wrong set of components (i.e. fitting a sparse anomaly as a dense one, a dense anomaly as a sparse one, or a sparse anomaly as a sparse anomaly but with an incorrect set of components), then it is possible to replace this fitted collective anomaly by one with the right components without increasing the overall penalised cost by too much.
Lemma 14. Let i, j′, and j″ be such that there exists a k such that s k < i ≤ j′ < j′ + 1 ≤ j″ ≤ e k . Then, and when |J k | ≤ k * and |J k | > k * respectively, provided C exceeds some global constant and the event E holds.
Lemma 16. Let E hold and C exceed a global constant. Moreover, let i and j be such that there exists a k such that s k < i ≤ j ≤ e k . Then S(x i:j , J) ≥ α (Cψ + C|J| log(p)) for some α > 0 implies that for any sparse J.
Lemma 17. Let i and j be such that there exists a k such that s k < i ≤ j ≤ e k . If the kth anomalous window is sparse; i.e. if |J k | ≤ k * ; and holds for all sparse J, i.e. J satisfying |J| ≤ k * , if C is larger than some global constant and the event E holds.
Lemma 18. Let i and j be such that there exists a k such that s k < i ≤ j ≤ e k . If the kth anomalous window is dense; i.e. if |J k | > k * ; and holds for all sparse J, i.e. J satisfying |J| ≤ k * , if C is larger than some global constant and the event E holds.
Lemma 19. Let the event E hold. Moreover, let i and j be such that there exists a k such that s k < i ≤ j ≤ e k . Then, if the kth anomalous window is sparse, i.e. if |J k | ≤ k * , holds if C is larger than some global constant.
For Part I we will then require the following lemmata, which are again concerned with bounding the increase in penalised cost for replacing fitted segments with the wrong number of components by fitted segments with the right number of components.
Lemma 20. Let i and j be such that there exists a k such that s k < i ≤ j ≤ e k . If the kth anomalous window is dense, i.e. if |J k | > k * , and S(x i:j , J) ≥ (3/10)C(|J| log(p) + ψ), then holds for all sparse J, i.e. J satisfying |J| ≤ k * , if C is larger than some global constant and the event E holds.

Lemma 21.
Let i and j be such that there exists a k such that s k < i ≤ j ≤ e k . If the kth anomalous window is sparse, i.e. if |J k | ≤ k * , and holds for all sparse J, i.e. J satisfying |J| ≤ k * , if C is larger than some global constant and the event E holds.

Property III
We can prove that a fitted segment must overlap with at least one true anomalous segment. Proof of Proposition 10: By contradiction: if (s, e, J) overlaps with no true anomalous segment, it can be shown that the partition τ \ (s, e, J) has a lower penalised cost than τ, because of E 1 if J is sparse and E 4 if J is dense.

Property IV
We now prove the following proposition, which shows that if each true anomalous region is fitted by exactly one segment, then the boundaries of that segment are close to the boundaries of the corresponding anomalous region. To this end, we define the set of partitions T 1 as the set of all partitions fitting exactly K anomalous segments in such a way that each fitted anomalous segment overlaps with exactly one true anomalous region and each true anomalous region overlaps with exactly one fitted anomalous segment. More formally, The following proposition then holds: Proposition 11. Let the assumptions of Theorem 1 hold. Given E, if a partition τ ∈ T 1 is optimal it must also satisfy τ ∈ B C , if C exceeds a global constant.
Proof of Proposition 11: Let τ be optimal. Consider the kth true anomalous segment [s k + 1, e k ], which τ fits with the segment (ŝ k ,ê k ,Ĵ). We begin by showing this fitted segment needs to cover most of the true anomalous region, because otherwise adding an additional segment to τ would reduce the penalised cost.
Indeed, ŝ k ≤ s k + 10C The next step consists of showing that (ŝ k , ê k , Ĵ) does not extend too far beyond the kth anomalous region. Our approach consists of using Lemmata 12 and 13 to show that if this were to happen, we could replace (ŝ k , ê k , Ĵ) by a different fitted segment in a way which reduces the penalised cost. We will just show that ê k ≤ e k + 10C , as a segment extending further would contain observations both from the typical distribution and the kth anomalous window. It is possible to show that this partition can be replaced by splitting up [ŝ k + 1, ê k ] in such a way that the penalised cost is reduced.
• If J k is sparse, the cases Ĵ = 1 and |Ĵ| ≤ k * have to be considered separately. If Ĵ = 1, with the inequality following from E 2 and the fact that splitting a segment does not increase the un-penalised cost. Lemma 12 then shows that the above quantity exceeds thus contradicting the optimality of τ. Similarly, if |Ĵ| ≤ k * , with the second inequality following from E 1 and the fact that splitting a segment does not increase the un-penalised cost. The third inequality holds for large enough values of C. As before, Lemma 12 shows that the above quantity exceeds thus contradicting the optimality of τ.

Property II
We now prove that if all fitted segments of the optimal partition overlap with at most one true anomalous segment, then each true anomalous segment must overlap with exactly one fitted segment. To this end, we now define T 2 as the set of partitions in which each fitted anomalous segment overlaps with exactly one truly anomalous region. More formally we define and note that T 1 ⊂ T 2 . The following proposition holds: Proposition 12. Let the assumptions of Theorem 1 hold. Given E, if a partition τ ∈ T 2 is optimal it must also satisfy τ ∈ T 1 if C exceeds a global constant.
Proof of Proposition 12: The proof has two parts: 1. We need to show that the optimality of τ implies that each true anomalous segment overlaps with at least one fitted segment in τ .
2. We need to show that the optimality of τ implies that each true anomalous segment overlaps with at most one fitted segment in τ .
We prove both statements by contradiction. First, assume that τ is optimal but that there exists a k such that [s k + 1, e k ] is not covered at all by any fitted segment in τ. Then, by Lemma 11, the partition τ ∪ (s k , e k , J k ), if the kth change is sparse, or τ ∪ (s k , e k , 1), if the kth change is dense, has a lower penalised cost than τ, thus contradicting its optimality. Now assume that there exists a k such that τ contains two or more fitted segments overlapping with [s k + 1, e k ]. We will show that it is possible to merge any two fitted segments (called (a, b, J 1 ) and (c, d, J 2 ), where c ≥ b without loss of generality) in a way which reduces the total penalised cost, thereby contradicting the optimality of τ. In order to do so, we define a′ = max(s k , a) and d′ = min(e k , d). The following two cases have to be considered separately, but share the following idea: merging (a, b, J 1 ) and (c, d, J 2 ) into a single fitted segment increases the un-penalised cost by at most O(√C). At the same time, merging reduces the penalty by O(C). Hence, if C is large enough, merging reduces the overall penalised cost.
1. J k is dense: We will show that replacing (a, b, J 1 ) and (c, d, J 2 ) with (a′, d′, 1) reduces the penalised cost. Lemma 14 implies that it is sufficient to show that We limit ourselves to proving the first statement, as the second one can be proven via a symmetrical argument. If J 1 = 1, the statement follows directly from Lemma 15. If |J 1 | ≤ k * , we first note that Lemma 15 implies that By the optimality of τ, C(x a:b , J 1 ) < 0 must hold. This implies that
2. J k is sparse: We will show that replacing (a, b, J 1 ) and (c, d, J 2 ) with (a′, d′, J k ) reduces the penalised cost. Lemma 14 implies that it is sufficient to show that The proofs of the two statements are symmetrical; we therefore only prove the first one. As before, we begin by considering the case J 1 = 1. We have where the first inequality follows from Lemmata 15 and 19, while the second inequality holds if C exceeds a fixed constant. Turning to the case in which |J 1 | ≤ k * , we note that the same strategy of proof used for the case in which J k is dense can be reapplied, the only difference being that Lemma 17 has to be used instead of Lemma 18.

Property I
We will now prove that an optimal partition can not contain a fitted segment overlapping with more than one true anomalous segment. We formalise this in the following Proposition: Proposition 13. Let the assumptions of Theorem 1 hold. Let τ be an optimal partition. Then, τ ∈ T 2 , given that the event E holds and that the constant C exceeds a global constant.
Note that this result trivially holds when K = 1. In order to prove this proposition, we will use a variation of Proposition 12. For this, we introduce the set of fitted sparse segments which either begin or end at the start of a true anomalous segment and only contain a small fraction of the true anomalous segment Note that Proposition 14 does not assume that τ is optimal. Using these two propositions it is easy to derive the following. Proof of Proposition 13: Assume that the optimal partition τ is such that τ ∉ T 2 . Then, by Proposition 15, there exists a partition τ′ ∈ T 2 such that C(x 1:n , τ′) > C(x 1:n , τ) − (11/20)[Σ (s,e,J)∈τ′∩A 1 (Cψ + C|J| log(p)) + Σ (s,e,1)∈τ′∩A 2 ] Moreover, Proposition 14 implies that there exists another partition τ″ ∈ T 2 such that C(x 1:n , τ″) ≥ C(x 1:n , τ′) + (6/10)[Σ (s,e,J)∈τ′∩A 1 (Cψ + C|J| log(p)) + Σ (s,e,1)∈τ′∩A 2 ] which contradicts the optimality of τ. Proof of Proposition 14: Proposition 12 shows that fitting an anomalous region with two segments, or with one very short segment leaving most of the anomalous region uncovered, is sub-optimal. This proposition goes further by showing that it is sub-optimal by at least O((6/10)C). Crucially, this is larger than O((1/2)C) and will help us break up fitted segments spanning multiple anomalous regions. The proof of this proposition is similar in flavour to the proof of the second part of Proposition 12. The main idea is that there are at most two fitted segments in τ ∩ (A 1 ∪ A 2 ) overlapping with the kth true anomalous region. These segments therefore leave at least 20C 2 k of the kth anomalous region uncovered. Therefore, if no other segment in τ overlaps with the kth anomalous region, one can be added without increasing the penalised cost. It can then be merged with the fitted segments in τ ∩ (A 1 ∪ A 2 ) which overlap with the kth true anomalous region. This yields a new partition still in T 2 with the claimed reduction in penalised cost.
Since τ ∈ T_2, we can consider each of the K true anomalous regions separately. We define the set of fitted segments in τ which overlap with the kth anomalous region to be τ_k. Proving the full result is therefore equivalent to proving the existence of a τ'_k which yields the required reduction in penalised cost. The following three cases are possible:

1. |τ_k ∩ (A_1 ∪ A_2)| = 0, which happens when τ does not contain a short fitted segment at either the beginning or the end of the kth anomalous region. No further transformation is required in this case, i.e. τ'_k = τ_k.
2. |τ_k ∩ (A_1 ∪ A_2)| = 1, i.e. τ contains a short fitted segment at exactly one end of the kth anomalous region.
3. |τ_k ∩ (A_1 ∪ A_2)| = 2, i.e. τ contains short fitted segments at both ends of the kth anomalous region.

We will only explicitly describe the transformation for the second case, as applying it twice yields a transformation for the third case. Without loss of generality, we further assume that τ_k ∩ (A_1 ∪ A_2) = {(s, e_k, J)}, i.e. that the short fitted segment lies at the end of the kth anomalous window. A first special case can be treated very quickly. If |J| ≤ k* and C(x_{(s+1):e_k}, J) ≥ (6/10)C(ψ + |J| log(p)), removing (s, e_k, J) from τ_k is sufficient. If |J| ≤ k* and C(x_{(s+1):e_k}, J) < (6/10)C(ψ + |J| log(p)), we nevertheless have corresponding bounds when J_k is sparse (as a direct consequence of Lemma 19) and, trivially, when J_k is dense. Consequently, if the next fitted change in τ_k to the left of (s, e_k, J) is of the form (s̃, ẽ, J_k) if J_k is sparse, or (s̃, ẽ, 1) if J_k is dense, for some s̃ ≥ s_k, Lemma 14 shows that the required reduction in penalised cost can be obtained by merging these two fitted segments. If there is no other fitted change in τ_k, or if the next fitted segment in τ_k to the left of (s, e_k, J) is (s̃, ẽ, J̃), where ẽ satisfies s − ẽ ≥ 10C/∆_k², Lemma 11 implies that adding a new fitted segment ending at s, with component set J_k if J_k is sparse or the dense alternative if J_k is dense, does not increase the penalised cost. Lemma 14 can then be applied as before to show that merging this new fitted segment with (s, e_k, J) yields a new partition exhibiting the required reduction in penalised cost.
Hence, in order to finish proving the result, we only need to show that any (s̃, ẽ, J) ∈ τ_k can either be removed without increasing the penalised cost, or replaced by (max(s̃, s_k), ẽ, J_k) in a way which increases the penalised cost by at most (5/40)C(|J_k| log(p) + ψ) if J_k is sparse, or by (max(s̃, s_k), ẽ, 1) in a way which increases the penalised cost by at most (5/40)C(√(pψ) + ψ) if J_k is dense. This, however, was already shown in the proof of Proposition 12. This finishes the proof.
Proof of Proposition 15: If τ ∈ T_2, the result trivially holds. In order to prove the result when τ ∉ T_2, we consider all possible fitted segments (s, e, J) ∈ τ \ T_2 which overlap with at least two anomalous regions and show that: 1. No such segment can overlap a true dense change (the kth, say) by more than 10C/∆_k² observations, as this would contradict the optimality of τ.
2. All other fitted segments overlapping with at least two anomalous regions, potentially including a number of sparse changes overlapped by more than 10C/∆_k² observations, can be replaced by fitted segments, each overlapping with exactly one true anomalous segment, in a way which strictly bounds the increase in penalised cost as stipulated by the proposition.

1)
First of all, we can show that the optimality of τ implies that no segment (s, e, J) ∈ τ \ T_2 can overlap a dense change (the kth change, say) by more than 10C/∆_k² observations belonging to the kth anomalous window. Lemma 13 shows that such a segment could be replaced in a way which reduces the penalised cost by at least 4Cψ + 4C√(pψ). Overall, we would thus obtain a new partition with a lower penalised cost than τ, contradicting the optimality of τ.
2) Consider now a segment (s, e, J) ∈ τ \ T_2 not overlapping with any dense change by more than 10C/∆_k² observations, and note that J_k is sparse whenever k ∈ D_{s,e} for some (s, e, J) ∈ τ \ T_2. We have to consider the following four scenarios:

1. The beginning of the fitted segment (s, e, J) ∈ τ \ T_2 overlaps with a true anomalous region [s_k + 1, e_k], but does so by fewer than 10C/∆_k² observations.
2. The end of the fitted segment (s, e, J) ∈ τ \ T_2 overlaps with a true anomalous region [s_k + 1, e_k], but does so by fewer than 10C/∆_k² observations.
3. Both 1 and 2 apply.
4. Neither 1 nor 2 applies.

Note that this allows for the beginning and/or the end of (s, e, J) ∈ τ \ T_2 to lie in a truly anomalous region, provided the overlap with that region exceeds the critical threshold of 10C/∆_k².
We then replace (s, e, J) in τ to obtain a new partition τ̃, defined according to which of the cases above applies; only the number of fitted segments belonging to A_1 and/or A_2 depends on the case. Since applying this transformation for all (s, e, J) ∈ τ \ T_2 leads to a new partition τ' which is contained in T_2, it is sufficient, in order to prove the proposition, to show that each transformation individually increases the penalised cost by strictly less than (11/20)C(ψ + |J| log(p)) if J is sparse, or (11/20)C(ψ + √(pψ)) if J is dense, depending on the case. The fourth case follows directly from the following Lemma:

Lemma 22. Let the event E hold and let C exceed some global constant. Let s and e be such that the fourth scenario applies.
Then, the following holds true for all sparse J:

C(x_{s:e}, J) ≥ (19/20)C(ψ + |J| log(p)) + Σ_{k ∈ D_{s,e}} C(x_{(s_k+1):e_k}, J_k).

Moreover, the analogous statement for dense J is also true. This Lemma can also be used to bound the increase in penalised cost obtained for the other three cases. The only difference is that (s, e, J) is first split, up to twice, in order to remove the short overlap with the true anomalous region at the beginning and/or the end. For the sake of brevity, we limit ourselves to writing out the proof for the third case, for which the result is tightest. If J is sparse, the claimed bound holds, where the inequality follows from Lemma 22. Similarly, if J = 1 is dense, the bound holds, where the inequalities follow from Lemma 22, E_9, and C exceeding a global constant. This finishes the proof.

Proof of Lemma 1
The MGF of Z = (x − a)_+ is computed first. Evaluating the derivative of its logarithm at λ = 0 shows that the mean of Z is indeed μ = 2af(a) + (1 − a)P(χ²_1 > a). Next, using the substitution y = x − 2λa yields the intermediate identity. We can now use this result to further bound the MGF of the truncated χ²_1. We consider two cases separately. The lower bound in (17) shows that (d/dλ) log E(e^{λZ}) − μ is bounded by the first expression, while the upper bound in (17) shows that (d/dλ) log E(e^{λZ}) − μ is bounded by the second. Using the fact that μ + (1 − a) − 2λa P(χ²_1 < a) < 0 and that 1 − μ ≥ 0, we can bound this further. Since λ < 1/2 and 1 − a − μ ≤ 0, we obtain the next bound, where the last inequality follows from the fact that E(χ²_1 | χ²_1 < a) ≤ a/2, which is due to the fact that the pdf of the χ²_1-distribution is decreasing. Consequently, the claimed bound on the MGF holds, which finishes the proof.
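The closed form for μ can be sanity-checked numerically. The sketch below is illustrative only: the threshold a = 2, the sample size, and the seed are arbitrary choices, and the χ²₁ density and survival function are written out via `math.erfc` rather than taken from the paper.

```python
import math
import numpy as np

# Monte Carlo check of mu = 2 a f(a) + (1 - a) P(chi^2_1 > a),
# the mean of Z = (X - a)_+ for X ~ chi^2_1 (Lemma 1).
a = 2.0                                                 # arbitrary threshold
pdf_a = math.exp(-a / 2) / math.sqrt(2 * math.pi * a)   # chi^2_1 pdf at a
sf_a = math.erfc(math.sqrt(a / 2))                      # P(chi^2_1 > a)
mu = 2 * a * pdf_a + (1 - a) * sf_a

x = np.random.default_rng(0).chisquare(df=1, size=2_000_000)
empirical = np.maximum(x - a, 0.0).mean()
print(abs(empirical - mu) < 1e-2)  # agreement up to Monte Carlo error
```

The survival function uses P(χ²₁ > a) = P(|N(0,1)| > √a) = erfc(√(a/2)), so the check needs no statistics library.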

Proof of Lemma 2
It is sufficient to show the inequality below. The derivative of its left-hand side with respect to μ is greater than 0, since the hazard rate of the Gaussian distribution is increasing. Hence, the result follows.

Proof of Lemma 3
Let Z ∼ χ²_1 and write μ = E(Z − a)_+. The MGF G(λ) of the random variable W can be bounded as follows. Consequently, we must also have the corresponding bound, which proves that W is sub-Gaussian. Standard tail bounds for sub-Gaussian random variables then imply that independent random variables W_1, ..., W_k obeying the same law as W satisfy the stated concentration inequality for all positive integers k and all t ∈ R. This finishes the proof.

Proof of Lemma 4
The equality follows from Lemma 1. To prove the inequality, write G(τ) = τ + 2af(a), where 0 ≤ τ ≤ 1 and a is defined by the equation P(χ²_1 > a) = τ. Note that G(0) = 0 and that dG/dτ = a ≥ 0. Hence, m + 2paf(a) = pG(m/p) is increasing in m. Moreover, the stated bounds on a hold, and therefore we obtain the claimed inequality. Noting that m + 2paf(a) = pG(m/p) finishes the proof.
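The monotonicity of pG(m/p) = m + 2paf(a) can likewise be verified numerically. In the sketch below, the choice p = 100 and the bisection-based inversion of the χ²₁ survival function are assumptions made for illustration, not part of the paper.

```python
import math

def G(tau):
    # a is defined by P(chi^2_1 > a) = tau; invert the survival
    # function sf(a) = erfc(sqrt(a/2)) by bisection.
    lo, hi = 1e-12, 60.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.erfc(math.sqrt(mid / 2)) > tau:
            lo = mid
        else:
            hi = mid
    a = 0.5 * (lo + hi)
    f_a = math.exp(-a / 2) / math.sqrt(2 * math.pi * a)  # chi^2_1 pdf at a
    return tau + 2 * a * f_a

p = 100
vals = [p * G(m / p) for m in range(1, p)]  # pG(m/p) = m + 2 p a f(a)
print(all(y > x for x, y in zip(vals, vals[1:])))  # increasing in m
```

This matches the derivative computed above: dG/dτ = a > 0 away from τ = 1, so the grid values increase strictly.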

Proof of Lemma 5
We know from Lemma 1 that the stated identity holds. Next, note the following bound. Hence, the result follows, which finishes the proof.

Proof of Lemma 6

Write T_a = Σ_{t=1}^a η_t and note that e^{λT_a} is a super-martingale for all λ > 0. The following chain of inequalities holds for some global constant A, where the fifth inequality follows from Doob's martingale inequality. The remaining bound then completes the argument.

Proof of Lemma 7
In this section, we define the event that E_a holds for a given tuple (i, j) to be E_a^{(i,j)}, for a = 1, ..., 11. We know from the proofs of Propositions 1 and 2 that P(E_1^{(i,j)}) > 1 − A_1 e^{−ψ} holds. A Bonferroni correction over all possible tuples (i, j) then gives P(E_1) > 1 − A_1 n² e^{−ψ}. Furthermore, a corresponding per-tuple bound holds, with the inequality following from the tail bounds proven in [17]. A Bonferroni correction then gives P(E_2) > 1 − n² e^{−ψ}. Next, note that for any fixed i, j, and c the stated bound holds for all s ≥ 0, with the last inequality again following from [17]. A Bonferroni correction then gives P(E_3) > 1 − n² e^{−ψ}. Next, we can use the well-known tail bounds on the normal distribution to show that a per-set bound holds for a constant A_4. A Bonferroni correction over all possible sets S, followed by one over the indices i and j, then proves that P(E_4) > 1 − (A_4 e) n² e^{−ψ}. Next, for fixed i and j, the corresponding bound holds; a Bonferroni correction over all indices i and j then gives P(E_5) > 1 − (1 + A_1) n² e^{−ψ}. Next, we note that a per-set bound holds for some constant A_6, and a Bonferroni correction over the sets S then gives the corresponding uniform bound. Therefore, proving that constants A_7, A_8, and A_9 exist such that P(E_7^{(i,j',j)}) > 1 − A_7 e^{−ψ}, P(E_8^{(i,j',j)}) > 1 − A_8 e^{−ψ}, and P(E_9^{(i,j',j)}) > 1 − A_9 e^{−ψ} hold for fixed i, j', and j is equivalent to proving the existence of constants A_1, A_2, and A_3 such that the corresponding bounds on P(E_1^{(i,j)}), P(E_2^{(i,j)}), and P(E_3^{(i,j)}) hold. This was already done earlier in the proof. A Bonferroni correction over all possible i, j', and j then yields the uniform bounds. A further observation shows that P(E_10^{(i,j)}) > 1 − A_10 e^{−ψ} for some constant A_10. The cardinality of the set of allowed tuples (i, j) is strictly less than n². Consequently, P(E_10) > 1 − A_10 n² e^{−ψ}. The same argument can be used to show that P(E_11^{(i,e_k,j)}) > 1 − A_11 e^{−ψ}. A Bonferroni correction over all triplets (i, e_k, j) then proves that P(E_11) > 1 − A_11 n³ e^{−ψ}.
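Every Bonferroni step above has the same shape: a per-tuple failure bound A e^{−ψ} is multiplied by the number of tuples. The toy computation below illustrates why bounds of the form 1 − A n² e^{−ψ} tend to one when ψ grows like a multiple of log n; the constant A = 2, the length n = 1000, and the choice ψ = 4 log n are hypothetical.

```python
import math

A, n = 2.0, 1000                 # hypothetical constant and sequence length
psi = 4 * math.log(n)            # penalty growing logarithmically in n
per_tuple = A * math.exp(-psi)   # bound on one P(E^{(i,j)} fails)
union = n ** 2 * per_tuple       # Bonferroni over fewer than n^2 tuples
print(math.isclose(union, A / n ** 2))  # = A * n^{2-4}, vanishing in n
```

Any ψ ≥ (2 + δ) log n with δ > 0 makes the union bound vanish, which is why the events E_1, ..., E_11 can all be assumed to hold simultaneously with high probability.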

Proof of Lemma 8
This Lemma can be proven using straightforward algebra.
Next, we note that the following holds for all a, b, y, and z, from which the result thus follows.

Proof of Lemma 11
This Lemma proves that MVCAPA has power to detect anomalous regions. We begin by considering the case in which J_k is dense, where the first inequality follows from E_10 and E_3, the second from the AM-GM inequality, the third from the condition on j − i + 1, and the last one holds if C exceeds a global constant. The proof for the case in which J_k is sparse is almost identical: there, the first inequality follows from E_4, the second from the AM-GM inequality, the third from the condition on j − i + 1, and the last one holds if C exceeds a global constant.
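The dense/sparse dichotomy in this Lemma mirrors how a penalised saving statistic can be computed in practice: per-component savings are summed and compared against a penalty of the form Cψ + C|J| log p (sparse) or Cψ + C√(pψ) (dense). The following is a simplified illustrative sketch, not the paper's implementation; the function name, the constant C = 3, and the greedy component selection are assumptions.

```python
import numpy as np

def penalised_saving(x, s, e, C=3.0):
    """Illustrative penalised saving of fitting segment (s, e] of the
    p-variate series x (n x p) as anomalous, under a Gaussian
    mean-change model with known unit variance."""
    n, p = x.shape
    psi = np.log(n)
    # per-component (unpenalised) saving from fitting a mean change
    savings = (e - s) * x[s:e].mean(axis=0) ** 2
    # dense regime: all components flagged, penalty C*(psi + sqrt(p*psi))
    dense = savings.sum() - C * (psi + np.sqrt(p * psi))
    # sparse regime: keep the k strongest components, each paying C*log(p)
    ordered = np.sort(savings)[::-1]
    sparse = max(
        ordered[:k].sum() - C * (psi + k * np.log(p)) for k in range(1, p + 1)
    )
    return max(dense, sparse)

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 10))
x[50:80, :3] += 2.0  # collective anomaly in 3 of 10 components
print(penalised_saving(x, 50, 80) > penalised_saving(x, 100, 130))
```

Because the per-component savings can be sorted once, the best sparse subset for every cardinality k is found greedily, which is what makes the two penalty regimes cheap to compare.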

Proof of Lemma 12
This Lemma prevents fitted changes from containing too many observations belonging to the typical distribution. We limit ourselves to proving the result for the first case, since the proof of the second case is symmetrical. We begin by proving the result for the case in which |J_k| ≤ k*. Writing e' = e_k + 10C/∆_k², we obtain the intermediate bound, where the first inequality follows from E_11 and the second inequality from the AM-GM inequality. Combining all of the above, we obtain that

C(x_{i:j}, J_k) ≥ C(x_{i:s'}, J_k) + C(x_{(s'+1):e_k}, J_k) + C(x_{(e_k+1):e'}, J_k) ≥ C(x_{i:e_k}, J_k) + (19/20)C(ψ + |J_k| log(p)) + (C − 2)(ψ + |J_k| log(p)),

where the second inequality follows from Lemma 10 and E_2. The proof for the case in which |J_k| > k* is very similar. We then have that

C(x_{i:j}, 1) ≥ C(x_{i:s'}, 1) + C(x_{(s'+1):e'}, 1) + C(x_{(e'+1):j}, 1) − 2Cψ − 2(C + 2)√(pψ) ≥ C(x_{i:s'}, 1) + C(x_{(s'+1):e'}, 1) − (C + 6)(ψ + √(pψ)),

with the first inequality following from Lemma 8 and the event E_3, and the second being due to E_2. The remainder of the proof of the Lemma is very similar to the sparse case and has therefore been omitted.

Proof of Lemma 13
This Lemma shows that the optimal partition cannot contain fitted segments containing more than 10C/∆_k² observations from both the typical distribution and a dense anomalous region. If J = 1, the result follows a fortiori from Lemma 12. Assume now that |J| ≤ k*. As in the proof of Lemma 12, we limit ourselves to proving the first case, the proof of the other one being symmetrical. The following holds:

C(x_{i:j}, J) = C(x_{i:j}, 1) − (p + Cψ + C√(pψ)) + Σ_{c ∉ J} (j − i + 1)(x̄^c_{i:j})² + Cψ + C|J| log(p)
 ≥ C(x_{i:j}, 1) − C(ψ + √(pψ)) − 2√(pψ) − 2ψ − 2|J| log(p) + Cψ + C|J| log(p)
 ≥ C(x_{i:e_k}, 1) + 4C(ψ + √(pψ)),

where the first inequality follows from the event E_5 and the second one from Lemma 12 and a choice of C exceeding some global constant.

Proof of Lemma 15
This Lemma shows that if a fitted segment contains observations belonging to the typical distribution, it can be trimmed to contain only anomalous observations without increasing the penalised cost by more than O(1). We begin by proving the sparse case, in which the first and third inequalities follow from the fact that introducing free splits reduces the un-penalised cost, whilst the second and fourth inequalities follow from E_1. Note that if j' = j and/or i' = i, the first and second and/or the third and fourth steps are not necessary; the result nevertheless holds. A similar proof can be derived for the dense case, with the first and third inequalities following from Lemma 9, and the second and fourth from E_2.

Proof of Lemma 17
This Lemma shows that if removing a fitted sparse segment does not result in a reduction in penalised cost of O((1/20)C), then the increase in penalised cost incurred for replacing it with the sparse ground truth is O((1/20)C). We will use a strategy very similar to the one used to prove Lemma 18. We begin by noting the decomposition below. If |J| > (19/20)|J_k|, E_1 bounds (18), and the result follows, where the first inequality follows from the fact that |J_k| < |J| + |A| and the second one holds if C exceeds a global constant. This finishes the proof.

Proof of Lemma 19
This Lemma bounds the increase in penalised cost incurred when transitioning from a fitted dense segment to the sparse ground truth. We have that

C(x_{i:j}, J_k) − C(x_{i:j}, 1) = C|J_k| log(p) − C√(pψ) − p + Σ_{c ∉ J_k} (j − i + 1)(η̄^c_{i:j})²
 ≤ C|J_k| log(p) − C√(pψ) − p + p + 2ψ + 2√(pψ)
 = C|J_k| log(p) − C√(pψ) + 2√(pψ) + 2ψ
 ≤ 2√(pψ) + 2ψ,

for large enough C. Here the first inequality follows from E_2 and the second inequality holds because |J_k| ≤ k*.
holds. In particular, we therefore have that where the first inequality follows from using a partition which cuts 10C