Changepoint Detection in the Presence of Outliers

Many traditional methods for identifying changepoints can struggle in the presence of outliers, or when the noise is heavy-tailed. Often they will infer additional changepoints in order to fit the outliers. To overcome this problem, data often needs to be pre-processed to remove outliers, though this is difficult for applications where the data needs to be analysed online. We present an approach to changepoint detection that is robust to the presence of outliers. The idea is to adapt existing penalised cost approaches for detecting changes so that they use loss functions that are less sensitive to outliers. We argue that loss functions that are bounded, such as the classical biweight loss, are particularly suitable -- as we show that only bounded loss functions are robust to arbitrarily extreme outliers. We present an efficient dynamic programming algorithm that can find the optimal segmentation under our penalised cost criteria. Importantly, this algorithm can be used in settings where the data needs to be analysed online. We show that we can consistently estimate the number of changepoints, and accurately estimate their locations, using the biweight loss function. We demonstrate the usefulness of our approach for applications such as analysing well-log data, detecting copy number variation, and detecting tampering of wireless devices.


Introduction
Changepoint detection has been identified as one of the major challenges for modern, big data applications (National Research Council, 2013). The problem arises when analysing data that can be ordered, for example time-series or genomics data, where observations are ordered by time or by position on a chromosome respectively. Changepoint detection refers to locating the points in time or position where some aspect of the data of interest, such as location, scale or distribution, changes. There has been an explosion in methods for detecting changes in recent years (e.g. Frick et al., 2014; Fryzlewicz, 2014; Dette and Wied, 2015; Haynes et al., 2016b, and references therein), in part motivated by the range of applications for which changepoint detection is important.
What has received less attention is the problem of distinguishing between changepoints and outliers.
To give an example of the issue outliers can cause when attempting to detect changepoints, consider the motivating application of detecting changes in well-log data. An example of such data, taken originally from Ó Ruanaidh and Fitzgerald (1996), is shown in Figure 1. This data was collected from a probe being lowered into a bore-hole. As it is lowered, the probe takes measurements of the rock that it is passing through. As the probe moves from one type of rock strata to another, there is an abrupt change in the measurements. It is these changes in rock strata that we wish to detect. The real motivation for collecting this data was to detect these changes in real-time. This would enable changes in the rock strata that are being drilled through to be quickly detected, so that appropriate changes to the settings of the drill can be made.
The data in the top-left plot in Figure 1 has been analysed by many different change detection methods (e.g. Ó Ruanaidh and Fitzgerald, 1996; Fearnhead, 2006; Adams and MacKay, 2007; Wyse et al., 2011; Ruggieri and Antonellis, 2016, amongst many others). However, this plot actually shows data that has been pre-processed to remove outliers. The real data that was collected by the probe contains substantial outliers, and so we need to distinguish between changes and outliers when analysing the real data.
This lack of robustness for detecting a change in mean in the presence of outliers for many changepoint methods stems from explicit or implicit modelling assumptions of Gaussian noise. For example, methods based on a likelihood-ratio test for detecting a change (Worsley, 1979), or that use a penalised likelihood approach to detect multiple changes (Killick et al., 2012), or that do a Bayesian analysis (Yao, 1984), may be based on a Gaussian likelihood and thus explicitly assume Gaussian noise. Alternative methods, such as those using an L2 (squared error) loss or a cusum based approach (Page, 1954), may not make such an assumption explicitly. However the resulting methods are closely related to those based on a Gaussian likelihood (see e.g. Hinkley, 1971), and thus implicitly make similar assumptions. Whilst these methods show some robustness to heavier-tailed noise (Lavielle and Moulines, 2000), in practice they can seriously over-estimate the number of changes in the presence of outliers.
Our approach is based on ideas from robust statistics, namely replacing an L2 loss with an alternative loss function that is less sensitive to outliers. We then use such a loss function within a penalised cost approach to estimating multiple changepoints. The use of alternative loss functions as a way to make changepoint detection robust to outliers has been considered before (e.g. Hušková and Sen, 1989; Hušková, 1991; Hušková and Picek, 2005; Hušková, 2013). That work derives cusum-like tests for a single changepoint. Such a test for a single changepoint can then be used with binary segmentation to find multiple changes. As we discuss more fully in Section 2.2, this approach suffers from two drawbacks compared to a penalised cost approach. First, the test statistic is based upon how well the data can be modelled as not having a change, and does not directly compare this with how well we can fit the data with one or more changepoints. Thus it could spuriously infer a change if we have a cluster of outliers at consecutive time-points, even if the values of those outliers are not consistent with them coming from the same distribution. Secondly, for detecting multiple changepoints, this method suffers from the difficulties that binary segmentation can have, for example a lack of power to detect a short segment if the distributions before and after that segment are similar.
One challenge with the penalised cost approach that we suggest is minimising this cost, which we need to do to infer the changepoints. We show how recent efficient dynamic programming algorithms (Maidstone et al., 2016; Rigaill, 2015) can be adapted to solve this minimisation problem.
Our algorithm can use any cost function provided we are interested in a change in a univariate parameter, such as the location parameter for univariate data, and the cost function is piecewise quadratic. Importantly these algorithms are sequential in nature, and thus can be directly applied in situations which need an online analysis of the data. We derive bounds on the computational complexity of this algorithm, and present empirical results that suggest the expected computational cost of the algorithm is linear in the number of data points.
Whilst our approach can be used with a range of cost functions, we particularly recommend using a cost function that is bounded. We present a theoretical result that shows that we need a bounded cost function if we wish our method to be robust to any single outlier. The simplest such cost function is the biweight loss (Huber, 2011), which is the pointwise minimum of an L2 loss and a constant. As an illustration of the robustness of this choice, we consider a signal of length 200 with a single change at t = 100. For the first segment we simulate data using a normal distribution with mean 0 and variance 1, and for the second segment a normal distribution with mean 1 and variance 1. We then consider two outlier data-points at positions i = 50 and i = 150. We set the values of these two outliers to d and 1 + d respectively. In Figure 2 (Left) we provide an example of such a profile for d = 4. We consider various values of d from 0 to 10 and, for each d, we assess over 10000 simulated profiles the probability of recovering more than one change for both the L2 loss and the biweight loss that we propose in this paper. Importantly the L2 loss is unbounded whereas the biweight loss is bounded. It can be seen in Figure 2 that for large values of d the L2 loss recovers more than one change whereas our proposed biweight loss does not.
To illustrate the usefulness of our approach, with the biweight loss, in practice, we present its use for three distinct applications. The first is for the online analysis of the well-log data of Figure 1.
Secondly we show that it out-performs existing methods for detecting copy number variation. This includes performing better than methods that pre-process the data in an attempt to remove outliers.
By comparison our approach is easier to implement as it does not require any pre-processing steps.
Finally we consider the problem of detecting tampering of wireless security devices. Results here show our method can reliably distinguish between actual tampering events and changes in the data caused by short-term environmental factors.
Proofs of results are given in the Appendices in the supplementary material. Code implementing the new methods in this paper is available from https://github.com/guillemr/robust-fpop.

Model Definition
Assume we have data ordered by some covariate, such as time or position along a chromosome.
Denote the data by y = (y 1 , . . . , y n ). We will use the notation that, for s ≤ t, the set of observations from time s to time t is y s:t = (y s , . . . , y t ). If we assume that there are k changepoints in the data, this will correspond to the data being split into k + 1 distinct segments. We let the location of the jth changepoint be τ j for j = 1, . . . , k, and set τ 0 = 0 and τ k+1 = n. The jth segment will consist of data points y τ j−1 +1 , . . . , y τ j . We let τ = (τ 0 , . . . , τ k+1 ) be the set of changepoints.
The statistical problem we are considering is how to infer both the number of changepoints and their locations. We assume the changepoints correspond to abrupt changes in the location, that is mean, median or other quantile, of the data. We will focus on a minimum penalised cost approach to the problem. This approach encompasses penalised likelihood approaches to changepoint detection amongst others.

To define our penalised cost, we first introduce a cost function for a single observation, y, and a segment-specific location parameter θ. We denote this as γ(y; θ). For a penalised likelihood approach this cost would be equal to minus the log-likelihood. The class of costs we will consider are discussed below.
We can now define the cost associated with a segment of data, y s:t . This is the minimum, over the segment-specific parameter θ, of the sum of the costs associated with each observation in the segment:

C(y s:t ) = min over θ of Σ_{i=s}^{t} γ(y i ; θ).

Finally we define the penalised cost for a segmentation as

Σ_{j=1}^{k+1} [ C(y τ_{j−1}+1:τ_j ) + β ],   (1)

where β > 0 is a chosen constant that penalises the introduction of changepoints. We then estimate the number and position of the changepoints by the values of k and τ 1:k that minimise this penalised cost. The value of β has a substantial impact on the number of changepoints that are estimated (see Haynes et al., 2016a, for examples of this), with larger values of β leading to fewer estimated changepoints.
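The definitions above can be sketched directly in code (a minimal illustration, not the efficient algorithm developed later in the paper; the function names are ours, and we approximate the minimisation over θ by a grid search):

```python
import numpy as np

def segment_cost(y_seg, loss, thetas):
    """C(y_{s:t}): minimum over theta of the summed per-observation costs,
    approximated by minimising over a finite grid of candidate thetas."""
    return min(np.sum(loss(y_seg, th)) for th in thetas)

def penalised_cost(y, taus, loss, beta, thetas):
    """Penalised cost of a segmentation: each segment contributes its
    cost plus beta. taus lists changepoints, excluding 0 and n."""
    bounds = [0] + list(taus) + [len(y)]
    segs = [y[bounds[j]:bounds[j + 1]] for j in range(len(bounds) - 1)]
    return sum(segment_cost(s, loss, thetas) + beta for s in segs)

# Squared-error loss; larger beta leads to fewer estimated changepoints.
l2 = lambda y, th: (y - th) ** 2
y = np.concatenate([np.zeros(50), 5 * np.ones(50)])
grid = np.linspace(-2, 7, 901)
# The true changepoint at 50 gives a lower penalised cost than no change.
assert penalised_cost(y, [50], l2, beta=2 * np.log(100), thetas=grid) < \
       penalised_cost(y, [], l2, beta=2 * np.log(100), thetas=grid)
```

A grid search over θ is only for illustration: for quadratic losses the inner minimisation has a closed form, and Section 3 avoids it entirely.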
For inferring changes in the mean of the data, it is common to use the squared-error cost function (e.g. Yao and Au, 1989; Lavielle and Moulines, 2000) γ(y; θ) = (y − θ)^2 .
In this case, the penalised cost approach corresponds to a penalised likelihood approach where the data within a segment are IID Gaussian with common variance. Minimising a penalised cost of this form is closely related to binary segmentation procedures based on CUSUM statistics (e.g. Vostrikova, 1981;Bai, 1997;Fryzlewicz, 2014), as discussed in Killick et al. (2012). Use of the square-error cost function results in an approach that is very sensitive to outliers. For example, this cost function was the one used in the analysis of the well-log data in Figure 1, where we saw that it struggles to distinguish outliers from actual changes of interest.

Penalised Costs based on M-estimation
To develop a changepoint approach that can reliably detect changepoints in the presence of outliers we need a cost function that increases at a slower rate in |y − θ|. Standard examples (Huber, 2011) are the absolute error,

γ(y; θ) = |y − θ|;

the winsorised (Huber) cost,

γ(y; θ) = (y − θ)^2 if |y − θ| ≤ K, and K(2|y − θ| − K) otherwise;

the biweight loss,

γ(y; θ) = (y − θ)^2 if |y − θ| ≤ K, and K^2 otherwise;   (2)

or, if interest lies in changes in the uth quantile for 0 < u < 1, the quantile loss

γ(y; θ) = u(y − θ) if y ≥ θ, and (1 − u)(θ − y) otherwise.

These are summarised in Figure 3.
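For concreteness, these losses can all be written as simple functions of the residual r = y − θ (a minimal sketch; the function names are ours, and K and u are the tuning constants from the text):

```python
import numpy as np

def l2_loss(r):
    return r ** 2

def absolute_loss(r):
    return np.abs(r)

def huber_loss(r, K):
    """Winsorised (Huber-type) cost: quadratic near 0, linear beyond K."""
    return np.where(np.abs(r) <= K, r ** 2, K * (2 * np.abs(r) - K))

def biweight_loss(r, K):
    """Bounded: the pointwise minimum of the L2 loss and the constant K^2."""
    return np.minimum(r ** 2, K ** 2)

def quantile_loss(r, u):
    """For changes in the u-th quantile (the pinball loss)."""
    return np.where(r >= 0, u * r, (u - 1) * r)

r = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
# The biweight loss is bounded by K^2 however extreme the residual is.
assert np.all(biweight_loss(r, K=3.0) <= 9.0)
```

Note that the L2, absolute, Huber and quantile losses are all unbounded in |r|; only the biweight loss is bounded, which is the property exploited below.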
We will develop an algorithm for finding the best segmentation under a penalised cost criterion that can deal with any of these choices for the cost. However, in practice we particularly advocate the use of the biweight loss. For a penalised cost approach to detecting changepoints to be robust to extreme outliers we will need the cost function to be bounded. For unbounded cost functions, such as the absolute error loss or winsorised loss, a penalised cost approach will place an outlier in a segment on its own if that outlier is sufficiently extreme. This is shown by the following result.
Theorem 2.1 Assume that the cost function satisfies γ(y; θ) = 0 for θ = y, that for any θ it is monotone increasing in |y − θ|, and that γ(y; θ) → ∞ as |y − θ| → ∞. Choose any t ∈ {1, . . . , n} and fix the set of other observations, y s for s ≠ t. Then there exist values of y t such that the optimal segmentation will have changepoints at t − 1 and t.
If we choose a cost function, such as the biweight loss, that is bounded, then this will impose a minimum segment length on the segmentations that we infer using the penalised cost function.
Providing this minimum segment length is greater than 1, our inference procedure will be robust to the presence of extreme outliers -- unless these outliers cluster at similar values, and for a number of consecutive time-points greater than our minimum segment length.
Theorem 2.2 If the cost function satisfies 0 ≤ γ(y; θ) ≤ K, and we infer changepoints by minimising the penalised cost function (1) with penalty β for adding a changepoint, then all inferred segments will be of length greater than β/K.
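Theorem 2.2 can be checked on a toy example by exhaustive minimisation of the penalised cost (a sketch, feasible only for very small n; function names are ours, and the inner minimisation over θ uses a grid. With the biweight loss bounded by K^2, the theorem implies every inferred segment has length greater than β/K^2, so a single extreme outlier cannot be placed in a segment on its own):

```python
import itertools
import numpy as np

def seg_cost(y_seg, K, thetas):
    # Biweight segment cost: min over theta of sum of min((y - theta)^2, K^2).
    return min(np.sum(np.minimum((y_seg - th) ** 2, K ** 2)) for th in thetas)

def best_segmentation(y, K, beta, thetas):
    """Exhaustive search over all 2^(n-1) changepoint sets."""
    n = len(y)
    best, best_taus = np.inf, []
    for k in range(n):
        for taus in itertools.combinations(range(1, n), k):
            bounds = [0] + list(taus) + [n]
            cost = sum(seg_cost(y[a:b], K, thetas)
                       for a, b in zip(bounds[:-1], bounds[1:])) + beta * k
            if cost < best:
                best, best_taus = cost, list(taus)
    return best_taus

# One extreme outlier: the biweight loss is bounded by K^2 = 4, so with
# beta = 10 every inferred segment must have length greater than
# beta / K^2 = 2.5, and the outlier at position 5 stays in its segment.
y = np.concatenate([np.zeros(5), [100.0], np.zeros(4)])
taus = best_segmentation(y, K=2.0, beta=10.0, thetas=np.arange(0.0, 101.0, 1.0))
bounds = [0] + taus + [len(y)]
assert all(b - a > 10.0 / 4.0 for a, b in zip(bounds[:-1], bounds[1:]))
```

Here isolating the outlier would cost 2β = 20 while saving at most K^2 = 4, so the optimal segmentation has no changepoints at all.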

Alternative Robust Changepoint Methods
There have been other proposed M-procedures for robust detection of changepoints (Hušková and Sen, 1989; Hušková, 1991; Hušková and Picek, 2005; Hušková, 2013). These differ from our approach in that they are based on sequentially applying tests for single changepoints. One approach is to use a Wald-type test. For a loss function γ(y; θ) which depends only on y − θ, define γ(y; θ) = ρ(y − θ) and define φ to be the first derivative of ρ. Then we can estimate a common θ for data y 1:n by minimising

Σ_{i=1}^{n} ρ(y i − θ)

with respect to θ. In many cases this is equivalent to solving

Σ_{i=1}^{n} φ(y i − θ) = 0.

If θ̂ denotes the estimate we obtain, then we can define residuals as φ(y i − θ̂), and their partial sums, or cusums, by

S k = Σ_{i=1}^{k} φ(y i − θ̂), for k = 1, . . . , n.

A Wald-type test is then based on a test-statistic of the form

T n = max over 1 ≤ k < n of |S k | / (σ̂ √n),

where σ̂^2 is an estimate of the variance of the residuals. Large values of T n are taken as evidence for a change. The position of a changepoint is then inferred as the position k which maximises the right-hand side.
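A sketch of this construction with the Huber ψ-function (the fixed-point solver and the exact normalisation of T_n are our choices; the references above use asymptotically equivalent variants):

```python
import numpy as np

def huber_psi(r, K=1.345):
    # Derivative of the Huber loss: linear near 0, constant beyond +/- K.
    return np.clip(r, -K, K)

def m_estimate(y, K=1.345, n_iter=50):
    """Solve sum_i psi(y_i - theta) = 0 by a simple fixed-point iteration
    (psi is 1-Lipschitz, so the iteration is a contraction)."""
    theta = np.median(y)
    for _ in range(n_iter):
        theta = theta + np.mean(huber_psi(y - theta, K))
    return theta

def robust_cusum(y, K=1.345):
    """T_n = max_k |S_k| / (sigma_hat sqrt(n)), with S_k the partial sums
    of the M-residuals psi(y_i - theta_hat); argmax locates the change."""
    theta = m_estimate(y, K)
    resid = huber_psi(y - theta, K)
    S = np.cumsum(resid)[:-1]            # partial sums for k = 1, ..., n-1
    sigma = np.std(resid)
    T = np.abs(S) / (sigma * np.sqrt(len(y)))
    k_hat = int(np.argmax(T)) + 1
    return T.max(), k_hat

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
T, k_hat = robust_cusum(y)
assert 90 <= k_hat <= 110    # change simulated at position 100
```

Because the estimating equation forces the residuals to sum to approximately zero, the partial sums drift away from zero up to the changepoint and back afterwards, which is what the maximum picks up.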
There are two main differences between this approach and ours in terms of detecting a single changepoint. The first is that the behaviour of the cost functions, and hence which are the most appropriate choices, is very different for the two approaches. The Wald-type test statistic is appropriate only for convex loss functions. So, for example, the biweight loss is not appropriate for use with this approach. To see this note that the derivative of the biweight loss satisfies φ(x) = 0 for |x| > K. Thus large abrupt changes in the data will lead to M-residuals which are 0, and hence provide no evidence for a change in the test statistic.
Secondly, any loss function which increases linearly in |y − θ| for sufficiently large |y − θ| will result in φ(y i − θ) being constant for large |y i − θ|. Thus, large residuals will have a bounded contribution to the test statistic. The use of absolute error loss or winsorised loss within a Wald-type test will therefore have a similar robustness to extreme outliers as bounded cost functions have for the penalised cost approach.
Thus for detecting a single change, a Wald-type test with winsorised loss and a penalised cost approach with biweight loss will behave similarly if they use the same, large, K. The main difference is that when detecting a change the Wald-type test statistic does not consider whether the data after a putative changepoint is consistent with data from a single segment. Thus a cluster of outliers of the same sign that occur concurrently but which are very different in value, such as we observe for the well-log data, will produce a similar value for the test-statistic as a set of concurrent observations that are very different to the other data points but are also very similar to one another.
By comparison, the cost-based approach would, correctly, conclude that the latter provides substantially more evidence for the presence of a change.
Note that the penalised cost approach generalises more naturally to detecting multiple changepoints.
The use of these Wald-type tests for detecting a single changepoint is currently used within a binary segmentation procedure in order to infer multiple changepoints. However binary segmentation can sometimes perform poorly as it tries to add changepoints one at a time, and cannot change earlier changepoints that have been detected. See Fryzlewicz (2014) for discussion of the problems with binary segmentation, together with a possible improvement.

Minimising the Penalised Cost
An issue with detecting changepoints using any of these cost functions is how to efficiently minimise the resulting penalised cost over all segmentations. We present an efficient dynamic programming algorithm for performing this minimisation exactly. This algorithm is an extension of the pruned DP algorithm of Rigaill (2010) and the FPOP algorithm of Maidstone et al. (2016) (see also Johnson, 2013) to the robust cost functions. We will call the resulting algorithm R-FPOP.

A Dynamic Programming Recursion
We develop a recursion for finding the minimum cost (1) of segmenting data y 1:t for t = 1, . . . , n.
In the following we let τ denote a vector of changepoints. Furthermore we let S t denote the set of possible changepoints for y 1:t , so S t = {τ = τ 1:k : 0 < τ 1 < · · · < τ k < t} .
Note that S t has 2^(t−1) elements.

The minimum penalised cost for segmenting y 1:t is obtained by minimising (1) over τ ∈ S t . Denote this minimum by Q t , where here and later we use the convention that k is the number of changepoints in τ, and that τ 0 = 0 and τ k+1 = t. First we introduce the minimum penalised cost of segmenting y 1:t conditional on the most recent segment having parameter θ:

Q t (θ) = min over τ ∈ S t of { Σ_{j=1}^{k} [ min over θ' of Σ_{i=τ_{j−1}+1}^{τ_j} γ(y i ; θ') + β ] + Σ_{i=τ_k +1}^{t} γ(y i ; θ) + β },

where we take the first summation on the right-hand side to be 0 if k = 0. Trivially we have Q t = min over θ of Q t (θ). The idea is to recursively calculate Q t (θ) for increasing values of t. To do this, we note that each element of S t is either an element of S t−1 or an element of S t−1 with the addition of a changepoint at t − 1. So

Q t (θ) = min { Q t−1 (θ), Q t−1 + β } + γ(y t ; θ).   (3)

This recursion follows from splitting the minimisation over S t according to whether there is or is not a changepoint at t − 1, interchanging the order of the minimisations, and taking out the common γ(y t ; θ) term. The right-hand side depends only on Q t−1 (θ), the constant Q t−1 and γ(y t ; θ), and so can be computed sequentially in t.
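The recursion (3) can be implemented directly if θ is restricted to a finite grid; the next section shows how to avoid this discretisation, but a grid version is a useful sketch of the logic, including the traceback used to recover the segmentation (function names are ours):

```python
import numpy as np

def grid_rfpop(y, loss, thetas, beta):
    """Sketch of Q_t(theta) = min(Q_{t-1}(theta), min Q_{t-1} + beta)
    + loss(y_t; theta), with theta restricted to a finite grid.
    cp[j] tracks the most recent changepoint for grid point j."""
    n = len(y)
    Q = loss(y[0], thetas)                 # Q_1(theta), up to a constant
    cp = np.zeros(len(thetas), dtype=int)  # most recent changepoint (0 = none)
    best_prev = [0]                        # optimal last changepoint, per t
    for t in range(1, n):
        Qmin = Q.min()
        restart = Q > Qmin + beta          # cheaper to start a new segment
        cp = np.where(restart, t, cp)
        Q = np.where(restart, Qmin + beta, Q) + loss(y[t], thetas)
        best_prev.append(int(cp[np.argmin(Q)]))
    # Trace back through the stored most-recent-changepoint values.
    cps, t = [], best_prev[-1]
    while t > 0:
        cps.append(t)
        t = best_prev[t - 1]
    return cps[::-1]

biweight = lambda yt, th: np.minimum((yt - th) ** 2, 9.0)   # K = 3
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
y[50] = 25.0                               # a single extreme outlier
cps = grid_rfpop(y, biweight, np.linspace(-3, 7, 201), beta=2 * np.log(200))
assert any(95 <= c <= 105 for c in cps)    # change placed near t = 100
```

The single outlier at position 50 does not trigger a change: with the bounded loss it contributes at most K^2 = 9, which is less than the cost 2β of opening and closing a segment around it.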

Solving the Recursion
We now show how we can efficiently solve the dynamic programming recursion from the previous section for cost functions like those introduced in Section 2. We make the assumption that the cost for any observation, γ(y t ; θ), viewed as a function of θ, can be written as a piecewise quadratic in θ.
Note that by quadratic we include the special cases of linear or constant functions of θ, and this definition covers all the costs introduced in Section 2.
As the set of piecewise quadratics is closed under both addition and minimisation, it follows that Q t (θ) can be written as a piecewise quadratic for all t. We summarise Q t (θ) by N t intervals (a i−1 , a i ], and associated quadratics q (t) i (θ). We assume that the intervals are ordered, so a i−1 < a i , and that q (t) i−1 (θ) ≠ q (t) i (θ) for i = 2, . . . , N t . If this were not the case we could merge the neighbouring intervals.
We can split (3) into two steps. The first is

Q* t (θ) = min { Q t−1 (θ), Q t−1 + β },   (4)

and the second is

Q t (θ) = Q* t (θ) + γ(y t ; θ).   (5)

For the first step we calculate Q t−1 by first minimising the N t−1 quadratics defining Q t−1 (θ) on their respective intervals, and then taking the minimum of these minima. We then solve the minimisation problem (4) on each of the N t−1 intervals. For interval i, the solution will either be q (t−1) i (θ), or Q t−1 + β, or we will need to split the interval into two or three smaller intervals, on which the solution alternates between q (t−1) i (θ) and Q t−1 + β. Thus we will end with a set of N t−1 , or more, ordered intervals and corresponding quadratics that define Q* t (θ). We then prune these intervals by checking whether any neighbouring intervals both take the value Q t−1 + β, and merging these if they do. This will lead to a new set of, say, N* t ordered intervals, and associated quadratics, q* t,i (θ).
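The interval-splitting in step (4) reduces to taking the pointwise minimum of one quadratic and the constant Q_{t−1} + β on each interval, splitting at the roots of q(θ) = constant (a sketch; the representation and names are ours, and we assume a > 0):

```python
import math

def min_quadratic_constant(a, b, c, const, lo, hi):
    """Pointwise min of q(theta) = a*theta^2 + b*theta + c and a constant
    on the interval (lo, hi]: returns sub-intervals labelled by whichever
    of 'q' or 'const' attains the minimum there."""
    disc = b * b - 4 * a * (c - const)    # roots of q(theta) = const
    if disc <= 0:
        return [((lo, hi), 'const')]      # q(theta) > const everywhere
    r1 = (-b - math.sqrt(disc)) / (2 * a)
    r2 = (-b + math.sqrt(disc)) / (2 * a)
    cuts = [lo, max(lo, min(r1, hi)), max(lo, min(r2, hi)), hi]
    labels = ['const', 'q', 'const']      # q wins only between the roots
    pieces = []
    for (u, v), lab in zip(zip(cuts[:-1], cuts[1:]), labels):
        if v > u:                         # drop empty sub-intervals
            pieces.append(((u, v), lab))
    return pieces

# q(theta) = theta^2 dips below the constant 1 exactly on (-1, 1):
assert min_quadratic_constant(1.0, 0.0, 0.0, 1.0, -5.0, 5.0) == \
       [((-5.0, -1.0), 'const'), ((-1.0, 1.0), 'q'), ((1.0, 5.0), 'const')]
```

This is why each interval produces at most three sub-intervals in step (4): a quadratic can cross a horizontal line at most twice.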
For each of the N * t intervals from the output of the minimisation problem we then add γ(y t ; θ) to the corresponding q * t,i (θ). This may involve splitting the ith interval into two or more smaller intervals if one or more of the points of change of the function γ(y t ; θ) are contained in it. This will lead to the N t intervals and corresponding quadratics that define Q t (θ).
The above describes how we recursively calculate Q t (θ). In practice we also want to then extract the optimal segmentation under our criteria. This is straightforward to do. For each of the intervals corresponding to different pieces of Q t (θ) we can associate a value of the most recent changepoint prior to t. When we evaluate Q t , we find which interval contains the minimising value of θ; the optimal value for the most recent changepoint prior to t is then the value associated with that interval.
We can store these optimal values for all t, and after processing all data we can recursively track back through these values to extract the optimal segmentation. So we would first find the value of the most recent changepoint prior to n, τ say, then find the value of the most recent changepoint prior to τ . We repeat this until the most recent changepoint is at 0, corresponding to no earlier changepoints.
Pseudo-code for R-FPOP is given in Appendix C. An example of the steps involved in one iteration is given in Figure 4.

Computational Cost of R-FPOP
We now present results which bound the computational cost and storage requirements of R-FPOP.
As above we will assume that γ(y; θ) can be written as a piecewise quadratic with L pieces. The bounds that we get differ depending on whether, for a given y, γ(y; θ) is convex in θ. We first consider the convex case, which includes all the examples in Section 2 except the biweight loss.
Theorem 4.1 If γ is convex in θ and defined in L pieces, R-FPOP stores at most 2t − 1 + t(L − 1) quadratics and intervals at step t.

Corollary 4.2 If γ is convex in θ and defined in L pieces, the space complexity of R-FPOP is O(n), and the time complexity of R-FPOP is O(n^2).
For the biweight loss, which is not convex, we obtain worse bounds on the complexity of R-FPOP. These results give worst-case bounds on the time and storage complexity of R-FPOP. In Section 5 we investigate the time and storage cost of R-FPOP empirically, and observe an average computational cost that is linear in n when the number of changepoints is large, and less than quadratic when there is no changepoint.

Simulation Study: Computational Cost
This paper is mostly concerned with the statistical performance of our robust estimators, so an in-depth analysis of the runtime of our approach is outside its scope. In this section we simply aim to show that our approach is easily applicable to large profiles (n = 10^3 to n = 10^6), in the sense that its runtime is comparable to other commonly used approaches such as FPOP (Maidstone et al., 2016), PELT (Killick et al., 2012), WBS (Fryzlewicz, 2014) and smuceR (Frick et al., 2014).
We used a standard laptop with an Intel Core i7-3687U CPU (2.10GHz, 4 cores) and 7.7 GB of RAM. For the biweight loss, for a profile of length n = 10^6 and in the absence of any true change, the runtime is around 4 seconds, slightly larger than that of FPOP.

Simulation Study: Accuracy
We assessed the performance of our robust estimators using the simulation benchmark proposed in the WBS paper (Fryzlewicz, 2014). In that paper 5 scenarios are considered. These vary in length from n = 150 to n = 2048 and contain a variety of short and long segments, and a variety of sizes of the change in location from one segment to the next. We considered an additional scenario from Frick et al. (2014), corresponding to scenario 2 of WBS with a standard deviation of 0.2 rather than 0.3. In our simulation study we are interested in how the presence of outliers or heavy-tailed noise affects different changepoint methods, and so we test each method assuming either t-distributed noise, or noise that is from a mixture of Gaussians. The underlying signals and example data for three of the scenarios are shown in Figure 6. For all approaches we need to choose the value K in the cost function and the penalty/threshold for adding a changepoint. These will depend on the standard deviation of the noise. Our approach is to estimate this standard deviation using the median absolute deviation of the differenced time-series, as in Fryzlewicz (2014), which we denote as σ̂. We compared our various robust estimators (Huber and biweight loss) to binary segmentation using the robust cusum test (Hušková and Sen, 1989), described in Section 2.2 (Cusum). For the biweight loss we chose K = 3σ̂, so that residuals that are extreme according to a Gaussian model are treated as outliers. For the Huber loss we chose K = 1.345σ̂, a standard choice for trading statistical efficiency of estimation against robustness. We further set the penalty/threshold to be λ = 2σ̂^2 log(n) E(φ(Z)^2), where φ is the gradient of the loss function and Z is a standard Gaussian random variable. This is based on the Schwarz information criterion, adapted to account for the variability of the loss function that is used (see, e.g., the theoretical results in Hušková and Marušiaková, 2012, for further justification).
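These tuning constants can be computed as follows (a sketch; we estimate E(φ(Z)^2) by Monte Carlo, and the factor 1.4826/√2 converts the median absolute deviation of the differenced series into an estimate of the noise standard deviation, since differencing a Gaussian series multiplies the standard deviation by √2):

```python
import numpy as np

def mad_sigma(y):
    """Estimate the noise standard deviation from the differenced series,
    as in Fryzlewicz (2014): robust to changes in the underlying mean."""
    d = np.diff(y)
    return 1.4826 * np.median(np.abs(d - np.median(d))) / np.sqrt(2)

def sic_penalty(n, sigma, phi, n_mc=200_000, seed=0):
    """lambda = 2 sigma^2 log(n) E(phi(Z)^2), with the expectation
    approximated by Monte Carlo over standard Gaussian draws."""
    z = np.random.default_rng(seed).normal(size=n_mc)
    return 2 * sigma ** 2 * np.log(n) * np.mean(phi(z) ** 2)

# Gradient of the biweight loss with K = 3: phi(x) = 2x on [-K, K], 0 outside.
phi_biweight = lambda x, K=3.0: np.where(np.abs(x) <= K, 2 * x, 0.0)

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 2, 500), rng.normal(10, 2, 500)])
sigma = mad_sigma(y)
assert 1.7 < sigma < 2.3        # true noise standard deviation is 2
lam = sic_penalty(len(y), sigma, phi_biweight)
```

Because it differences the series and uses the median, the scale estimate is essentially unaffected by the changepoint at position 500.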
We also compared to just using the standard square-error loss: implemented using FPOP (Maidstone et al., 2016); and to the WBS (Fryzlewicz, 2014) approach that uses a standard cusum test statistic for detecting changepoints.
We used λ = 2 log(n), which is consistent with our choice for R-FPOP, and is the value that gave the best results for these methods across the 6 scenarios when there is normal noise (see Maidstone et al., 2016).
We consider analysing data where the noise is from a t-distribution. We vary the degrees of freedom from 3 to 100 to see how the heaviness of the tails of the noise affects the performance of the different methods.
In Figures 7 and 8 we show the results of all approaches as a function of the degrees of freedom. We compare methods based on how well they estimate the underlying piecewise constant mean function, measured in terms of mean square error, and how well they estimate the segmentation, measured using the normalised rand-index. Our estimate of the mean function is based on estimating the mean within each inferred segment as the sample mean of the data in that segment. The normalised rand-index measures the overlap between the true segmentation and the inferred segmentation, with larger values indicating a better estimate of the segmentation.
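The two evaluation metrics can be sketched as follows (our implementation uses the plain rand-index over pairs of positions; the normalisation used in the figures may differ):

```python
import numpy as np

def segment_labels(n, taus):
    """Label each position by its segment index, given changepoints taus."""
    labels = np.zeros(n, dtype=int)
    for t in taus:
        labels[t:] += 1
    return labels

def rand_index(taus_true, taus_est, n):
    """Proportion of pairs of positions on which the two segmentations
    agree (same segment in both, or different segments in both)."""
    a = segment_labels(n, taus_true)
    b = segment_labels(n, taus_est)
    agree = (a[:, None] == a[None, :]) == (b[:, None] == b[None, :])
    iu = np.triu_indices(n, k=1)
    return agree[iu].mean()

def segmentation_mse(y, mu_true, taus_est):
    """MSE of the piecewise-constant fit using per-segment sample means."""
    labels = segment_labels(len(y), taus_est)
    means = {l: y[labels == l].mean() for l in np.unique(labels)}
    fit = np.array([means[l] for l in labels])
    return np.mean((fit - mu_true) ** 2)

assert rand_index([50], [50], 100) == 1.0      # perfect recovery
assert rand_index([50], [], 100) < 1.0         # missed changepoint
mu = np.concatenate([np.zeros(50), np.ones(50)])
assert segmentation_mse(mu, mu, [50]) == 0.0   # noiseless, correct split
```

The pairwise formulation makes the rand-index insensitive to small shifts in changepoint location, in contrast to exact-match criteria.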
In terms of mean square error, for almost all scenarios we consider, the biweight loss performs best when the degrees of freedom are small. It also appears to lose little in terms of accuracy when the degrees of freedom are large, and hence the noise is close to Gaussian. The robust cusum approach also performs well when the degrees of freedom are small, but in most cases it shows a marked drop in accuracy relative to the alternative methods when the noise is close to Gaussian. The results in terms of the quality of the segmentation, as measured using the rand-index, are more mixed. The biweight loss is clearly best in scenarios 1 and 2, but performs poorly for scenario 3. Here the use of the Huber loss appears to give the best results across the different scenarios. Again we see that the use of the L2 loss, using either FPOP or WBS, performs poorly when the degrees of freedom are small.

Online analysis of well-log data
We return to the motivating application of analysing the well-log data of Figure 1. For this data, due to the presence of substantial outliers, we choose to use the biweight loss function, as to be robust to such outliers we need a bounded loss function (see Theorem 2.1). We set the threshold, K in (2), to be twice an estimate of the standard deviation of the observation noise. We set β to be 70 times the estimated variance of the noise. This is larger than the BIC penalty, but this is needed due to the presence of auto-correlation in the observation noise (Lavielle and Moulines, 2000), and is the same penalty used for the analysis presented in Figure 1. Choice of this penalty is important in practice, and for this type of application we would recommend testing a range of penalties on some training data (using the CROPS algorithm of Haynes et al., 2016a) to learn an appropriate penalty. Figure 9 shows the estimated changepoints we obtain from a batch analysis of the data. As we can see, using the biweight loss makes the changepoint detection robust to the presence of the outliers. All obvious changes are detected, and we do not detect a change at any point where the outliers cluster.
As mentioned in the introduction, the motivation for analysing this data requires an online analysis.
We present output from such an online analysis in the right-hand plot of Figure 9. Here we plot the estimate of the most recent changepoint prior to t, given data y 1:t , as a function of t. To help interpret the result we also show the locations of the changepoints inferred from the batch analysis. We see that we are able to quickly detect changes when they happen, and we have only one region where there is some fluctuation in where we estimate the most recent changepoint. Whilst by eye the plot may suggest we immediately detect the changes, there is actually some lag. This is inevitable when using the biweight loss, due to the presence of a minimum segment length that can be inferred (see Theorem 2.2). The lag in detecting the changepoint is between 21 and 27 observations for all except the final changepoint. The final inferred changepoint is less pronounced, and is not detected until after a lag of 40 observations. This lag can be reduced by increasing K, but at the expense of less robustness to outliers. Note the region of fluctuation over the estimate of the most recent changepoint corresponds to uncertainty about whether there are changepoints in the last inferred segment (corresponding to the final two changepoints inferred in the bottom-left plot of Figure 1). One disadvantage of detection methods that involve minimising a penalised cost, and of other methods that produce a single estimate of the changepoint locations, is that they do not quantify the uncertainty in that estimate. A standard way to deal with those is to use the smooth.CNA function of the well-known DNAcopy package (Bengtsson et al., 2016). This function shrinks outliers towards the values of their neighbours.

Estimating Copy Number Variation
Once this is done one can run a preferred segmentation approach. As we will see below, this heuristic preprocessing procedure greatly improves changepoint detection. We want to compare such a two-stage approach to a simpler analysis where we analyse the data using our penalised cost approach with the biweight cost.
To assess the performance of our approach on DNA copy number data we used the jointseg package. Overall our robust biweight loss outperforms both the L2 loss following outlier removal and the Cusum approach. For low tumor fractions (0.3 and 0.5 for GSE29172, and 0.34 for GSE11976) the biweight loss is possibly only slightly better than the Cusum approach. For a tumor fraction of 1 the biweight loss is slightly better than the L2 loss following outlier removal. In the other cases it is clearly better. Results are shown for the two datasets and tumor fractions of 0.7 and 0.79 in Figure 11. Results for other tumor fractions are provided in figures in Appendix D.

Wireless Tampering
We now consider an application which looks at security of the Internet of Things (IoT). Many IoT devices use WiFi to communicate. Often, for example with surveillance systems, these need a high level of security. Thus it is important to be able to detect if a device has been tampered with.
WiFi signals include a "preamble" which is used by the receiver to determine channel state. One approach that can be used to detect tampering is to monitor channel state variation (Bagci, 2016).
Abrupt changes in the channel state could indicate a tampering event. However, changes can also be caused by less sinister events, such as movement of people within the communication environment. Thus the challenge is to detect a change caused by tampering, as opposed to any "outliers" caused by such temporary environmental factors. Figure 12 shows some time-series of channel state information (CSI) that has been extracted from the preamble of a signal sent by a single IoT device. This data is taken from Bagci et al. (2015), where a controlled experiment was performed, with an actual tampering event occurring after 22 minutes. Before this tampering event there was movement of people around the device, which has a short-term effect on the time-series data.
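Tuning for this trade-off amounts to choosing the biweight truncation K: by Theorem 2.2 no inferred segment can be shorter than β/K, so a desired minimum segment length fixes K given the penalty β. A hedged sketch, with hypothetical numbers (the SIC-style choice β = 2 log n, the sampling rate, and the unit-variance scaling are assumptions of this sketch, not values from the text):

```python
import math

def k_for_min_segment(beta: float, min_seg_len: int) -> float:
    """Biweight truncation K enforcing a minimum segment length.

    Theorem 2.2-style argument: segments shorter than beta / K cannot
    appear in the optimal segmentation, so K = beta / min_seg_len gives
    a floor of min_seg_len observations per segment.
    """
    return beta / min_seg_len

# hypothetical numbers: SIC-style penalty (variance assumed scaled to 1),
# and a 20-second floor at an assumed 5 Hz sampling rate
n = 10_000
beta = 2 * math.log(n)
min_seg = 20 * 5            # 20 s * 5 samples/s = 100 observations
K = k_for_min_segment(beta, min_seg)
```

Larger K shortens the enforced minimum segment length (faster detection) at the cost of a larger bounded loss per outlier, i.e. less robustness.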
In practice the channel state information from an IoT device is multi-dimensional, and we show time-series for 6 out of 90 dimensions. Whilst ideally we would jointly analyse the data from all 90 time-series that we get from the device, we will just consider analysing each time-series individually. Our interest is in how viable it is to use our approach, with the biweight loss, to accurately distinguish between a tampering event and any effects due to temporary environmental factors. The six time-series each show different patterns, both in terms of the change caused by tampering and the effect of people walking near the device. As such they give a thorough test of any approach. We implemented the biweight loss with the SIC penalty for a change, and with K chosen so that the minimum segment length (see Theorem 2.2) corresponds to a period of 20 seconds. Results are shown in Figure 12, where we see that in all cases we accurately detect only the change that corresponds to the tampering event.

Appendix A Proofs from Section 2
Proof of Theorem 2.1. Consider any segmentation of the data that does not include changepoints at both t − 1 and t. We will show that, for sufficiently large y t , adding changepoints at both t − 1 and t (or at just one of these if the segmentation has a change at the other time-point) will reduce the penalised cost. Thus the optimal segmentation must have changes at both t − 1 and t.
Let the segment in the original segmentation that contains y t be y s:u , for s < t and u > t. The change in cost between this segmentation and one with changepoints added at t − 1 and t is
\[
\min_\theta \sum_{i=s}^{t-1} \gamma(y_i;\theta) \;+\; \min_\theta \gamma(y_t;\theta) \;+\; \min_\theta \sum_{i=t+1}^{u} \gamma(y_i;\theta) \;+\; 2\beta \;-\; \min_\theta \sum_{i=s}^{u} \gamma(y_i;\theta). \tag{5}
\]
Here we have used the fact that the only change in cost is for fitting y s:u , as the other segments are unchanged. The new segmentation has segments y s:t−1 , {y t } and y t+1:u , and introduces two extra changepoints. The cost of the segment containing just y t is min θ γ(y t ; θ) = 0. We need to show that for large enough y t the expression (5) is negative. To see this we use the fact that
\[
\min_\theta \sum_{i=s}^{u} \gamma(y_i;\theta) \;\ge\; \min\left\{ \gamma\!\left(y_t;\tfrac{y_t+y_{t+1}}{2}\right),\; \gamma\!\left(y_{t+1};\tfrac{y_t+y_{t+1}}{2}\right) \right\},
\]
since for any θ at least one of y t , y t+1 lies at distance at least |y t − y t+1 |/2 from θ, and the right-hand side tends to infinity as y t → ∞. The other terms in (5) do not depend on y t , and hence (5) will be negative for sufficiently large y t .
The proof for the boundary cases s = t or u = t follows similarly.
Proof of Theorem 2.2. We will prove this by showing that, for any segmentation with a segment shorter than β/K, we can reduce the penalised cost by removing the changepoint at either the start or end of that segment.
Consider a segmentation of the data with neighbouring segments y s:t and y t+1:u . If either segment is shorter than β/K then removing the changepoint at t will reduce the penalised cost. Without loss of generality assume the first segment is the short one, i.e. t − s + 1 < β/K. The change in cost of removing the changepoint at t is
\[
\min_\theta \sum_{i=s}^{u} \gamma(y_i;\theta) - \min_\theta \sum_{i=s}^{t} \gamma(y_i;\theta) - \min_\theta \sum_{i=t+1}^{u} \gamma(y_i;\theta) - \beta
\;\le\; \sum_{i=s}^{t} \gamma(y_i;\hat\theta) - \min_\theta \sum_{i=s}^{t} \gamma(y_i;\theta) - \beta
\;\le\; (t-s+1)K - \beta,
\]
where \(\hat\theta\) is the value of θ that minimises the cost for segmenting y t+1:u . The first inequality uses the fact that the cost for segmenting y s:u is no more than the cost obtained by fixing θ at the optimal value for segmenting y t+1:u . The second inequality bounds the contribution of each of y s , . . . , y t by K and uses the fact that the costs are non-negative. This change in cost is negative, as required, because t − s + 1 < β/K.
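The contrast between the two theorems can be seen numerically with a small brute-force implementation of the penalised cost. This is a hedged sketch, not the paper's pruned dynamic programming algorithm; in particular, minimising the biweight segment cost over a grid of observed values is a simplification that happens to be exact for the noise-free example below.

```python
import numpy as np

def sq_cost(seg):
    """Squared-error segment cost: min over theta of sum of (y_i - theta)^2."""
    return float(np.sum((seg - seg.mean()) ** 2))

def biweight_cost(seg, K):
    """Biweight segment cost min_theta sum min((y_i - theta)^2, K^2),
    minimised over the observed values as a simplifying grid."""
    return float(min(np.sum(np.minimum((seg - th) ** 2, K ** 2))
                     for th in np.unique(seg)))

def optimal_partition(y, cost, beta):
    """O(n^2) optimal-partitioning DP for the penalised cost:
    sum of segment costs plus beta per changepoint."""
    n = len(y)
    Q = np.full(n + 1, np.inf)
    Q[0] = -beta                      # so the first segment pays no penalty
    last = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        for s in range(t):
            c = Q[s] + cost(y[s:t]) + beta
            if c < Q[t]:
                Q[t], last[t] = c, s
    cps, t = [], n                    # back-track the changepoint locations
    while t > 0:
        t = last[t]
        if t > 0:
            cps.append(int(t))
    return sorted(cps)

# noise-free illustration: one true change at t = 30, one gross outlier at t = 15
y = np.array([0.0] * 30 + [10.0] * 30)
y[14] = 100.0
beta, K = 10.0, 3.0
cps_sq = optimal_partition(y, sq_cost, beta)
cps_bi = optimal_partition(y, lambda s: biweight_cost(s, K), beta)
```

As Theorem 2.1 predicts, the unbounded squared-error loss isolates the outlier with two extra changepoints around it, whereas the bounded biweight loss pays at most K^2 for the outlier (less than the 2β cost of isolating it) and returns only the true change.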

Appendix B Proofs from Section 4
Proof of Theorem 4.1. Given that γ is convex, then, following the proof of the worst-case complexity of the pDPA (Rigaill (2015), Appendix A), we get that the function Q t (θ) can be described using at most 2t − 1 intervals, such that on each interval there is a single value for the best time of the most recent changepoint. That is, we can define intervals I k for k = 1, . . . , K, with K < 2t, such that for a given k there exists an s with
\[
Q_t(\theta) \;=\; Q_s + \sum_{i=s+1}^{t} \gamma(y_i;\theta) + \beta \qquad \forall\, \theta \in I_k = [s_k, e_k].
\]
All c t+1:n (θ) are themselves defined in pieces, as sums of γ(y i ; θ). For each γ(y i ; θ) we need to consider L intervals, and thus L − 1 points between intervals. We call these points T i,j , for j = 1, . . . , L − 1. For a given I k = [s k , e k ], define N k to be the number of the T i,j points that lie in the open interval (s k , e k ). As the I k are disjoint, each T i,j can appear in at most one interval, so
\[
\sum_{k=1}^{K} N_k \;\le\; t(L-1).
\]
On I k , R-FPOP will thus define Q t (θ) using N k + 1 intervals. Hence R-FPOP uses
\[
\sum_{k=1}^{K} (N_k + 1) \;\le\; t(L-1) + 2t - 1
\]
intervals to define Q t (θ).
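The counting argument can be mirrored in code: represent a piecewise-quadratic function by its sorted breakpoints, and observe that adding one biweight loss, which has L = 3 pieces with breakpoints at y ± K, increases the piece count by at most L − 1 = 2. A minimal sketch (the data structure is illustrative, not R-FPOP's actual implementation; the bound checked below omits the 2t − 1 candidate-changepoint intervals of the full theorem):

```python
import bisect

def add_biweight_breakpoints(breaks, y, K):
    """Insert the L - 1 = 2 breakpoints of gamma(y; theta) = min((y - theta)^2, K^2).

    `breaks` is the sorted list of interior breakpoints of the current
    piecewise quadratic; the number of pieces is len(breaks) + 1.
    """
    for b in (y - K, y + K):
        i = bisect.bisect_left(breaks, b)
        if i == len(breaks) or breaks[i] != b:   # avoid duplicate breakpoints
            breaks.insert(i, b)
    return breaks

breaks = []            # the initial function has a single piece
K, L = 2.0, 3
for t, y in enumerate([0.1, 5.0, 4.8, 0.2], start=1):
    add_biweight_breakpoints(breaks, y, K)
    # each loss adds at most L - 1 pieces, so after t losses:
    assert len(breaks) + 1 <= t * (L - 1) + 1
```

Each insertion touches only the interval containing the new breakpoint, which is the per-interval O(1) bookkeeping that the time-complexity argument below relies on.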
Proof of Corollary 4.2. The space complexity is obtained using the fact that the number of intervals is bounded by 2n − 1 + n(L − 1) and the fact that on each interval we need to store a quadratic.
As for the time complexity, at step t R-FPOP needs to consider at most 2t − 1 + t(L − 1) ordered intervals. The key to bounding the time complexity is that we can split the operations at step t of R-FPOP into a series of operations on each interval. The cost of each operation on an interval is O(1), and as there are O(t) intervals the overall cost of the t-th iteration is O(t). The details are as follows.
On each of these intervals we will compute the roots of Q t (θ) minus a constant. According to the number of roots, the interval will be split into at most three (because on each interval Q t (θ) is quadratic).

Algorithm 3: Sub-routine to recover the minimum Q t and the best change τ t of Q t (θ).