A Note on Online Change Point Detection

Online change point detection originated in sequential analysis, which has been studied thoroughly for more than half a century, and a variety of methods and optimality results have been established over the years. In this paper, we are concerned with the univariate online change point detection problem, allowing all model parameters to change. We establish a theoretical framework that allows for more refined minimax results, including a phase transition phenomenon and a study of the detection delay under three forms of Type-I error control. We also examine a common belief about the conservativeness of different Type-I error control strategies, and provide results for scenarios with potentially multiple change points.


Introduction
World War II was the birthplace of many statistical methods and theories, among which sequential analysis was pioneered by Abraham Wald in response to a request from the Navy (Section 6 in Wallis, 1980). Since then, sequential analysis has bloomed, driven first by the demands of quality control in manufacturing, and recently by a much broader range of application areas, including climatology, neuroscience, cyber security and finance, to name but a few. The original problem considered in Wald's seminal work (Wald, 1945) is that, given a sequence of independent and identically distributed observations {X_i}_{i=1,2,...}, one wishes to test the null and alternative hypotheses, namely X_i ∼ f_0 and X_i ∼ f_1, respectively. The sequential probability ratio testing procedure proposed in Wald (1945) pre-specifies upper and lower thresholds, and one rejects the null hypothesis when the sequential probability ratio ∏_{i=1}^{t} {f_1(X_i)/f_0(X_i)} crosses these thresholds. A closely related problem is the sequential/online change point detection problem, a generic form of which is stated as follows:
X_i ∼ F_0, i = 1, ..., ∆, and X_i ∼ F_1, i = ∆ + 1, ∆ + 2, ...,
where F_0 ≠ F_1 are two distribution functions and ∆ ≥ 1 is an unknown positive integer called the change point, with ∆ = ∞ encoding the no change point case. The tasks associated with this problem are the following.
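The sequential probability ratio test described above can be sketched in a few lines. The thresholds, toy densities and variable names below are illustrative, not taken from the paper; the densities are specified up to their shared normalising constant, which cancels in the ratio.

```python
import math

def sprt(xs, f0, f1, lower, upper):
    """Minimal sketch of Wald's sequential probability ratio test.

    Accumulates the log likelihood ratio sum_{i<=t} log{f1(X_i)/f0(X_i)}
    and stops the first time it leaves (lower, upper): crossing the upper
    threshold rejects the null (decide 'H1'), crossing the lower one
    accepts it (decide 'H0')."""
    llr = 0.0
    for i, x in enumerate(xs, start=1):
        llr += math.log(f1(x)) - math.log(f0(x))
        if llr >= upper:
            return i, "H1"
        if llr <= lower:
            return i, "H0"
    return len(xs), None  # no decision within the observed sample

# Toy example: N(0,1) versus N(1,1), densities up to the shared constant.
f0 = lambda x: math.exp(-x * x / 2)
f1 = lambda x: math.exp(-(x - 1) ** 2 / 2)

# On the constant sequence X_i = 1, the log likelihood ratio grows by 0.5
# per observation, so the upper threshold 3.9 is first crossed at step 8.
steps, decision = sprt([1.0] * 20, f0, f1, lower=-3.9, upper=3.9)
```

On noisy data, the stopping time is random; Wald's classical analysis relates the two thresholds to the two error probabilities.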
1. If ∆ < ∞, then one would like to provide t̂, an estimator of ∆, such that t̂ > ∆ and the quantity t̂ − ∆ is small. This is studied in Section 2.2.
2. If ∆ = ∞, i.e. there exists no change point, then one would like either to control the Type-I error pr{t̂ < ∞}, the probability of ever declaring a change point, or to control the lower bound of E_∞(t̂), namely the average run length, where the subscript ∞ indicates that ∆ = ∞. This is studied in Section 2.3.
3. If there are potentially multiple change points, i.e. the distribution changes at a sequence of time points ∆_1 < ∆_2 < ... < ∆_K, then one would like to provide a sequence of change point estimators t̂_k, k = 1, ..., K, such that ∆_k < t̂_k < ∆_{k+1}, where ∆_{K+1} = ∞. In addition, one would like to control the upper bound of max_{k=1,...,K}(t̂_k − ∆_k). This is studied in Section 3.
In addition to the aforementioned three tasks, it is also important to study the hardness of the problem. This can be characterized by information-theoretic lower bounds on a certain form of signal-to-noise ratio, including the following.
4.1 A lower bound on the signal-to-noise ratio, which indicates a phase transition in the parameter space: in the low signal-to-noise ratio regime, no algorithm is guaranteed to produce consistent estimators. We will provide definitions of consistency in the sequel.
4.2 A lower bound on the estimation error (t̂ − ∆)_+ which holds for any distribution in the high signal-to-noise ratio regime.
These two tasks will be studied in Section 4.

Relevant literature
As we have mentioned, Wald (1945), as a prelude to sequential analysis, kicked off the statistical research on online change point detection problems. A famous extension of Wald (1945) is the cumulative sum statistic proposed in Page (1954). The optimality of Wald (1945) and Page (1954) was, to the best of our knowledge, first studied in Lorden (1971), which showed that, among all the estimators whose average run lengths are lower bounded by γ, the optimal detection delay rate is of order log(γ)/KL(F_0, F_1) as γ → ∞, where KL(·, ·) is the Kullback–Leibler divergence. Moustakides (1986) and Ritov (1990) reiterated this minimax result and showed that, in the optimality framework studied in Lorden (1971), the cumulative sum statistic is optimal. Similar results have also been derived in the same framework using a change-of-measure argument under more general assumptions in Lai (1981), Lai (1998) and Lai (2001), among others. In almost all of the second half of the 20th century, research on online change point detection focused on optimising the expected delay time. The motivation back then came mainly from the manufacturing sector, with quality control as the central application. The data types studied were mainly univariate sequences, and the results were almost all asymptotic. We refer readers to Lai (1995, 2001) for comprehensive reviews.
Before proceeding, we would like to emphasize that there is a fundamental difference between the optimality results derived in the aforementioned work and the ones developed by us in Section 4.
In short, when considering minimax lower bounds, the previous work only allows the change point location to vary, and the optimality is derived in an asymptotic sense, by letting the lower bound on the average run length diverge. In this paper, when deriving the minimax lower bounds, we let all model parameters vary with the location of the change point, and we allow for fixed sample arguments.
The second act was kicked off by Chu et al. (1996), who formally stated the 'noncontamination' assumption, namely that one has a training data set of size m, i.e. X_i ∼ F_0, i = 1, ..., m. The theoretical results built upon this assumption are asymptotic in the sense of letting m grow unbounded. One can control the Type-I error under this noncontamination condition. Since Chu et al. (1996), a large number of papers have appeared in this line of work, including univariate mean change (e.g. Aue and Horváth, 2004; Kirch, 2008), change in linear regression coefficients (e.g. Aue et al., 2009; Hušková and Kirch, 2012), multivariate mean and/or variance change (e.g. Mei, 2010), univariate nonparametric change (e.g. Hušková et al., 2010; Hlávka et al., 2016; Desobry et al., 2005) and Bayesian online change point detection (e.g. Fearnhead and Liu, 2007), to name but a few. More recent work includes He et al. (2018), which studied sequential change point detection in a sequence of random graphs. Kirch and Weber (2018) used estimating equations as a unified framework covering location shift, linear regression and autoregressive online change point detection. Kurt et al. (2018) converted different high-dimensional and/or nonparametric data to a univariate statistic and used geometric entropy minimisation methods to define an acceptance region. Chen (2019) constructed similarity measures via K-nearest neighbour estimators and then proposed a counting-based statistic to conduct sequential change point detection. Dette and Gösmann (2019) proposed a general framework for sequential change point detection and obtained a limiting distribution of the proposed statistics; the framework can be used to handle high-dimensional and nonparametric cases. Gösmann et al. (2019) exploited a likelihood ratio based method and shared similar core techniques with Dette and Gösmann (2019). Keshavarz et al. (2018) considered online change point detection in a sequence of Gaussian graphical models and obtained asymptotic Type-I and Type-II error controls. Chen et al. (2020) considered online change point detection in a sequence of Gaussian random vectors whose mean changes over time. Comprehensive monographs and survey papers include Siegmund (2013), Tartakovsky et al. (2014) and Namoano et al. (2019).
Lastly, we would like to mention Maillard (2019). To the best of our knowledge, this is the paper most relevant to ours, and it has heavily inspired our work. Maillard (2019) studied a univariate mean online change point detection problem and deployed the Laplace transform to control the probabilities of the events on which the fluctuations are contained within desirable ranges, although the arguments therein remain doubtful. Given the large probability events established via the Laplace transforms, Type-I error controls and large probability detection delays were studied. A claim on the phase transition and a robustness analysis under the multiple change points scenario are also available in Maillard (2019). There are a number of differences between this paper and Maillard (2019). (i) Instead of using the Laplace transform to establish large probability events, we summon concentration inequalities for sub-Gaussian random variables, union bound results and peeling arguments. It was pointed out in the Discussion of Maillard (2019) that other, more advanced tools, including peeling arguments, may improve the results by changing logarithmic terms to iterated logarithmic terms. We are, however, skeptical about the feasibility of such a claim. (ii) In addition to the Type-I error controls, which are available in Maillard (2019), we also provide average run length results, together with a parallel set of results obtained by setting a lower bound on the average run lengths. This is a common practice in applications and is widely used in the existing literature (e.g. Lai, 1981). (iii) Although Maillard (2019) features much more modern arguments than the papers from the 20th century, its results are presented in a more restrictive way. For instance, the 'phase transition' and 'detectability' are presented as properties of the location of the change point only. In this paper, we exploit a signal-to-noise ratio that is a function of the jump size, the variance and the change point location jointly. This setup enables further studies of high-dimensional data problems.
2 Detection delay and Type-I error controls

General setup
In this paper, we study the simplest online change point detection problem, where a sequence of independent univariate sub-Gaussian random variables with a common fluctuation upper bound is collected, and the mean may change at one or multiple time points. In Assumption 1, we state the assumption that will be used in every result of this paper.
Assumption 1. Assume that {X_1, X_2, ...} is a sequence of independent sub-Gaussian random variables satisfying E(X_i) = f_i ∈ R, i = 1, 2, ..., and sup_{i=1,2,...} ‖X_i‖_{ψ_2} ≤ σ, where ‖·‖_{ψ_2} is the Orlicz-ψ_2-norm of a random variable, i.e. for any random variable X, ‖X‖_{ψ_2} = inf{t > 0 : E exp(X²/t²) ≤ 2}.
In the rest of this section, we will first focus on one version of the detection procedure, providing its detection delay in Section 2.2 and its Type-I error control in Section 2.3. We will then discuss two other common alternative procedures, together with their performance and the connections among all three procedures, in Section 2.4. To conclude this section, we will discuss some practical issues in Section 2.5.

When ∆ < ∞
The first result relates to Task 1 listed in Section 1, which states that, when there exists a change point, we would like to detect it as soon as possible after it appears. We formally state this scenario in Assumption 2.
Assumption 2. Assume that there exists a positive integer ∆ ≥ 1 such that f_1 = ... = f_∆ ≠ f_{∆+1} = f_{∆+2} = ..., and let κ = |f_{∆+1} − f_∆| > 0 denote the jump size.
Assumptions 1 and 2 completely characterise the problem with three parameters: the upper bound on the fluctuations σ, the change point location ∆ and the jump size κ. We define the signal-to-noise ratio
κ²∆/σ², (1)
which parallels the signal-to-noise ratio in offline change point detection (e.g. Wang et al., 2018).
For readers who are familiar with offline change point detection problems, the signal-to-noise ratio definition in (1) and the results shown in the sequel will feel like déjà vu. This is indeed the case, and we provide more discussion of this in Section 4.3.
Assumption 3. Assume that for any α ∈ (0, 1), there exists a sufficiently large absolute constant C_SNR > 0 such that ∆κ²σ^{-2} ≥ C_SNR log(∆/α).
In Assumption 3, we provide a lower bound condition on the signal-to-noise ratio. This condition involves a quantity α, which can be interpreted as the upper bound on the Type-I error when there exists no change point; this will be discussed formally in Section 2.3. We remark that κ, σ and α can all be functions of ∆, and in this paper the asymptotic regime is to let ∆ diverge. The constant C_SNR > 0 is required to be sufficiently large. For instance, in the proof of Theorem 1, it is required that C_SNR > 8C_1², where C_1 is solely determined by the sub-Gaussian tail bound and is detailed in (14). We do not claim optimality of the constants.
We now describe the procedure in Algorithm 1, with the cumulative sum statistic stated in Definition 1.
Algorithm 1 Online change point detection.
The function 1{·} in Algorithm 1 is the indicator function, which only takes the values zero and one. This notation is used throughout the paper.
Algorithm 1 scans through the data sequence using the cumulative sum statistic and a sequence of pre-specified threshold values. For any time point t ≥ 2, as long as there exists an integer s ∈ [1, t) such that the corresponding cumulative sum statistic D_{s,t} exceeds the pre-specified threshold b_t, we declare the existence of a change point prior to the current time point t. Algorithm 1 is written in such a way that it never terminates if no change point is declared. In practice, Algorithm 1 can be terminated by the user or when there are no new data points.
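The scan just described can be sketched as follows. Since Definition 1 and the threshold sequence (2) are not reproduced in this excerpt, the sketch uses the standard two-sample CUSUM form consistent with the proofs in Appendix B, thresholds of the stated order σ log^{1/2}(t/α), and an illustrative constant `c1` in place of the theoretical C_1.

```python
import math

def cusum(prefix, s, t):
    """Two-sample CUSUM at split s of the first t points:
    D_{s,t} = sqrt{s(t - s)/t} |mean(X_1..X_s) - mean(X_{s+1}..X_t)|,
    computed from cached prefix sums (prefix[k] = X_1 + ... + X_k)."""
    left = prefix[s] / s
    right = (prefix[t] - prefix[s]) / (t - s)
    return math.sqrt(s * (t - s) / t) * abs(left - right)

def detect(stream, sigma, alpha, c1=1.0):
    """Sketch of the single change point scan: declare a change the first
    time some D_{s,t} exceeds b_t = c1 * sigma * sqrt(log(t/alpha))."""
    prefix = [0.0]
    for t, x in enumerate(stream, start=1):
        prefix.append(prefix[-1] + x)
        if t < 2:
            continue
        b_t = c1 * sigma * math.sqrt(math.log(t / alpha))
        if max(cusum(prefix, s, t) for s in range(1, t)) > b_t:
            return t  # first time a change point is declared
    return math.inf  # no change point ever declared

# A mean shift of size 2 at Delta = 50 in a noiseless toy sequence is
# picked up two observations after the change occurs.
t_hat = detect([0.0] * 50 + [2.0] * 50, sigma=1.0, alpha=0.05)
```

On a sequence with no change, the same call returns `math.inf`, mirroring the non-terminating behaviour of Algorithm 1.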
Theorem 1. For any α ∈ (0, 1), let t̂ be the output of Algorithm 1 with inputs satisfying the following.
• The cumulative sum statistic D_{s,t} is defined in Definition 1.
Let d = t̂ − ∆ be the delay time. It holds that, with probability at least 1 − α, 0 < d ≤ ⌈C_d σ²κ^{-2} log(∆/α)⌉, where C_d > 0 is an absolute constant. The proof of Theorem 1 is in Appendix B and calls on an auxiliary result, Lemma 8. The detailed requirement on the constant C_1 can be found in (14). The choice of C_d is non-vacuous provided that C_SNR > 8C_1². In summary, using the cumulative sum statistic, under the signal-to-noise ratio condition in Assumption 3 and with thresholds of order σ log^{1/2}(t/α), we are able to detect the change point with a delay of order at most σ² log(∆/α)/κ², with probability at least 1 − α. This rate is nearly minimax optimal, off by a factor of log(∆), as we show in Section 4.
It is worth pointing out that Theorem 1 guarantees that, with probability at least 1 − α, we have d > 0. This explains the role of α: it is the de facto Type-I error control. Combining this with Assumption 3, we see that the smaller the tolerance on the Type-I error, the larger the required signal-to-noise ratio and the detection delay.
In fact, Assumption 3 is essentially the minimal condition required for Algorithm 1. Proposition 2 below shows that if the signal-to-noise ratio condition is reversed, as in (4), then with probability at least 1 − α, Algorithm 1 cannot detect the change point.
Proposition 2. For any α ∈ (0, 1), let t̂ be the output of Algorithm 1 with inputs satisfying the following.
• The cumulative sum statistic D_{s,t} is defined in Definition 1.

When ∆ = ∞
When there is no change point, it follows automatically from Theorem 1 that we can control the overall probability of a false positive. To be specific, in the framework of Theorem 1 with ∆ = ∞, it holds that pr{∀t < ∞, t̂ > t} > 1 − α.
To be more specific and informative, we have the following result.
Theorem 3. For any α ∈ (0, 1), let t̂ be the output of Algorithm 1 with inputs satisfying the following.
We have that pr{∀t < ∞, t̂ > t} > 1 − α. (5)
For any pre-specified positive integer T ≥ 2, define the truncated stopping time min(t̂, T). We then have the bounds stated in (6).
The proof of Theorem 3 can be found in Appendix B. The quantity α enters the algorithm through the threshold values defined in (2).
In practice, the data collection process cannot go on forever. The results in (6) describe what happens if we terminate the data collection process after collecting T time points, even if no change point has been declared. The second half of (6) gives a corresponding lower bound on E_∞(t̂) involving α and log(T).

Two variants
Recall that our ultimate goal is to detect the change point as soon as possible if it exists, with the following guarantee: if there is no change point, then with probability at least 1 − α, we do not declare any false alarm. In the existing literature, there are different strategies for controlling false alarms, and there exists a common belief that controlling the overall probability of ever declaring a false alarm is too conservative. In this subsection, we study two common alternatives and provide in-depth comparisons among these strategies.
The first variant considers a lower bound on E_∞(t̂), where the expectation is taken with respect to the distribution under which there is no change point, i.e. ∆ = ∞. This is different from Theorem 1, which upper bounds the overall Type-I error.
Proposition 4. For γ ≥ 2, let t̂ be the output of Algorithm 1 with inputs satisfying the following.
• The cumulative sum statistic D_{s,t} is defined in Definition 1.
We have the following results, (i) and (ii), where d = t̂ − ∆ is the delay time and C_d is an absolute constant. The proof of Proposition 4 is in Appendix B. The constant C_2 satisfies (19). In order to compare the strategy of lower bounding the average run length with that of upper bounding the overall Type-I error, we first compare Theorem 1 and Proposition 4(ii), where a change point is assumed to exist. In terms of the upper bounds on the detection delay, the difference is between log(γ) and log(∆/α). Since it is assumed that γ ≥ ∆, if one further assumes γ ≍ ∆, then these two detection delay upper bounds differ by a factor of log(1/α), which is of order O(1) if α is of constant order. In terms of the probability upper bound, the difference is between γ^{-1} and α. This further suggests that, as long as γ ≍ α^{-1}, γ ≥ ∆ and γ ≍ ∆, the two strategies are equivalent in terms of controlling the detection delay. This connection has been studied before in a slightly different form; see e.g. Lai (1998).
The second variant, instead of upper bounding the overall Type-I error as we do in Theorem 1, upper bounds the Type-I error over any interval of a given length, which is also a common alternative strategy in practice. This is perhaps due to concerns about computational feasibility, as the computational cost can be directly controlled and training can easily be done with historical data. In the result below, we provide a parallel of Theorem 1, with a new definition of the cumulative sum statistic in Definition 2 and a slight twist on Algorithm 1 in Algorithm 2.
Algorithm 2 Online change point detection 2.
end while
OUTPUT: t̂.
Proposition 5. For any α ∈ (0, 1) and γ ≥ 1, let t̂ be the output of Algorithm 2 with inputs satisfying the following.
• The cumulative sum statistic D_{e,s,t} is defined in Definition 2.
We have the following.
(ii) If {X_t}_{t=1,2,...} in addition satisfy Assumptions 2 and 3, and the assumed condition holds with an absolute constant C_3, then the stated bound on the delay time d = (t̂ − ∆)_+ holds.
The proof of Proposition 5 can be found in Appendix B, and the condition on C_3 is in (21). The strategy used in Proposition 5 can be seen as lying in between the strategies used in Theorem 1 and Proposition 4: instead of controlling the overall Type-I error, it only controls the Type-I error over any interval of length γ. The advantage is that, if γ < ∆, then it provides a smaller detection delay, improving from O{σ²κ^{-2} log(∆/α)} to O{σ²κ^{-2} log(γ/α^{1/2})}. The price it pays is that, if γ < ∆, then there is no guarantee that t̂ ≥ ∆, i.e. no guarantee against false alarms. Since the gain is at most a logarithmic term in the detection delay and the loss is a lack of control of false alarms, we suggest that upper bounding the overall Type-I error might be preferable.

Practical issues
We discuss two practical issues in this subsection.
The first issue concerns tuning parameters. In Algorithm 1, we need a sequence of tuning parameters {b_t}, the theoretical requirements on which are detailed in (2). The quantities t and α can be determined by the user, but the constant C_1 and the sub-Gaussian parameter σ remain unknown. This leads to a demand for a tuning parameter selection method in practice. In some situations, one may have access to independent copies of data generated from no change point models. Recall that in Theorem 1 we control the overall Type-I error over all time. In practice, one may wish to set a time limit, say T, and estimate the empirical Type-I errors over the time course [1, T] in order to tune the thresholds. In this sense, the first variant mentioned in Section 2.4 is handier: the tuning parameter can be chosen by setting the average run length equal to a pre-specified γ.
The second issue is computational complexity. The cumulative sum statistic defined in Definition 1 can be rewritten in terms of the partial sums S_s = Σ_{i=1}^s X_i, s = 1, ..., t. Using this, to proceed to time point t ≥ 1, one can store all the partial sums {S_s}_{s=1}^t, so the computational cost of Algorithm 1 at time t is of order O(t), but the storage is also of order O(t). If, as an alternative, one recalculates everything at every time point, then no storage is required, but the computational cost becomes of order O(t²).
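The prefix-sum rewrite can be checked numerically. The identity below is our reading of the standard two-sample form of the statistic (Definition 1 is not reproduced in this excerpt), and the data values are illustrative.

```python
import math

# Check the prefix-sum rewrite of the statistic:
#   D_{s,t} = sqrt{s(t-s)/t} |mean(X_1..X_s) - mean(X_{s+1}..X_t)|
#           = |sqrt{(t-s)/(st)} S_s - sqrt{s/(t(t-s))} (S_t - S_s)|,
# with partial sums S_k = X_1 + ... + X_k, so storing {S_k} suffices.
xs = [0.3, -1.2, 0.7, 2.5, 1.1, -0.4]
S = [0.0]
for x in xs:
    S.append(S[-1] + x)

t = len(xs)
for s in range(1, t):
    direct = math.sqrt(s * (t - s) / t) * abs(S[s] / s
                                              - (S[t] - S[s]) / (t - s))
    via_prefix = abs(math.sqrt((t - s) / (s * t)) * S[s]
                     - math.sqrt(s / (t * (t - s))) * (S[t] - S[s]))
    assert abs(direct - via_prefix) < 1e-12  # the two forms agree
```

The second form makes the O(1) update per new observation explicit: only the running sums need to be maintained.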
One can also apply different variants in order to further ease the computational complexity.Here we discuss three alternatives.
• Instead of calculating the cumulative sum statistic for every integer pair (s, t), 1 ≤ s < t, one could set a window width h and only consider integers that are multiples of h. The window width h can be regarded as the user's tolerance on accuracy.
• For each integer t ≥ 2, instead of maximising the cumulative sum statistic over all integers s ∈ [1, t), one could calculate only D_{t−N,t}, for a fixed integer N ≥ 1 and t > N. In Algorithm 1, we then only need to check whether D_{t−N,t} exceeds the pre-specified threshold. The computational complexity of this alternative is of order O(t) and the storage cost is of order O(1). The caveat of this alternative is that one needs to carefully tune N, which is essentially the order of the detection delay.
• A final note: Algorithm 2 has a computational cost of order O(t) when proceeding to time point t.
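The fixed-lag alternative in the second bullet above can be sketched as follows, with the same illustrative statistic and thresholds as before; the lag N, the constant `c1` and the buffer layout are ours, not the paper's.

```python
import math
from collections import deque

def detect_windowed(stream, sigma, alpha, N, c1=1.0):
    """Sketch of the fixed-lag variant: at each time t > N only D_{t-N,t}
    is evaluated, using a length-N buffer and two running sums, so both
    storage and per-step cost are O(1).  N must be tuned to be at least
    of the order of the detection delay."""
    window = deque()
    win_sum = 0.0   # sum of the last (at most) N observations
    total = 0.0     # sum of all observations so far
    for t, x in enumerate(stream, start=1):
        window.append(x)
        win_sum += x
        total += x
        if len(window) > N:
            win_sum -= window.popleft()
        if t <= N:
            continue
        s = t - N
        left = (total - win_sum) / s   # mean of X_1 .. X_{t-N}
        right = win_sum / N            # mean of X_{t-N+1} .. X_t
        d = math.sqrt(s * N / t) * abs(left - right)
        if d > c1 * sigma * math.sqrt(math.log(t / alpha)):
            return t
    return math.inf

# Same toy shift as before (jump 2 at time 50); with lag N = 10 the
# change is declared a few observations later than under the full scan.
t_hat = detect_windowed([0.0] * 50 + [2.0] * 50,
                        sigma=1.0, alpha=0.05, N=10)
```

This illustrates the trade-off discussed above: constant memory in exchange for a delay governed by the choice of N.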

Multiple change points
It is natural to extend the at most one change point scenario to the multiple change points scenario: one keeps collecting data points and making decisions on whether a change point has occurred, and the procedure continues even after a change point is declared, until the experiment is terminated either by the experimenter or due to the lack of new data. In order to deal with this situation, we need a refined setup.
In Assumption 4, we define ∆ as the minimal spacing between two consecutive change points and κ as the minimal jump size.We are now ready to introduce a consistency concept.
There are two pieces of takeaway message from Definition 3. The first inequality in (9) ensures that there will be no false alarms, and the convergence to zero guarantees that the ratio of the detection delay to the minimal spacing vanishes as the minimal spacing grows unbounded.
Algorithm 3 is a generalisation of Algorithm 1. The essence is to refresh the procedure every time a change point is declared. The guarantee of Algorithm 3 is provided in Theorem 6. In order to ensure that the outputs of Algorithm 3 are consistent in the sense detailed in Definition 3, we need a slightly stronger version of Assumption 3.
Assumption 5. Assume that for any α ∈ (0, 1) and ξ > 0, there exists a sufficiently large absolute constant C_SNR > 0 such that ∆κ²σ^{-2} ≥ C_SNR log^{1+ξ}(∆/α).
The only difference between Assumptions 3 and 5 is the quantity ξ, whose role will be explained after Theorem 6.
Theorem 6. For any α ∈ (0, 1), let Ĉ be the output of Algorithm 3 with inputs satisfying the following.
• The cumulative sum statistic D_{e,s,t} is defined in Definition 2.
where C_d is an absolute constant satisfying 4C_4²C_SNR/(C_SNR − 4C_4²) < C_d < C_SNR. The proof of Theorem 6 is in Appendix B, and the condition on C_4 is in (15). The idea of the proof is to show, for any k ≥ 1, that the k-th change point is detected before the (k+1)-th occurs. Once a change point is declared, the procedure is refreshed at the latest change point estimator.
Due to this and Assumption 5, we can then detect η_{k+1} with the desired delay, which shows the consistency defined in Definition 3. As we can now see, the role of the quantity ξ > 0 introduced in Assumption 5 is to ensure a vanishing rate of the ratio of the localisation error to the minimal spacing.
As for the computational cost, since we refresh the whole procedure whenever a change point is declared, the computational cost of Algorithm 3 is the number of declared change points multiplied by the computational cost of Algorithm 1.
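The refresh-on-declaration scheme can be sketched as follows, again with the illustrative statistic and thresholds from the earlier discussion; Algorithm 3's exact inputs are not reproduced in this excerpt, and `c1` stands in for the theoretical constant.

```python
import math

def detect_multiple(stream, sigma, alpha, c1=1.0):
    """Sketch of the multiple change point procedure: run the single
    change point scan and refresh it from scratch every time a change
    point is declared, recording the declared (global) times."""
    estimates = []
    prefix = [0.0]                 # prefix sums since the last refresh
    for i, x in enumerate(stream, start=1):
        prefix.append(prefix[-1] + x)
        t = len(prefix) - 1        # time elapsed since the last refresh
        if t < 2:
            continue
        b_t = c1 * sigma * math.sqrt(math.log(t / alpha))
        declared = any(
            math.sqrt(s * (t - s) / t)
            * abs(prefix[s] / s - (prefix[t] - prefix[s]) / (t - s)) > b_t
            for s in range(1, t))
        if declared:
            estimates.append(i)    # global time of the declaration
            prefix = [0.0]         # refresh the procedure
    return estimates

# Two mean shifts (up at time 50, back down at time 100); each one is
# declared two observations after it occurs.
declared = detect_multiple([0.0] * 50 + [2.0] * 50 + [0.0] * 50,
                           sigma=1.0, alpha=0.05)
```

The total cost is the per-segment cost of the single change point scan multiplied by the number of declared change points, matching the remark above.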

Optimality results in existing literature
In the existing literature, the optimality of detection procedures has been studied extensively. Lorden (1971) adopted the results developed in Wald (2004) and showed that, among all the estimators with average run length at least γ > 0, the expected detection delay is, in our notation, at least of order
log(γ)/KL(F_0, F_1). (10)
Based on (10), the results we have shown in Section 2 are optimal, save for logarithmic factors. To see this, we use Theorem 1 in Lai (1998) as an illustration. Translated into our notation, it reads that, as γ → ∞, the minimax expected detection delay is lower bounded by
{1 + o(1)} log(γ)/KL(F_0, F_1). (11)
Letting γ = 1/α, the order of the minimax lower bound on the detection delay is κ^{-2}σ² log(1/α), which is exactly the rate we obtained in Theorem 1.

Optimal detection delay and phase transition
Having established these connections, we cannot yet claim the optimality of our methods, since in all of the previous work the lower bound is established for two fixed distributions, only allowing the change point location to vary. In this paper, we are concerned with lower bounds with respect to the signal-to-noise ratio κ²∆/σ².
The following proposition and its proof are adaptations of Theorem 2 and its proof in Lai (1998).
Proposition 7. Assume that {X_i}_{i=1,2,...} is a sequence of independent Gaussian random variables satisfying E(X_i) = f_i, var(X_i) = σ² and Assumption 2. Denote the joint distribution of {X_i}_{i=1,2,...} by P_{κ,σ,∆}. Consider the class of estimators D defined by the stated pr_∞-control of false alarms, where pr_∞ indicates ∆ = ∞. Then, for α small enough, the lower bound (12) holds.
Remark 1. Proposition 7 holds for small enough α. To be specific, in the proof of Proposition 7, we require that α + 2α^{1/4} < 1/2, together with a further upper bound on α. All the constants in (12) can be improved by more refined analysis.
It follows from Proposition 7 that our results are nearly optimal, off by a factor of log(∆). The optimality can be seen in two respects.
Firstly, the upper bound on the detection delay derived in Theorem 1, σ²κ^{-2} log(∆/α), is nearly optimal, off by a factor of log(∆). Note, however, that the result in (11) concerns an expectation, while what we provide in Theorem 1 is a high probability result.
Secondly, the signal-to-noise ratio condition imposed in Assumption 5 is also nearly optimal, off by a logarithmic factor. To see this, assume that the signal-to-noise ratio condition fails, as in (13). Under (13), the claim (11) implies that the detection delay is larger than ∆, and the ratio of the detection delay to the minimal spacing does not vanish as ∆ → ∞ or α → 0. Any such estimator is therefore inconsistent in the sense of Definition 3. This reveals a phase transition in the parameter space. To be specific:
• in the low signal-to-noise ratio regime, i.e. κ²∆/σ² ≲ log(1/α), no algorithm is guaranteed to produce consistent change point estimators;
• in the high signal-to-noise ratio regime, i.e. κ²∆/σ² ≳ log^{1+ξ}(∆/α), Algorithm 1 is able to produce consistent change point estimators with nearly minimax optimal detection delay.

Connections with offline change point detection problems
A closely related area is offline change point detection, where one has data {X_i}_{i=1}^T and seeks change point estimators {η̂_k} ⊂ {1, ..., T}. Online and offline change point analysis share many similarities. The offline change point results listed below can be found in Wang et al. (2018).
The signal-to-noise ratio and the phase transition. Let ∆ be the minimal spacing between two consecutive change points in the offline setting. We remark that, in both the online and offline problems, the signal-to-noise ratio takes the same form κ²∆/σ². In both problems, the parameter space can be partitioned into feasible and infeasible regimes by this signal-to-noise ratio. In Wang et al. (2018), it is shown that, in the univariate offline change point detection problem, both the lower and upper bounds on the signal-to-noise ratio are of order log(T). This sheds some light on the logarithmic gap between the lower bound we established in Proposition 7 and (13) and the upper bound we assumed in Assumption 3: the gap is due to a loose lower bound rather than the upper bound.
The estimation errors. In both problems, the estimation error has a minimax lower bound of order σ²/κ², and the upper bounds we achieve are in both cases nearly optimal, off by a logarithmic factor.
In addition to the similarities, there are also some noteworthy differences.
The asymptotic regimes. In the offline setting, one lets the total number of time points T diverge, and all other model parameters are functions of T. In the online setting, there is no such total number of time points, and the asymptotic regime is to let ∆ diverge. This is also reflected in the logarithmic factors in the signal-to-noise ratio conditions.
When deriving the estimation error in the offline setting, since one has collected all the data in advance, the signal-to-noise ratio is lower bounded in terms of log(T) and, as a result, the estimation error depends only on the model parameters. This is not the case in the online setting, where the total number of data points examined is itself random. In this case, additional information is needed: in Theorem 1, we choose to control the upper bound α on the Type-I error, and as a result the estimation error, i.e. the detection delay, is also a function of α.

Discussions
In this paper, we have conducted a thorough and systematic study of various aspects of the online change point detection problem. Although the problem studied in this paper is univariate, the arguments are all in a high-dimensional fashion, i.e. all the results are fixed sample results and all parameters are allowed to vary. The framework established in this paper can be used to study various high-dimensional problems. This is left as future work.
In order to control the fluctuations, we have adopted two different styles of proof in this paper. Theorems 1 and 6 are based on the results in Lemmas 8 and 9, which adopt peeling-type arguments; the propositions in Section 2.4 are proved purely with union bound arguments. One might think that, in sequential analysis, peeling arguments would yield upper bounds on the fluctuations of iterated logarithmic order. However, this does not seem to be true for the cumulative sum statistics: the problematic part can be nailed down to a term that cannot be written as a martingale, even in the simplest case. Due to this, peeling arguments and union bound arguments both result in logarithmic upper bounds on the fluctuations, and it is in fact much easier to prove the results purely with union bound arguments; one can see this by comparing the proofs of Lemmas 8 and 9 with those of the propositions in Section 2.4. In this paper, we nevertheless made the effort to use peeling arguments for the main results. The reason is that using only union bound arguments yields a larger fluctuation bound b_t in Theorem 1 than the one the peeling arguments achieve. As for Theorem 6, using only union bound arguments likewise leads to a larger fluctuation upper bound, while we obtain σc^{-1/2}{3 log(t) + 2 loglog(t) + 4 log(2) − loglog(2) − log(α)}^{1/2} by using a peeling argument.

Appendices
The concentration inequalities lemmas are in Appendix A and the proofs of the main theorems and propositions are in Appendix B.

A Concentration inequalities
Lemma 8. For any α > 0, with probability at least 1 − α, the fluctuations D_{s,t} − E(D_{s,t}) are simultaneously bounded by C_1 σ log^{1/2}(t/α) over all 1 ≤ s < t, where C_1 > 0 is an absolute constant.
Lemma 9. For any α > 0, an analogous simultaneous fluctuation bound holds for the statistics D_{e,s,t} used in Algorithm 2, where C_3 > 0 is an absolute constant.
For any e > 0, let the truncation level be defined with the constant c > 0 chosen such that, for any ζ > 0, pr{|W| ≥ ζ} < 2 exp(−cζ²/σ²), where W is a mean zero sub-Gaussian random variable with ‖W‖_{ψ_2} ≤ σ. Due to the sub-Gaussianity, the stated tail bounds follow. For simplicity, we choose C_4 to satisfy (15). This completes the proof.

B Proofs of theorems and propositions
Proof of Theorem 1.
Step 1. Define the event It follows from Lemma 8 that pr{B} > 1 − α.
On the event B, for any s, t ∈ N with 1 ≤ s < t, it holds that $D_{s,t} - \tilde{D}_{s,t} < b_t$, where $\tilde{D}_{s,t}$ denotes the population version of the CUSUM statistic, which implies that Step 2. If t ≤ ∆, then $\tilde{D}_{s,t} = 0$ for all s ∈ [1, t). It follows from (16) that, on the event B, t > ∆ and d > 0.
Step 3. Now we consider t > ∆.For any t > ∆, if there exists s ∈ We then have d ≤ t − ∆.
When ∆ ≤ s < t, we have and on the event B, Then we denote
Step 4. Let m = t − ∆. We have d ≤ m and $\kappa\Delta\{(t-s)/(ts)\}^{1/2}$ It suffices to find the smallest integer m such that max The above is equivalent to This is equivalent to finding the smallest integer m such that We now show that, with the absolute constant $C_d$, the choice $m = \lceil C_d\,\sigma^2\kappa^{-2}\log(\Delta/\alpha)\rceil$ satisfies (17). The set defined by (18) is nonempty provided that It follows from Assumption 3 that
$$\frac{m\Delta\kappa^2}{4C_1^2\sigma^2} - \log\{(m+\Delta)/\alpha\} \ge 2\Delta\log(\Delta/\alpha) \ge \Delta\log\{(m+\Delta)/\alpha\},$$
which completes the proof.
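To illustrate the $\sigma^2\kappa^{-2}\log(\Delta/\alpha)$ delay scaling established above, the following sketch estimates the average detection delay of the CUSUM rule after a mean shift of size κ at time ∆, for two signal sizes. The threshold constant 2.5 and the Gaussian model are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, alpha, Delta = 1.0, 0.05, 100

def avg_detection_delay(kappa, n_rep=100):
    """Average delay of a CUSUM threshold rule for a mean shift of size kappa at Delta."""
    delays = []
    for _ in range(n_rep):
        T = Delta + 400
        x = np.concatenate([rng.normal(0.0, sigma, Delta),
                            rng.normal(kappa, sigma, T - Delta)])
        csum = np.cumsum(x)
        for t in range(2, T + 1):
            s = np.arange(1, t)
            d = np.sqrt(s * (t - s) / t) * np.abs(
                csum[s - 1] / s - (csum[t - 1] - csum[s - 1]) / (t - s))
            if d.max() > 2.5 * sigma * np.sqrt(np.log(t / alpha)):  # assumed threshold
                delays.append(t - Delta)
                break
    return float(np.mean(delays))

avg_delay = {k: avg_detection_delay(k) for k in (1.0, 2.0)}
print(avg_delay)
```

Doubling κ shrinks the average delay dramatically, consistent with the $\kappa^{-2}$ factor in the bound.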

Proof of Proposition 2. It follows from Step 1 in the proof of Theorem 1 that, on the event B, it holds that $D_{s,t} < \tilde{D}_{s,t} + C_1\sigma\log^{1/2}(t/\alpha)$ for all 1 ≤ s < t, where $\tilde{D}_{s,t}$ denotes the population version of the CUSUM statistic.

It follows from Step 2 in the proof of Theorem 1 that we only need to consider t > ∆. This leaves two cases, s ≤ ∆ and s ≥ ∆. In both cases it suffices to deal with s = ∆, so we only present the case s ≥ ∆ here. When ∆ ≤ s < t, we have $\tilde{D}_{s,t} = \kappa\Delta\{(t-s)/(ts)\}^{1/2}$, and therefore on the event B we have that where the last inequality follows from (4). This completes the proof.
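The reduction to s = ∆ rests on the fact that the population CUSUM of a single mean change is maximized exactly at the change point. A quick numeric check (with illustrative values of κ, ∆, and t; the piecewise formula below is the standard population CUSUM of a mean shift) confirms this.

```python
import numpy as np

kappa, Delta, t = 1.5, 40, 100
s = np.arange(1, t)

# Population CUSUM of a mean change of size kappa at time Delta:
#   s <= Delta: kappa*(t-Delta)*sqrt(s/(t(t-s)))  (increasing in s)
#   s >= Delta: kappa*Delta*sqrt((t-s)/(ts))      (decreasing in s)
pop = np.where(
    s <= Delta,
    kappa * (t - Delta) * np.sqrt(s / (t * (t - s))),
    kappa * Delta * np.sqrt((t - s) / (t * s)),
)
print("argmax s =", int(s[np.argmax(pop)]))  # maximized exactly at the change point
```

Since the curve increases up to s = ∆ and decreases afterwards, checking s = ∆ alone suffices.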
Proof of Theorem 3. The result (5) and the first part of (6) are immediate consequences of Steps 1 and 2 in the proof of Theorem 1.
As for the second part of (6), we have that where the second inequality follows from the proof of Lemma 8. This completes the proof.
Proof of Proposition 4.
Step 1. Define the event We have that 1 − pr{C} where c > 0 is such that $\mathrm{pr}\{|W| \ge \zeta\} < 2\exp(-c\zeta^2/\sigma^2)$ for any ζ > 0, W is a mean-zero sub-Gaussian random variable with $\|W\|_{\psi_2} \le \sigma$, and Therefore we have that
Step 2. We have that We then have d ≤ t − ∆. When ∆ < s < t, we have and on the event C, Then we denote
Step 3. Let m = t − ∆. We have that d ≤ m and $\kappa\Delta\{(t-s)/(ts)\}^{1/2}$ It suffices to find the smallest integer m such that max The above is equivalent to This is equivalent to finding the smallest integer m such that We now show that, with an appropriate absolute constant, this choice of m satisfies (20). It follows from (7) that m < ∆. We then have which completes the proof.
Proof of Theorem 6.
Step 1. On the event E, for any 0 ≤ e < s < t, it holds that In addition, due to Assumption 5, we have that Then it suffices to show that: (i) for any refresh starting point e of the algorithm and any interval (e, t] containing no true change points, on the event E there is no detected change point; and (ii) on the event E, each $\eta_k$ is detected with delay upper bounded by $C_d\sigma^2\log(\Delta/\alpha)\kappa_k^{-2}$.
Step 2. As for (i), it holds automatically by the definition of the event E. Claim (i) implies that $t_k > \eta_k$.
Step 3. As for (ii), we prove it by induction. When k = 0, we have $t_k = \eta_k = 0$, and $\eta_1 - t_0 \ge \Delta \ge 3\Delta/4$. It follows from arguments identical to those in the proof of Theorem 1 that Due to (23), we have $\eta_2 - t_1 \ge 3\Delta/4$, and then, due to Algorithm 3, the procedure restarts by setting $e = t_1$. For a general k ≥ 1, if $\eta_k - t_{k-1} \ge 3\Delta/4$, then it follows from arguments identical to those in the proof of Theorem 1 that which completes the proof.
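The induction above rests on a restart strategy: after each declaration, monitoring restarts from the declared time, so every change point is handled as in the single change point analysis. The following is a minimal sketch of such a restart-style procedure; the threshold constant, the data-generating choices, and the scan rule are illustrative assumptions, not the exact Algorithm 3.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, alpha = 1.0, 0.05
# Two true change points at 150 and 300; means 0 -> 3 -> 0 (illustrative magnitudes).
x = np.concatenate([rng.normal(0.0, sigma, 150),
                    rng.normal(3.0, sigma, 150),
                    rng.normal(0.0, sigma, 150)])

def online_detect(x, e):
    """Scan forward from restart point e; return the declared time, or None."""
    csum = np.concatenate([[0.0], np.cumsum(x[e:])])
    for t in range(2, len(x) - e + 1):
        s = np.arange(1, t)
        d = np.sqrt(s * (t - s) / t) * np.abs(
            csum[s] / s - (csum[t] - csum[s]) / (t - s))
        if d.max() > 2.5 * sigma * np.sqrt(np.log(t / alpha)):  # illustrative threshold
            return e + t
    return None

declared, e = [], 0
while (t_hat := online_detect(x, e)) is not None:
    declared.append(t_hat)
    e = t_hat  # restart monitoring from the declared time
print("declared change points:", declared)
```

Each declared time slightly exceeds the corresponding true change point, and the final scan over the last stationary stretch declares nothing, matching claims (i) and (ii).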
Proof of Proposition 7.
Step 1. For any n, let $P_n$ be the restriction of a distribution P to $\mathcal{F}_n$, the σ-field generated by the observations $\{X_i\}_{i=1}^n$. For any ν ≥ 1 and n ≥ ν, we have that Moreover, for any n ≥ ∆, it holds that For any ν ≥ 1, define the event Then we have where the first two inequalities follow from the definition of $E_\nu$, and the last inequality follows from the definition of D.
Step 2. For any ν ≥ 1 and T ∈ D, since $\{T \ge \nu\} \in \mathcal{F}_{\nu-1}$, we have that pr where the fourth inequality follows from Hoeffding's inequality and a union bound argument, and the last inequality holds for α small enough that Since the upper bound is independent of ν, it holds that
Step 3. We now have $\frac{\sigma^2}{4\kappa^2}\log(1/\alpha)$, where the first inequality is due to Markov's inequality, the second is due to (26) and the definition of the class D, and the last holds when $\alpha + 2\alpha^{1/4} < 1/2$.

$|D_{e,s,t} - \tilde{D}_{e,s,t}| < b_{e,t}$, which implies that $\tilde{D}_{e,s,t} + b_{e,t} > D_{e,s,t} > \tilde{D}_{e,s,t} - b_{e,t}$, where $\tilde{D}_{e,s,t}$ denotes the population version of the CUSUM statistic.
where $P_{\kappa,\sigma,\infty}$ denotes the joint distribution under which there is no change point, and