Oracle Estimation of a Change Point in High Dimensional Quantile Regression

In this paper, we consider a high-dimensional quantile regression model where the sparsity structure may differ between two sub-populations. We develop $\ell_1$-penalized estimators of both regression coefficients and the threshold parameter. Our penalized estimators not only select covariates but also discriminate between a model with homogeneous sparsity and a model with a change point. As a result, it is not necessary to know or pretest whether the change point is present, or where it occurs. Our estimator of the change point achieves an oracle property in the sense that its asymptotic distribution is the same as if the unknown active sets of regression coefficients were known. Importantly, we establish this oracle property without a perfect covariate selection, thereby avoiding the need for the minimum level condition on the signals of active covariates. Dealing with high-dimensional quantile regression with an unknown change point calls for a new proof technique since the quantile loss function is non-smooth and furthermore the corresponding objective function is non-convex with respect to the change point. The technique developed in this paper is applicable to a general M-estimation framework with a change point, which may be of independent interest. The proposed methods are then illustrated via Monte Carlo experiments and an application to tipping in the dynamics of racial segregation.


Introduction
In this paper, we consider a high-dimensional quantile regression model where the sparsity structure (e.g., identities and effects of contributing regressors) may differ between two subpopulations, thereby allowing for a possible change point in the model. Let $Y \in \mathbb{R}$ be a response variable, $Q \in \mathbb{R}$ be a scalar random variable that determines a possible change point, and $X \in \mathbb{R}^p$ be a $p$-dimensional vector of covariates. Here, $Q$ can be a component of $X$, and $p$ is potentially much larger than the sample size $n$. Specifically, high-dimensional quantile regression with a change point is modelled as follows:
$$Y = X^T \beta_0 + X^T \delta_0 1\{Q > \tau_0\} + U, \qquad (1.1)$$
where $(\beta_0^T, \delta_0^T, \tau_0)$ is a vector of unknown parameters and the regression error $U$ satisfies $P(U \le 0 \mid X, Q) = \gamma$ for some known $\gamma \in (0, 1)$. Unlike mean regression, quantile regression analyzes the effects of active regressors on different parts of the conditional distribution of a response variable. Therefore, it allows the sparsity patterns to differ at different quantiles and also handles heterogeneity due to either heteroskedastic variance or other forms of non-location-scale covariate effects. By taking into account a possible change point in the model, we provide a more realistic picture of the sparsity patterns. For instance, when analyzing high-dimensional gene expression data, the identities of contributing genes may depend on environmental or demographic variables (e.g., temperature exposure, age, or weight).
In this paper, we consider estimating the regression coefficients $\alpha_0 \equiv (\beta_0^T, \delta_0^T)^T$ as well as the threshold parameter $\tau_0$, and selecting the contributing regressors, based on $\ell_1$-penalized estimators. One of the strengths of our proposed procedure is that it does not require knowing or pretesting whether $\delta_0 = 0$, that is, whether the population's sparsity structure and covariate effects are invariant. In other words, we do not need to know whether the threshold $\tau_0$ is present in the model.
For a sparse vector $v \in \mathbb{R}^p$, we denote the active set of $v$ by $J(v) \equiv \{j : v_j \neq 0\}$. One of the main contributions of this paper is that our proposed estimator of $\tau_0$ achieves an oracle property in the sense that its asymptotic distribution is the same as if the unknown active sets $J(\beta_0)$ and $J(\delta_0)$ were known. Importantly, we establish this oracle property without assuming perfect covariate selection, thereby avoiding the need for a minimum level condition on the signals of active covariates.
The proposed estimation method in this paper consists of three main steps: in the first step, we obtain initial estimators of $\alpha_0$ and $\tau_0$, whose rates of convergence may be suboptimal; in the second step, we re-estimate $\tau_0$ to obtain an improved estimator that converges at the rate $O_P(n^{-1})$ and achieves the oracle property mentioned above; in the third step, using the second-step estimator of $\tau_0$, we update the estimator of $\alpha_0$.
In particular, we propose two alternative estimators of α 0 , depending on the purpose of estimation (prediction vs. variable selection).
The most closely related work is Lee et al. (2016). However, there are several important differences: first, Lee et al. (2016) consider a high-dimensional mean regression model with a homoskedastic normal error and with deterministic covariates; second, their method consists of one-step least squares estimation with an $\ell_1$ penalty; third, they derive non-asymptotic oracle inequalities similar to those in Bickel et al. (2009) but do not provide any distributional result for the estimator of the change point. Compared to Lee et al. (2016), dealing with high-dimensional quantile regression with an unknown change point calls for a new proof technique since the quantile loss function is different from the least squares objective function and is non-smooth. In addition, we allow for heteroskedastic and non-normal regression errors and stochastic covariates. These changes, coupled with the fact that the quantile regression objective function is non-convex with respect to the threshold parameter $\tau_0$, raise new challenges. It requires careful derivation and multiple estimation steps to establish the oracle property for the estimator of $\tau_0$ and also to obtain desirable properties of the estimator of $\alpha_0$. The technique developed in this paper is applicable to a general M-estimation framework with a change point, which may be of independent interest.
One particular application of (1.1) comes from tipping in racial segregation in the social sciences (see, e.g., Card et al., 2008). The empirical question addressed in Card et al. (2008) is whether, and the extent to which, a neighborhood's white population decreases substantially when the minority share in the area exceeds a tipping point (or change point). In Section 7, we use the US Census tract dataset constructed by Card et al. (2008) and confirm that tipping exists in the neighborhoods of Chicago.
The remainder of the paper is organized as follows. Section 2 provides an informal description of our estimation methodology. In Section 3, we derive the consistency of the estimators in terms of the excess risk. Further asymptotic properties of the proposed estimators are given in Sections 4 and 5. In Section 6, we present the results of extensive Monte Carlo experiments. Section 7 illustrates the usefulness of our method by applying it to tipping in racial segregation. Section 8 concludes. Appendix A describes in detail how to construct the confidence interval for $\tau_0$, and Appendix B provides the set of regularity assumptions used to derive the asymptotic properties of the proposed estimators in Sections 4 and 5. The online supplements comprise six appendices containing all the proofs as well as additional theoretical and numerical results that are omitted for brevity.
Notation. Throughout the paper, we use $|v|_q$ to denote the $\ell_q$ norm of a vector $v$ for $q = 0, 1, 2$, and $|v|_\infty$ to denote the sup norm. For two sequences of positive real numbers $a_n$ and $b_n$, we write $a_n \ll b_n$, and equivalently $b_n \gg a_n$, if $a_n = o(b_n)$. If there exists a positive finite constant $c$ such that $a_n = c \cdot b_n$, then we write $a_n \propto b_n$. Let $\lambda_{\min}(A)$ denote the minimum eigenvalue of a matrix $A$. We use w.p.a.1 to mean "with probability approaching one." We write $\theta_0 \equiv \beta_0 + \delta_0$. For a $2p$-dimensional vector $\alpha$, let $\alpha_J$ and $\alpha_{J^c}$ denote its subvectors formed by the indices in $J(\alpha_0)$ and $\{1, \dots, 2p\} \setminus J(\alpha_0)$, respectively. Likewise, let $X_J(\tau)$ denote the subvector of $X(\tau) \equiv (X^T, X^T 1\{Q > \tau\})^T$ whose indices are in $J(\alpha_0)$. The true parameter vectors $\beta_0$, $\delta_0$ and $\theta_0$ (except $\tau_0$) are implicitly indexed by the sample size $n$, and we allow the dimensions of $J(\beta_0)$, $J(\delta_0)$ and $J(\theta_0)$ to grow to infinity as $n \to \infty$. For simplicity, we omit their dependence on $n$ in our notation. We also use the terms 'change point' and 'threshold' interchangeably throughout the paper.
Estimation Methodology

The true parameters in (1.1) can be characterized as minimizers of the expected check loss: writing $X(\tau) \equiv (X^T, X^T 1\{Q > \tau\})^T$ and $\alpha \equiv (\beta^T, \delta^T)^T$,
$$(\alpha_0, \tau_0) \in \operatorname*{argmin}_{\alpha \in \mathcal{A},\, \tau \in \mathcal{T}} E\left[\rho_\gamma\big(Y - X(\tau)^T \alpha\big)\right], \qquad (2.1)$$
where $\rho_\gamma(t) \equiv t\,(\gamma - 1\{t \le 0\})$ denotes the check function. By construction, $\tau_0$ is not unique when $\delta_0 = 0$. However, if $\delta_0 = 0$, then the model reduces to the linear quantile regression model, in which $\beta_0$ is identifiable under standard assumptions. In Appendix C.1, we provide sufficient conditions under which $\alpha_0$ and $\tau_0$ are identified when $\delta_0 \neq 0$.
Suppose we observe independent and identically distributed samples $\{Y_i, X_i, Q_i\}_{i \le n}$. Let $X_i(\tau)$ and $X_{ij}(\tau)$ denote the $i$-th realization of $X(\tau)$ and the $j$-th element of $X_i(\tau)$, respectively, $i = 1, \dots, n$ and $j = 1, \dots, 2p$, so that $X_{ij}(\tau) \equiv X_{ij}$ if $j \le p$ and $X_{ij}(\tau) \equiv X_{i,j-p} 1\{Q_i > \tau\}$ otherwise. Define the empirical check loss
$$R_n(\alpha, \tau) \equiv \frac{1}{n} \sum_{i=1}^n \rho_\gamma\big(Y_i - X_i(\tau)^T \alpha\big).$$
In addition, let $D_j(\tau) \equiv \{n^{-1} \sum_{i=1}^n X_{ij}(\tau)^2\}^{1/2}$, $j = 1, \dots, 2p$. We now describe the main steps of our $\ell_1$-penalized estimation method. For some tuning parameter $\kappa_n$, define:

Step 1: $(\hat\alpha, \hat\tau) = \operatorname*{argmin}_{\alpha \in \mathcal{A},\, \tau \in \mathcal{T}} R_n(\alpha, \tau) + \kappa_n \sum_{j=1}^{2p} D_j(\tau) |\alpha_j|. \qquad (2.2)$

This step produces an initial estimator $(\hat\alpha, \hat\tau)$. The tuning parameter $\kappa_n$ is required to satisfy
$$\kappa_n \propto (\log p)(\log n) \sqrt{\frac{\log p}{n}}. \qquad (2.3)$$
Note that we take a $\kappa_n$ that converges to zero at a rate slower than the standard $(\log p / n)^{1/2}$ rate in the literature. This modified rate of $\kappa_n$ is useful in our context to deal with an unknown $\tau_0$. A data-dependent method of choosing $\kappa_n$ is discussed in Section 2.3.
Remark 2.2. The computational cost of Step 1 is the number of grid points for $\tau$ multiplied by the computational time of estimating a linear quantile model with an $\ell_1$ penalty, which is solvable in polynomial time (see, e.g., Belloni and Chernozhukov (2011) and Koenker and Mizera (2014), among others).
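The following R sketch illustrates Step 1: a grid search over $\tau$ combined with a weighted $\ell_1$-penalized quantile regression at each grid point, using rq.fit.lasso from the 'quantreg' package mentioned in Section 2.3. The function names (step1, check_loss) and the scaling of the penalty passed to rq.fit.lasso are illustrative assumptions, not the authors' code.

```r
library(quantreg)

# Empirical check loss R_n(alpha, tau)
check_loss <- function(u, gamma) mean(u * (gamma - (u < 0)))

step1 <- function(Y, X, Q, gamma, kappa_n, tau_grid) {
  n <- length(Y)
  best <- list(obj = Inf)
  for (tau in tau_grid) {
    Xtau <- cbind(X, X * (Q > tau))        # X(tau) = (X', X' 1{Q > tau})'
    Dj   <- sqrt(colMeans(Xtau^2))         # data-dependent weights D_j(tau)
    # rq.fit.lasso penalizes sum_j lambda_j |alpha_j| on the sum-of-losses
    # scale, so lambda_j = n * kappa_n * D_j matches (2.2) up to scaling.
    fit   <- rq.fit.lasso(Xtau, Y, tau = gamma, lambda = n * kappa_n * Dj)
    alpha <- fit$coefficients
    obj   <- check_loss(Y - Xtau %*% alpha, gamma) + kappa_n * sum(Dj * abs(alpha))
    if (obj < best$obj) best <- list(obj = obj, alpha = alpha, tau = tau)
  }
  best  # list with fields obj, alpha (2p-vector), tau
}
```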
The main purpose of the first step is to obtain an initial estimator of α 0 . The achieved convergence rates of this step might be suboptimal due to the uniform control of the score functions over the space T of the unknown τ 0 .
In the second step, we introduce our improved estimator of the change point $\tau_0$. It does not use a penalty term, while it plugs in the first-step estimator of $\alpha_0$. Define:

Step 2: $\check\tau = \operatorname*{argmin}_{\tau \in \mathcal{T}} R_n(\hat\alpha, \tau), \qquad (2.4)$

where $\hat\alpha$ is the first-step estimator of $\alpha_0$ in (2.2). In Section 4, we show that when $\tau_0$ is identifiable, $\check\tau$ is consistent for $\tau_0$ at the rate $n^{-1}$. Furthermore, we obtain the limiting distribution of $n(\check\tau - \tau_0)$ and establish conditions under which its asymptotic distribution is the same as if the true $\alpha_0$ were known, without requiring perfect model selection on $\alpha_0$ or a minimum signal condition on the nonzero elements of $\alpha_0$.
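A minimal sketch of Step 2, reusing check_loss from the Step 1 sketch; tau_grid and the tie-breaking rule (taking the smallest minimizer) are illustrative assumptions.

```r
step2 <- function(Y, X, Q, gamma, alpha_hat, tau_grid) {
  p <- ncol(X)
  beta_hat  <- alpha_hat[1:p]
  delta_hat <- alpha_hat[(p + 1):(2 * p)]
  Rn <- vapply(tau_grid, function(tau) {
    u <- Y - X %*% beta_hat - (X %*% delta_hat) * (Q > tau)
    check_loss(u, gamma)
  }, numeric(1))
  tau_grid[which.min(Rn)]  # which.min returns the smallest minimizer
}
```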
In the third step, we update the Lasso estimator of α 0 using a different value of the penalization tuning parameter and the second step estimator of τ 0 . In particular, we recommend two different estimators of α 0 : one for the prediction and the other for the variable selection, serving for different purposes of practitioners. For two different tuning parameters ω n and µ n whose rates will be specified later by (2.7) and (4.1), define: Step 3a (for prediction):

5)
Step 3b (for variable selection): where τ is the second step estimator of τ 0 in (2.4), and the "signal-adaptive" weight w j in (2.6), motivated by the local linear approximation of the SCAD penalties (Fan and Li, 2001;Zou and Li, 2008), is calculated based on the Step 3a estimator α from (2.5): Here a > 1 is some prescribed constant, and a = 3.7 is often used in the literature. We take this as our choice of a.
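A sketch of Step 3 under the same assumptions as the Step 1 sketch. scad_weight implements the standard SCAD-derivative weight from the local linear approximation (Fan and Li, 2001; Zou and Li, 2008) with a = 3.7; the authors' exact weight may differ in minor details.

```r
# SCAD-derivative weight at penalty level mu, divided by mu
scad_weight <- function(b, mu, a = 3.7) {
  (abs(b) <= mu) + (pmax(a * mu - abs(b), 0) / ((a - 1) * mu)) * (abs(b) > mu)
}

step3 <- function(Y, X, Q, gamma, tau_check, omega_n, mu_n) {
  n    <- length(Y)
  Xtau <- cbind(X, X * (Q > tau_check))
  Dj   <- sqrt(colMeans(Xtau^2))
  # Step 3a: prediction-oriented lasso at the estimated change point
  alpha_3a <- rq.fit.lasso(Xtau, Y, tau = gamma,
                           lambda = n * omega_n * Dj)$coefficients
  # Step 3b: selection-oriented estimator with signal-adaptive weights
  w <- scad_weight(alpha_3a, mu_n)
  alpha_3b <- rq.fit.lasso(Xtau, Y, tau = gamma,
                           lambda = n * mu_n * w * Dj)$coefficients
  list(alpha_3a = alpha_3a, alpha_3b = alpha_3b)
}
```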
Remark 2.3. For $\check\alpha$ in (2.5), we set $\omega_n$ to converge to zero at the rate
$$\omega_n \propto \left(\frac{\log(p \vee n)}{n}\right)^{1/2}, \qquad (2.7)$$
which is a more standard rate than $\kappa_n$ in (2.3). Therefore, the estimator $\check\alpha$ converges in probability to $\alpha_0$ faster than $\hat\alpha$. In addition, $\mu_n$ in (2.6) is chosen to be slightly larger than $\omega_n$ for the purpose of variable selection. A data-dependent method of choosing $\omega_n$ as well as $\mu_n$ is discussed in Section 2.3. In Sections 4 and 5, we establish conditions under which $\check\alpha$ achieves the (minimax) optimal rate of convergence in probability for $\alpha_0$ regardless of the identifiability of $\tau_0$.
Remark 2.4. It is well known for linear models without an unknown $\tau_0$ (see, e.g., Bühlmann and van de Geer (2011)) that the Lasso estimator may not perform well for the purpose of variable selection. The estimator $\tilde\alpha$ defined in Step 3b uses an entry-adaptive weight $\tilde w_j$ that corrects the shrinkage bias, and it possesses the asymptotic-unbiasedness merits of the SCAD penalty. Therefore, we recommend $\check\alpha$ for prediction and $\tilde\alpha$ for variable selection.
Remark 2.5. Note that the objective function is non-convex with respect to $\tau$ in the first and second steps. However, the proposed estimators can be computed efficiently using existing algorithms; we describe the computational algorithms in Section 2.3.
Step 2 can be repeated using the updated estimator of $\alpha_0$ from Step 3, and Step 3 can then be iterated as well. This yields asymptotically equivalent estimators but may improve the finite-sample performance, especially when $p$ is very large.

Repeating Step 2 is useful in particular when $\hat\delta = 0$ in the first step. In this case, $R_n(\hat\alpha, \tau)$ does not depend on $\tau$, so there is no unique $\check\tau$ in Step 2; we then skip the second step by setting $\check\tau = \hat\tau$ and move to the third step directly (as sketched in the code below). If the preferred third-step estimator of $\delta_0$ (either $\check\delta$ or $\tilde\delta$, depending on the estimation purpose) is different from zero, we can go back to Step 2 and re-estimate $\tau_0$. If the third-step estimator of $\delta_0$ is also zero, then we conclude that there is no change point and disregard the first-step estimator $\hat\tau$, since $\tau_0$ is not identifiable in this case.
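The iteration logic just described can be sketched as follows, reusing the step2 and step3 functions from the earlier sketches; max_iter and the convergence rule are illustrative assumptions.

```r
iterate_steps <- function(Y, X, Q, gamma, fit1, tau_grid, omega_n, mu_n,
                          max_iter = 10) {
  p     <- ncol(X)
  alpha <- fit1$alpha
  tau   <- fit1$tau
  for (k in seq_len(max_iter)) {
    delta <- alpha[(p + 1):(2 * p)]
    if (all(delta == 0)) break             # no change point detected
    tau_new <- step2(Y, X, Q, gamma, alpha, tau_grid)
    alpha   <- step3(Y, X, Q, gamma, tau_new, omega_n, mu_n)$alpha_3b
    if (identical(tau_new, tau)) break     # change-point estimate stabilized
    tau <- tau_new
  }
  list(alpha = alpha, tau = tau)
}
```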

Comparison of Estimators in Step 3
Step 3 defines two estimators of $\alpha_0$. In this subsection, we briefly explain their major differences and purposes.
Step 3b is particularly useful when variable selection consistency is the main objective, yet it typically requires the minimum signal condition ($\min_{j: \alpha_{0j} \neq 0} |\alpha_{0j}|$ is well separated from zero). In contrast, Step 3a does not require the minimum signal condition and is recommended for prediction purposes. More specifically:

1. If the minimum signal condition (5.1) indeed holds, perfect variable selection (variable selection consistency) is possible. Indeed, thanks to the signal-adaptive weights, the Step 3b estimator introduces little shrinkage bias. As a result, we show in Theorem 4.5 that under very mild conditions, this estimator achieves variable selection consistency. In contrast, Step 3a does not use signal-adaptive weights. To achieve variable selection consistency, it has to rely on much stronger conditions on the design matrix (i.e., the irrepresentable condition of Zhao and Yu (2006)) so as to "balance out" the effects of the shrinkage bias, and it is less adaptive to correlated designs.
2. In the presence of the minimum signal condition, not only does Step 3b achieve variable selection consistency, it also attains a better rate of convergence than Step 3a (Theorem 4.5). The faster rate is built on the variable selection consistency and is again a consequence of the signal-adaptive weights. Intuitively, nonzero elements of $\alpha_0$ are easier to identify and estimate when the signal is strong. Such a phenomenon has been observed in the literature; see, e.g., Fan and Lv (2011) and many papers on variable selection using "folded-concave" penalizations.
3. In the absence of the minimum signal condition, neither method can achieve variable selection consistency. However, this is not required for prediction purposes. In this case, we recommend the Step 3a estimator, because it achieves a fast (minimax) rate of convergence (Theorem 4.4), which is useful for prediction.
4. Finally, we show in Theorem 4.7 that without the minimum signal condition, Step 3b, with its signal-adaptive weights, does not perform badly in the sense that it still delivers estimation and prediction consistency. However, its rate of convergence is slower than that of Step 3a.

Tuning parameter selection
In this subsection, we provide details on how to choose tuning parameters in applications.
Recall that our procedure involves three tuning parameters in the penalization: (1) $\kappa_n$ in Step 1 ought to dominate the score function uniformly over the range of $\tau$, and hence should be slightly larger than the others; (2) $\omega_n$ is used in Step 3a for prediction; and (3) $\mu_n$ in Step 3b for variable selection should be larger than $\omega_n$. Note that the tuning parameters in Steps 3a and 3b are similar to those in the existing literature since the change point $\tau_0$ has already been estimated by $\check\tau$.
We build on the data-dependent selection method of Belloni and Chernozhukov (2011). Define the simulated pivotal process
$$\Lambda(\tau) = \max_{1 \le j \le 2p} \left| \frac{1}{n} \sum_{i=1}^n \frac{X_{ij}(\tau)\,(\gamma - 1\{U_i \le \gamma\})}{D_j(\tau)\sqrt{\gamma(1-\gamma)}} \right|,$$
where the $U_i$ are simulated i.i.d. from the uniform distribution on the interval $[0, 1]$, and $\gamma$ is the quantile of interest (e.g., $\gamma = 0.5$ for median regression). Note that $\Lambda(\tau)$ is a stochastic process indexed by $\tau$. Let $\Lambda_{1-\epsilon^*}$ be the $(1-\epsilon^*)$-quantile of $\sup_{\tau \in \mathcal{T}} \Lambda(\tau)$, where $\epsilon^*$ is a small positive constant selected by the user. Then, we select the tuning parameter in Step 1 by $\kappa_n = c_1 \cdot \Lambda_{1-\epsilon^*}$. Similarly, let $\Lambda_{1-\epsilon^*}(\check\tau)$ be the $(1-\epsilon^*)$-quantile of $\Lambda(\check\tau)$, where $\check\tau$ is chosen in Step 2. We select $\omega_n$ and $\mu_n$ in Step 3 by $\omega_n = c_1 \cdot \Lambda_{1-\epsilon^*}(\check\tau)$ and $\mu_n = c_2 \cdot \omega_n$. It is also necessary to choose $\mathcal{T}$ in applications. In our Monte Carlo experiments in Section 6, we take $\mathcal{T}$ to be the interval from the 15th to the 85th percentile of the empirical distribution of the threshold variable $Q_i$; Hansen (1996), for example, employed the same range in his application to U.S. GNP dynamics.
Based on the suggestions of Belloni and Chernozhukov (2011) and some preliminary simulations, we set $c_1 = 1.1$, $c_2 = \log \log n$, and $\epsilon^* = 0.1$. In addition, recall that we set $a = 3.7$ when calculating the SCAD weights $\tilde w_j$ in Step 3b, following the convention in the literature (e.g., Fan and Li (2001) and Loh and Wainwright (2013)). In Step 1, we first solve the lasso problem for $\alpha$ given each grid point of $\tau \in \mathcal{T}$; we then choose $\hat\tau$ and the corresponding $\hat\alpha(\hat\tau)$ that minimize the objective function.
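A sketch of this data-driven choice of $\kappa_n$ under the reconstruction of $\Lambda(\tau)$ given above; the number of simulation draws n_sim is an illustrative assumption.

```r
choose_kappa <- function(X, Q, gamma, tau_grid, n_sim = 500,
                         c1 = 1.1, eps_star = 0.1) {
  n <- nrow(X)
  sup_lambda <- replicate(n_sim, {
    U     <- runif(n)
    score <- gamma - (U <= gamma)        # pivotal scores, independent of data
    max(vapply(tau_grid, function(tau) {
      Xtau <- cbind(X, X * (Q > tau))
      Dj   <- sqrt(colMeans(Xtau^2))
      max(abs(crossprod(Xtau, score) / n) / (Dj * sqrt(gamma * (1 - gamma))))
    }, numeric(1)))
  })
  c1 * unname(quantile(sup_lambda, 1 - eps_star))  # kappa_n = c1 * Lambda_{1-eps*}
}
```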
Step 2 can be solved by a simple grid search. Step 3 is a standard lasso quantile regression given $\check\tau$, whose numerical implementation is well established. We use the rq() function of the R 'quantreg' package with method = "lasso" in each implementation of the penalized quantile regression (Koenker, 2016).
Risk Consistency

Define the excess risk of $(\alpha, \tau)$ as
$$R(\alpha, \tau) \equiv E\left[\rho_\gamma\big(Y - X(\tau)^T \alpha\big)\right] - E\left[\rho_\gamma\big(Y - X(\tau_0)^T \alpha_0\big)\right],$$
where the expectation is taken over a new observation $(Y, X, Q)$. What we mean by "risk consistency" here is that the excess risk of the proposed estimators converges in probability to zero. The other asymptotic properties of the proposed estimators are presented in Sections 4 and 5.
In this section, we begin by stating the regularity conditions needed for our first theoretical result. Recall that $X_{ij}$ denotes the $j$-th element of $X_i$.
(iii) $\alpha_0 \in \mathcal{A} \equiv \{\alpha : |\alpha|_\infty \le M_1\}$ for some constant $M_1 < \infty$, and $\tau_0 \in \mathcal{T} \equiv [\underline\tau, \bar\tau]$. Furthermore, the probabilities of $\{Q < \underline\tau\}$ and of $\{Q > \bar\tau\}$ are strictly positive.

(iv) There exist universal constants $\underline D > 0$ and $\bar D < \infty$ such that, w.p.a.1, $\underline D \le \min_{1 \le j \le 2p} \inf_{\tau \in \mathcal{T}} D_j(\tau) \le \max_{1 \le j \le 2p} \sup_{\tau \in \mathcal{T}} D_j(\tau) \le \bar D$.

(v) $E[(X^T \delta_0)^2 \mid Q = \tau] \le M_2 |\delta_0|_2^2$ for all $\tau \in \mathcal{T}$ and for some constant $M_2 < \infty$.

In addition to the random sampling assumption, condition (i) imposes mild moment restrictions on $X$. Condition (ii) imposes the weak restriction that the probability that $Q \in (\tau_1, \tau_2]$ is bounded by a constant times $\tau_2 - \tau_1$. Condition (iii) assumes that the parameter space is compact and that the support of $Q$ is strictly larger than $\mathcal{T}$; these conditions are standard in the literature on change-point and threshold models (e.g., Seijo and Sen (2011a,b)). Condition (iii) also assumes that the conditional expectation $E[X_{ij}^2 \mid Q = \cdot]$ is bounded on $\mathcal{T}$ uniformly in $j$. Condition (iv) requires that each regressor be of the same magnitude uniformly over the threshold $\tau$; as the data-dependent weights $D_j(\tau)$ are sample second moments of the regressors, it is not stringent to assume that they are bounded away from both zero and infinity. Condition (v) puts a weak upper bound on $E[(X^T\delta_0)^2 \mid Q = \tau]$; a simple sufficient condition is that the eigenvalues of $E[X_{J(\delta_0)} X_{J(\delta_0)}^T \mid Q = \tau]$ are bounded uniformly in $\tau$, where $X_{J(\delta_0)}$ denotes the subvector of $X$ corresponding to the nonzero components of $\delta_0$.
Throughout the paper, we let $s \equiv |J(\alpha_0)|$, the cardinality of $J(\alpha_0)$. We allow $s \to \infty$ as $n \to \infty$ and give precise regularity conditions on its growth rate below.
The following theorem concerns the convergence of the excess risk $R(\hat\alpha, \hat\tau)$ of the first-step estimator; it shows that $R(\hat\alpha, \hat\tau) = O_P(\kappa_n s)$.
Note that Theorem 3.1 holds regardless of the identifiability of $\tau_0$ (that is, whether or not $\delta_0 = 0$). In addition, the rate $O_P(\kappa_n s)$ is achieved regardless of whether $\kappa_n s$ converges, and we have risk consistency if $\kappa_n s \to 0$ as $n \to \infty$. The restriction on $s$ is slightly stronger than the standard requirement $s = o(n/\log p)$ in the literature on M-estimation (see, e.g., van de Geer (2008) and Chapter 6.6 of Bühlmann and van de Geer (2011)), because the objective function $\rho(Y, X(\tau)^T\alpha)$ is non-convex in $\tau$, due to the unknown change point.
Remark 3.1. The extra logarithmic factor $(\log p)(\log n)$ in the definition of $\kappa_n$ (see (2.3)) is due to the existence of the unknown, and possibly non-identifiable, threshold parameter $\tau_0$. In fact, an inspection of the proof of Theorem 3.1 reveals that it suffices to assume that $\kappa_n$ satisfies $\kappa_n \gg \log^2(p/s)\,[\log(np)/n]^{1/2}$. The term $\log^2(p/s)$ and the additional $(\log n)^{1/2}$ factor inside the brackets are needed to establish the stochastic equicontinuity of the relevant empirical process. In Appendix C.2, we show that an improved rate of convergence, $O_P(\omega_n s)$, is attainable for the excess risk by taking the second and third steps of estimation.
Asymptotic Properties: Case I. $\delta_0 \neq 0$

Sections 4 and 5 provide asymptotic properties of the proposed estimators. In Appendix B, we list the assumptions, in addition to Assumption 1, that are needed to derive these properties. We first establish the consistency of $\hat\tau$ for $\tau_0$.
The following theorem presents the rates of convergence of the first-step estimators of $\alpha_0$ and $\tau_0$. Recall that $\kappa_n$ is the first-step penalization tuning parameter satisfying (2.3).
Theorem 4.2 (Rates of Convergence When $\delta_0 \neq 0$). Suppose that $\kappa_n s^2 \log p = o(1)$. Then, under Assumptions 1-6, we have the following rates of convergence.

In Theorem 3.1, we have $R(\hat\alpha, \hat\tau) = O_P(\kappa_n s)$. The improved rate of convergence for $R(\hat\alpha, \hat\tau)$ in Theorem 4.2 is due to the additional assumptions (in particular, the compatibility conditions in Assumption 3, among others). It is worth noting that $\hat\tau$ converges to $\tau_0$ faster than the standard parametric rate $n^{-1/2}$ as long as $s^2 (\log p)^6 (\log n)^4 = o(n)$. The main reason for this super-consistency is that the objective function behaves locally linearly around $\tau_0$, with a kink at $\tau_0$, unlike in regular estimation problems, where the objective function behaves locally quadratically around the true parameter value. Moreover, the achieved convergence rate for $\hat\alpha$ is nearly minimax optimal, with an additional factor $(\log p)(\log n)$ compared to the rate of regular Lasso estimation (e.g., Bickel et al. (2009); Raskutti et al. (2011)); this factor arises from the unknown change point $\tau_0$. We improve the rates of convergence for both $\tau_0$ and $\alpha_0$ further by taking the second and third steps of estimation.
Recall that the second-step estimator of $\tau_0$ is defined as $\check\tau = \operatorname{argmin}_{\tau \in \mathcal{T}} R_n(\hat\alpha, \tau)$, where $\hat\alpha$ is the first-step estimator of $\alpha_0$ in (2.2). Consider an oracle case in which $\alpha$ in $R_n(\alpha, \tau)$ is fixed at $\alpha_0$: let $R_n^*(\tau) = R_n(\alpha_0, \tau)$ and $\tau^* = \operatorname{argmin}_{\tau \in \mathcal{T}} R_n^*(\tau)$. We now give one of the main results of this paper.
Theorem 4.3 (Oracle Estimation of $\tau_0$). Let Assumptions 1-6 hold, and suppose that $\kappa_n s^2 \log p = o(1)$. Then $n(\check\tau - \tau^*) = o_P(1)$. Furthermore, $n(\check\tau - \tau_0)$ converges in distribution to the smallest minimizer of a compound Poisson process of the form
$$M(h) = 1\{h < 0\} \sum_{i=1}^{N_1(-h)} \rho_{1i} + 1\{h > 0\} \sum_{i=1}^{N_2(h)} \rho_{2i},$$
where $N_1$ and $N_2$ are Poisson processes with the same jump rate $f_Q(\tau_0)$, and $\{\rho_{1i}\}$ and $\{\rho_{2i}\}$ are two sequences of independent and identically distributed random variables. The distributions of $\rho_{1i}$ and $\rho_{2i}$, respectively, are identical to the conditional distributions of $\rho(Y, X^T\theta_0) - \rho(Y, X^T\beta_0)$ and of $\rho(Y, X^T\beta_0) - \rho(Y, X^T\theta_0)$ given $Q = \tau_0$; and $N_1$, $N_2$, $\{\rho_{1i}\}$ and $\{\rho_{2i}\}$ are mutually independent.
The first conclusion of Theorem 4.3 establishes that the second-step estimator of $\tau_0$ is an oracle estimator in the sense that it is asymptotically equivalent to the infeasible oracle estimator $\tau^*$. As emphasized in the introduction, the oracle property is obtained without relying on perfect model selection in the first step or on a minimum signal condition on the active covariates. The second conclusion of Theorem 4.3 follows from combining well-known weak convergence results in the literature (see, e.g., Pons (2003); Kosorok and Song (2007); Lee and Seo (2008)) with the argmax continuous mapping theorem of Seijo and Sen (2011b).
Remark 4.1. Li and Ling (2012) propose a numerical approach for constructing a confidence interval by simulating a compound Poisson process in the context of least squares estimation.
We adopt their approach to simulate the compound Poisson process for quantile regression.
See Appendix A for a detailed description of how to construct a confidence interval for τ 0 .
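A hedged sketch of this simulation approach, adapting Li and Ling (2012) to quantile regression as in Appendix A. The kernel density estimate of the jump rate $f_Q(\tau_0)$, the window H for resampling the compounding factors, and all function names are illustrative assumptions rather than the authors' exact algorithm.

```r
rho_gamma <- function(u, gamma) u * (gamma - (u < 0))

simulate_cp_ci <- function(Y, X, Q, gamma, beta, delta, tau_check,
                           H = 0.5, B = 1000, h_grid = seq(-30, 30, 0.1),
                           level = 0.95) {
  n  <- length(Y)
  bw <- bw.nrd0(Q)
  fQ <- mean(dnorm((Q - tau_check) / bw)) / bw    # kernel estimate of f_Q(tau_0)
  # loss difference from using the regime-1 coefficients instead of regime-2
  ld <- rho_gamma(Y - X %*% beta, gamma) -
        rho_gamma(Y - X %*% (beta + delta), gamma)
  left  <- -ld[Q > tau_check - H & Q <= tau_check]  # draws for rho_1
  right <-  ld[Q > tau_check & Q <= tau_check + H]  # draws for rho_2
  mins <- replicate(B, {
    M <- vapply(h_grid, function(h) {
      if (h < 0) sum(sample(left,  rpois(1, fQ * (-h)), replace = TRUE))
      else if (h > 0) sum(sample(right, rpois(1, fQ * h), replace = TRUE))
      else 0
    }, numeric(1))
    h_grid[which.min(M)]                 # smallest minimizer of M(h)
  })
  q <- quantile(mins, c((1 + level) / 2, (1 - level) / 2))
  tau_check - q / n                      # confidence interval for tau_0
}
```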
We now consider the Step 3a estimator $\check\alpha$ of $\alpha_0$ defined in (2.5). Recall that $\omega_n$ is the Step 3a penalization tuning parameter satisfying (2.7).
Theorem 4.4 (Improved Rates of Convergence When $\delta_0 \neq 0$). Suppose that $\kappa_n s^2 \log p = o(1)$. Then, under Assumptions 1-6, $|\check\alpha - \alpha_0|_1 = O_P(\omega_n s)$ and $R(\check\alpha, \check\tau) = O_P(\omega_n s)$.

Theorem 4.4 shows that the estimator $\check\alpha$ defined in Step 3a achieves the optimal rate of convergence in terms of prediction and estimation. In other words, when $\omega_n$ is proportional to $\{\log(p \vee n)/n\}^{1/2}$ as in (2.7) and $p$ is larger than $n$, it attains the minimax rates as in, e.g., Raskutti et al. (2011).
As mentioned in Section 2, the Step 3b estimator $\tilde\alpha$ of $\alpha_0$ is designed for variable selection: its nonzero components are expected to identify the contributing regressors.
We now establish conditions (Theorem 4.5, under Assumptions 1-6) under which the estimator $\tilde\alpha$ defined in Step 3b has the change-point-oracle property, meaning that it achieves variable selection consistency and has the same limiting distribution as though the identities of the important regressors and the location of the change point were known. The condition (4.1) restricts the strength of the signal via $\min_{j \in J(\alpha_0)} |\alpha_{0j}|$, and the tuning parameter in Step 3b should satisfy $\omega_n \ll \mu_n$ and $s^2 \log s / n \ll \mu_n^2$. Hence, variable selection consistency demands a larger tuning parameter than in Step 3a.
To conduct statistical inference, we now discuss the asymptotic distribution of $\tilde\alpha_J$. Define the oracle estimator
$$\alpha_J^* \equiv \operatorname*{argmin}_{a \in \mathbb{R}^{s}} \frac{1}{n} \sum_{i=1}^n \rho_\gamma\big(Y_i - X_{iJ}(\tau_0)^T a\big),$$
whose asymptotic distribution corresponds to an oracle case in which both $\tau_0$ and the true active set $J(\alpha_0)$ are known a priori. The limiting distribution of $\tilde\alpha_J$ is the same as that of $\alpha_J^*$; hence, we call this result the change-point-oracle property of the Step 3b estimator, and the following theorem establishes it.
Theorem 4.6 (Change-Point-Oracle Properties). Suppose that all the conditions imposed in Theorem 4.5 are satisfied. Furthermore, assume that $\frac{\partial}{\partial \alpha} E[\rho(Y, X^T\alpha) \mid Q = t]$ exists for all $t$ in a neighborhood of $\tau_0$ and that all its elements are continuous and bounded. Then $\tilde\alpha_J$ and $\alpha_J^*$ have the same limiting distribution.

Since the sparsity index $s$ grows at a rate slower than the sample size $n$, it is straightforward to establish the asymptotic normality of a linear transformation of $\tilde\alpha_J$, i.e., $L\tilde\alpha_J$ with $L : \mathbb{R}^s \to \mathbb{R}$ and $|L|_2 = 1$, by combining the existing results on quantile regression with parameters of increasing dimension (see, e.g., He and Shao (2000)) with Theorem 4.6.
Remark 4.2. Without a condition on the strength of the minimal signals, it may not be possible to achieve variable selection consistency or to establish the change-point-oracle properties. However, the following theorem shows that the SCAD-weighted penalized estimator can still achieve a satisfactory rate of convergence for $\alpha_0$ without a lower bound on $\min_{j \in J(\alpha_0)} |\alpha_{0j}|$. The resulting rates of convergence are, however, slower than those of Theorem 4.5.
Theorem 4.7 (Satisfactory Rates Without the Minimum Signal Condition). Assume that Assumptions 1-6 hold, and suppose that $\kappa_n s^2 \log p = o(1)$ and $\omega_n \ll \mu_n$. Then, without any lower bound requirement on $\min_{j \in J(\alpha_0)} |\alpha_{0j}|$, the Step 3b estimator remains estimation and prediction consistent, with rates slower than those in Theorem 4.5.

Asymptotic Properties: Case II. $\delta_0 = 0$

In this section, we show that our estimators retain desirable properties even if there is no change point in the true model. The case $\delta_0 = 0$ corresponds to the high-dimensional linear quantile regression model, since there is no structural change in the coefficients. A new analysis, different from that of the standard high-dimensional model, is still required because in practice we do not know whether $\delta_0 = 0$ or not; thus, the proposed estimation method still estimates $\tau_0$ to account for possible structural changes. The following results show that, in this case, the first-step estimator of $\alpha_0$ asymptotically behaves as if $\delta_0 = 0$ were known a priori.
Theorem 5.1 (Rates of Convergence When $\delta_0 = 0$). Suppose that $\kappa_n s = o(1)$. Then, under Assumptions 1-4, $|\hat\alpha - \alpha_0|_1 = O_P(\kappa_n s)$ and $R(\hat\alpha, \hat\tau) = O_P(\kappa_n s)$.

The results in Theorem 5.1, combined with those in Theorem 4.2, imply that the first-step estimator performs equally well, in terms of the rates of convergence of both the $\ell_1$ loss for $\hat\alpha$ and the excess risk, regardless of the existence of a threshold effect. It is straightforward to obtain an improved rate result for the Step 3a estimator, analogous to Theorem 4.4, under Assumptions 1-4; we omit the details for brevity.
We now give a result similar to Theorems 4.5 and 4.7.
Theorem 5.2 (Variable Selection When $\delta_0 = 0$). Suppose that $\kappa_n s = o(1)$, $s^4 \log s = o(n)$, $\omega_n + \sqrt{s \log s / n} \ll \mu_n$, and Assumptions 1-4 hold. Then: (i) if the minimum signal condition (5.1) holds, the Step 3b estimator achieves variable selection consistency; (ii) without the minimum signal condition (5.1), it remains estimation consistent.

Theorem 5.2 demonstrates that when there is in fact no change point, our estimator of $\delta_0$ is exactly zero with high probability. Therefore, the estimator can also be used as a diagnostic tool to check whether there exists any change point. Results similar to Theorem 4.6 can be established straightforwardly as well; their details are omitted for brevity.

Monte Carlo Experiments
In this section, we provide the results of Monte Carlo experiments. The baseline model is based on the following data generating process: for $i = 1, \dots, n$,
$$Y_i = X_i^T(\beta_0 + \xi_{10} U_i) + X_i^T(\delta_0 + \xi_{20} U_i)\, 1\{Q_i > \tau_0\},$$
where $U_i$ follows $N(0, 0.5^2)$ and $Q_i$ follows the uniform distribution on the interval $[0, 1]$. The $p$-dimensional covariate $X_i$ is composed of a constant and $Z_i$, i.e., $X_i \equiv (1, Z_i^T)^T$, where $Z_i$ follows the multivariate normal distribution $N(0, \Sigma)$ with covariance matrix $\Sigma_{ij} = (1/2)^{|i-j|}$. The variables $U_i$, $Q_i$ and $Z_i$ are independent of one another. Note that the conditional $\gamma$-quantile of $Y_i$ given $(X_i, Q_i)$ then has the form
$$\mathrm{Quant}_\gamma(Y_i \mid X_i, Q_i) = X_i^T \beta_\gamma + X_i^T \delta_\gamma 1\{Q_i > \tau_0\},$$
where $\beta_\gamma = \beta_0 + \xi_{10} \cdot \mathrm{Quant}_\gamma(U)$ and $\delta_\gamma = \delta_0 + \xi_{20} \cdot \mathrm{Quant}_\gamma(U)$.
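A sketch of this baseline design in R; MASS::mvrnorm is used for the Gaussian covariates, and the parameter values passed in (beta0, delta0, xi10, xi20, tau0) are placeholders for the designs in Tables 1-3.

```r
library(MASS)

gen_data <- function(n, p, beta0, delta0, xi10, xi20, tau0) {
  Sigma <- 0.5^abs(outer(1:(p - 1), 1:(p - 1), "-"))  # Sigma_ij = (1/2)^|i-j|
  Z <- mvrnorm(n, mu = rep(0, p - 1), Sigma = Sigma)
  X <- cbind(1, Z)                                    # constant plus Z_i
  Q <- runif(n)                                       # threshold variable
  U <- rnorm(n, sd = 0.5)
  Y <- X %*% beta0 + (X %*% xi10) * U +
       ((X %*% delta0) + (X %*% xi20) * U) * (Q > tau0)
  list(Y = as.vector(Y), X = X, Q = Q)
}
```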
We consider three quantile regression models, with $\gamma = 0.25$, $0.5$, and $0.75$, and we compare the estimation results of each step. To assess the performance of our estimators, we also compare the results with two "oracle estimators": Oracle 1 knows both the true active set $J(\alpha_\gamma)$ and the change-point parameter $\tau_0$, while Oracle 2 knows only $J(\alpha_\gamma)$.
The threshold parameter $\tau_0$ is re-estimated in Steps 3a and 3b using the updated estimates of $\alpha_0$. Tables 1-3 summarize the simulation results. We abuse notation slightly and denote all estimators by $(\hat\alpha, \hat\tau)$; these are understood as $(\hat\alpha, \hat\tau)$ in Step 1, $\check\tau$ in Step 2, and so on. Oracle 1 knows both $J(\alpha_\gamma)$ and $\tau_0$, and Oracle 2 knows only $J(\alpha_\gamma)$. The expectation (E) is calculated as the average over the 1,000 iterations in each design. Note that $|J(\alpha_\gamma)| = 1$. 'NA' denotes 'Not Available', used when a parameter is not estimated in a given step. The estimation results for $\tau$ in the rows for Steps 3a and 3b are based on the re-estimation of $\tau$ given the estimates from Step 3a ($\check\alpha$) and Step 3b ($\tilde\alpha$). We report the excess risk, the average number of selected parameters $E[|J(\hat\alpha)|]$, and the mean squared errors of $\hat\alpha$ on the active and inactive sets ($\hat\alpha_J$ / $\hat\alpha_{J^c}$). For each sample, the excess risk is approximated by simulation as
$$\frac{1}{S} \sum_{s=1}^S \left[\rho_\gamma\big(Y_s - X_s(\hat\tau)^T \hat\alpha\big) - \rho_\gamma\big(Y_s - X_s(\tau_0)^T \alpha_\gamma\big)\right],$$
where $S = 10{,}000$ is the number of simulation draws; we then report the average over the 1,000 replications. Similarly, we calculate prediction errors by simulation as $\{S^{-1} \sum_{s=1}^S (X_s(\hat\tau)^T \hat\alpha - X_s(\tau_0)^T \alpha_\gamma)^2\}^{1/2}$ and report the average value.
We also report the root-mean-squared error (RMSE) and the coverage probability of the 95% confidence interval for $\tau$ (C. Prob. of $\tau$). The confidence intervals for $\tau_0$ are calculated by simulating the two-sided compound Poisson process in Theorem 4.3, adopting the approach proposed by Li and Ling (2012); the details are provided in Appendix A. Li and Ling (2012) showed, in the context of least squares estimation, that it is valid to simulate the compound Poisson process by separately simulating the Poisson processes and the compounding factors from empirical distributions. We build on their suggestion and modify their procedure for quantile regression. We do not provide a formal justification for this procedure in this paper, but it appears to work well in simulations; a formal analysis is an interesting topic for future research.
Note that the root-mean-squared error of $\tau$ and the coverage probability of the confidence interval in the rows for Steps 3a and 3b of the tables are estimation results for the updated $\tau$: we re-estimate $\tau$ as in Step 2 using $(\check U_i, \check\alpha)$ and $(\tilde U_i, \tilde\alpha)$ from Steps 3a and 3b instead of $(\hat U_i, \hat\alpha)$. Finally, we also report the oracle proportion (Oracle Prop.), namely the fraction of the 1,000 replications in which the correct model is selected.
Overall, the simulation results confirm the asymptotic theory developed in the previous sections. First, these results show the advantage of quantile regression models over existing mean regression models with a change point, e.g., Lee et al. (2016): the proposed estimator (Step 3b) selects different nonzero coefficients at different quantile levels, whereas the estimator in Lee et al. (2016) cannot detect such heterogeneous models. In general, the proposed estimators perform better for heteroskedastic designs and for fat-tailed error distributions, as discussed in detail below. Second, regarding the finite-sample performance of the proposed estimators in Step 3, their prediction errors are within a reasonable bound of those of Oracles 1 and 2; recall that we estimate models with 250 or 500 times more regressors in each design. Third, the root-mean-squared error of $\check\tau$ decreases quickly, confirming the super-consistency of $\check\tau$. Fourth, the coverage probabilities of the confidence interval are close to 95%, especially when $n = 400$; thus, we recommend that practitioners use $\check\tau$ from Step 2 or its re-estimated version based on the estimates from Step 3a or Step 3b. Finally, the oracle proportion of Step 3b is quite satisfactory and confirms our model selection consistency results.

Comparison with Mean Regression with a Change Point
Table 4 compares the performance of the proposed estimator with that of the mean regression method in Lee et al. (2016). For the purpose of a direct comparison between the mean and median regression models, the tuning parameter $\lambda$ is fixed to be the same as that of Step 1 in median regression. We consider three different simulation designs at $\gamma = 0.5$ with $n = 200$. The first model is a homoskedastic model, obtained by setting $\xi_{10} = (1, 0, \dots, 0)$ in the baseline design. The second model is the same as the heteroskedastic median regression design in Table 2.

Table 5 shows the performance of the estimator when there is no change point. We use the baseline design with $\gamma = 0.75$ and set $\delta_0 = (0, \dots, 0)$. As we are interested in the performance of the estimator of $\delta_0$, we report the average number of parameters selected in $\tilde\delta$, the MSE of $\tilde\delta$, and the proportion of replications detecting no change point (No-change Prop.). As predicted by the theory, all measures indicate that the Step 3b estimator detects no-change-point models quite well: both $E[|J(\tilde\delta)|]$ and the MSE of $\tilde\delta$ are quite low, and the no-change proportion is high. We also observe marked improvement in these measures when the sample size increases from $n = 200$ to $n = 400$.
Note that the simulation design in Table 6 is the same as that reported in Table 2, except that $\delta_{0.5} = (0, 1, 0, \dots, 0)$ in Table 2. Therefore, the simulation design in Table 2 satisfies the minimum signal condition, whereas that of this subsection does not.
The simulation results in Table 6 are consistent with the asymptotic theory in Section 4 and the remarks in Section 2.2 comparing the estimators in Step 3. The Step 3b estimator performs better than the Step 3a estimator in Table 2 but worse in Table 6. Also note that the oracle proportion is zero for the Step 3b estimator, which is expected given the low signals in the coefficients. Finally, it is important to note that the performance of the estimators of $\tau_0$ remains good in terms of the MSE in the presence of low signals in $\delta$. The coverage probability of the confidence interval is much higher than the nominal level, which was not observed in the previous simulations. Since the MSE and coverage probability of the infeasible Oracle 2 estimator and the other estimators are very similar, we interpret the over-coverage as not being driven by the high dimensionality of the regressors or by variable selection; perhaps it is due to the larger number of coefficients that the Oracle 2 estimator must estimate, compared to Table 2.

Table 6: When the minimal signal in $\delta$ is low

Additional Simulation Results
We have carried out additional Monte Carlo experiments. For the sake of brevity, we only report the main findings here and present the full results in the appendices. In Appendix F, we report simulation results in which the change point $\tau_0$ and the distribution of $Q_i$ vary. In particular, we consider three different distributions of $Q_i$: Uniform[0, 1], $N(0, 1)$, and $\chi^2(1)$.
We find that the performance of $\check\tau$, as measured by the root-mean-squared error, depends on the density of the distribution of $Q_i$, as expected from the asymptotic theory. For instance, the performance is quite uniform over different $\tau_0$ when $Q_i$ follows Uniform[0, 1]; however, when $Q_i$ follows $N(0, 1)$ or $\chi^2(1)$, the estimator performs better when $\tau_0$ is located at a point where the density of $Q_i$ is higher. Sensitivity analyses provided in Appendix G show that the main simulation results are robust to changes of between −15% and +15% in the suggested tuning parameter values.
In summary, the proposed estimation procedure works well in finite samples and confirms the theoretical results developed earlier. The simulation studies show some advantages of the proposed estimator over the existing mean regression method, e.g., Lee et al. (2016). It also detects no-change-point models well without any pretest. The main qualitative results are not sensitive to different simulation designs for $\tau_0$ and $Q_i$ or to some variation in the tuning parameter values.

Estimating a Change Point in Racial Segregation
As an empirical illustration, we investigate the existence of tipping in the dynamics of racial segregation using the dataset constructed by Card et al. (2008). Using U.S. Census tract-level data, they show that a neighborhood's white population decreases substantially when the minority share in the area exceeds a tipping point (or threshold point). Lee et al. (2011) develop a test for the existence of threshold effects and apply their test to this dataset. Different from these existing studies, we consider a high-dimensional setup that allows both possibly highly nonlinear effects of the main covariate (the minority share in the neighborhood) and possibly higher-order interactions between the additional covariates.
We build on the specifications used in Card et al. (2008) and Lee et al. (2011) and choose the following median regression model with a constant shift due to the tipping effect:
$$Y_i = g(Q_i) + X_i^T \beta_0 + \delta_0 1\{Q_i > \tau_0\} + U_i, \qquad P(U_i \le 0 \mid X_i, Q_i) = 0.5,$$
where, for census tract $i$, the dependent variable $Y_i$ is the ten-year change in the neighborhood's white population, $Q_i$ is the base-year minority share in the neighborhood, and $X_i$ is a vector of six tract-level control variables and their various interactions, depending on the model specification. Both $Y_i$ and $Q_i$ are in percentage terms. The basic six variables in $X_i$ are the unemployment rate, the log of mean family income, the fractions of single-unit, vacant, and renter-occupied housing units, and the fraction of workers who use public transport to travel to work. The function $g(\cdot)$ is approximated by cubic B-splines with 15 knots at equi-quantile locations, so the degrees of freedom are 19 including an intercept term. In our empirical illustration, we use the census-tract-level sample of Chicago with base year 1980.
In the first set of models, we consider possible interactions among the six tract-level control variables up to six-way interactions. Specifically, the vector $X$ with full six-way interactions is composed of the following 63 regressors: all products of the form $X^{(j_1)} \times \cdots \times X^{(j_k)}$ for $1 \le j_1 < \cdots < j_k \le 6$ and $k = 1, \dots, 6$, where $X^{(j)}$ is the $j$-th tract-level control variable; note that $\sum_{k=1}^6 \binom{6}{k} = 63$. The lower-order interaction vectors (e.g., two-way or three-way) are nested in the higher-order interaction vectors (e.g., three-way or four-way). The total number of regressors thus varies from 26 (19 from the B-splines, 6 from $X_i$, and 1 from $1\{Q_i > \tau\}$) when there are no interactions to 83 with full six-way interactions. In the next set of models, we add the square of each tract-level control variable and generate the analogous interactions up to six-way. In this case, the total number of regressors varies from 32 to 2,529; for example, the largest model has $\#(\text{B-spline basis}) + \#(\text{indicator}) + \#(\text{interactions up to six-way out of 12}) = 19 + 1 + \sum_{k=1}^{6} \binom{12}{k} = 2{,}529$ regressors, which is much larger than the sample size ($n = 1{,}813$). Table 7 summarizes the estimation results at the 0.25, 0.5, and 0.75 quantiles. We report the total number of regressors in each model and the number of regressors selected in Step 3b. The change point $\tau$ is estimated by a grid search over 591 equi-spaced points in $[1, 60]$; the lower bound of 1% corresponds to the 1.6th sample percentile of $Q_i$, and the upper bound of 60%, which is about the upper sample quartile of $Q_i$, is the same as that used in Card et al. (2008). In this empirical example, we report the estimates of $\tau_0$ and the confidence intervals updated after Step 3b (that is, $\tau$ is re-estimated using the estimates of $\alpha_0$ from Step 3b). If this estimate differs from the previous one from Step 2, we repeat Step 3b and Step 2 until convergence.
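A sketch of the design construction for this application: cubic B-splines in $Q_i$ with 15 interior knots at equi-quantile locations (19 columns including the intercept), plus products of the tract-level controls up to a given interaction order. The variable names and the exact knot placement are illustrative assumptions.

```r
library(splines)

build_design <- function(Q, W, max_order = 6) {
  # W: n x 6 (or n x 12) matrix of tract-level controls
  knots <- quantile(Q, probs = seq(0, 1, length.out = 17)[2:16])  # 15 interior knots
  G <- cbind(1, bs(Q, knots = knots, degree = 3))  # 19 columns incl. intercept
  inter <- lapply(seq_len(max_order), function(k) {
    combs <- combn(ncol(W), k)
    apply(combs, 2, function(idx) apply(W[, idx, drop = FALSE], 1, prod))
  })
  # With 6 controls and max_order = 6: 19 + sum_k choose(6, k) = 82 columns;
  # adding the jump indicator 1{Q > tau} in estimation gives 83 regressors.
  cbind(G, do.call(cbind, inter))
}
```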
The estimation results suggest several interesting points. First, at each quantile, the proposed method selects sparse representations in all model specifications, even when the number of regressors is relatively large. Furthermore, the number of selected regressors does not grow rapidly as we increase the number of possible covariates, and the set of selected covariates largely overlaps across the different dictionaries at each quantile; see Appendix H for details on the selected regressors. Second, the estimation results differ across quantiles, indicating that there may be heterogeneity in this application. The confidence intervals for $\tau_0$ at the 0.25 quantile are quite tight in all cases and provide convincing evidence of a tipping effect: in the case of six-way interactions with 12 control variables, the estimated tipping point is 5.65% and the estimated jump size is −5.50%.
However, this strong tipping effect becomes weaker at the 0.50 and 0.75 quantiles, as shown either by wider confidence intervals or by a zero jump size, i.e., $\tilde\delta = 0$.
We now compare the estimation results from quantile regression with those from mean regression, which are reported in Table 8 (full estimation results are in Appendix H). We show two kinds of mean regression estimates: one with the untrimmed original data and the other with trimmed data in which we drop the top and bottom 5% of observations based on $\{Y_i\}$. The estimated tipping points are the same for the two datasets, but the estimated jump size is much larger with the original data.

Notes to Table 7: The sample size is $n = 1{,}813$. The parameter $\tau_0$ is estimated by a grid search on the 591 equi-spaced points over $[1, 60]$. Both $\tau$ and the 95% confidence interval are based on re-estimation after Step 3b; that is, $\tau$ is estimated again using $(\tilde U_i, \tilde\alpha)$ from Step 3b.

Notes to Table 8: The sample size of the untrimmed original data is $n = 1{,}813$. The trimmed data drop the top and bottom 5% of observations based on $\{Y_i\}$, and the sample size decreases to $n = 1{,}626$. The parameter $\tau_0$ is estimated by a grid search on the 591 equi-spaced points over $[1, 60]$. As in the simulation studies, the tuning parameters are set from Step 1 in median regression.

The left panel of Figure 1 compares the mean and median regression results (without trimming the data), and the right panel shows the interquartile range of the conditional distribution of $Y_i$ as a function of $Q_i$ given the other regressors. The mean regression estimates are much more volatile around the tipping point than the median regression estimates, although the estimated tipping point is the same. In Figure 2, we compare the mean regression estimates with and without trimming. Removing the observations with the top and bottom 5% of the $Y_i$'s stabilizes the estimates, demonstrating that the median regression estimates have the built-in feature of being more stable against outliers in $Y_i$ than the mean estimates. Finally, looking at the right panel of Figure 1, we see that the 25th percentile of the conditional distribution drops at the tipping point of 5.65%, whereas there is no such change at the 75th percentile. This shows that the quantile regression estimates can provide insights into distributional threshold effects in racial segregation.
In summary, this empirical example shows that the proposed method works well in a real empirical setting and is robust to outliers compared with the mean regression approach. The estimation results also confirm that there exists a tipping point in the dynamics of racial segregation in the neighborhoods of Chicago.

Conclusions
In this paper, we have developed $\ell_1$-penalized estimators of a high-dimensional quantile regression model with an unknown change point due to a covariate threshold. We have shown, among other things, that our estimator of the change point achieves an oracle property without relying on perfect covariate selection, thereby avoiding the need for a minimum level condition on the signals of active covariates. We have illustrated the usefulness of our estimation methods via Monte Carlo experiments and an application to tipping in racial segregation.
In a recent working paper, Leonardi and Bühlmann (2016) consider a high-dimensional mean regression model with multiple change points whose number may grow as the sample size increases, and they propose a binary search algorithm to choose the number of change points. Developing a computationally efficient algorithm to detect multiple changes in high-dimensional quantile regression models is an important topic for future research.

Appendices
In Appendix A, we provide the algorithm for constructing the confidence interval for $\tau_0$. In Appendix B, we provide the set of regularity assumptions used to derive the asymptotic properties of the proposed estimators in Sections 4 and 5. In Appendix C, we provide sufficient conditions for the identification of $(\alpha_0, \tau_0)$ in (2.1) and show that an improved rate of convergence is possible for the excess risk by taking the second and third steps of estimation. To prove the theoretical results in the main text, we consider a general M-estimation framework that includes quantile regression as a special case: we provide high-level regularity conditions on the loss function in Appendix D, derive asymptotic properties under these conditions, and then verify all the high-level assumptions for the quantile regression model in Appendix E. Hence, our general results are of independent interest and can be applied to other models, for example, logistic regression models. Appendices F and G provide additional simulation results, and Appendix H gives additional results for the empirical example.
2. Using the residuals $\{\hat U_i\}$ and the estimate $\hat\delta$ from Step 1, simulate the compounding factors $\rho_{1j}$ and $\rho_{2j}$ from their empirical distributions, where $\rho_\gamma(\cdot)$ is the check function as defined in Section 4.

It is straightforward to modify the algorithm above for confidence intervals based on the Step 3a and Step 3b estimators. We set $H = 0.5$ and $B = 1{,}000$ in these simulation studies.

B Assumptions for Oracle Properties
In this section, we list a set of assumptions that are used to derive the asymptotic properties of the proposed estimators in Sections 4 and 5. In the following, we divide our discussion into two important cases: (i) $\delta_0 \neq 0$, so that $\tau_0$ is identified, and (ii) $\delta_0 = 0$, so that $\tau_0$ is not identified. The asymptotic properties are derived under both cases. Note that this distinction is only needed for presenting our theoretical results; in practice, we do not need to know whether $\delta_0 = 0$ or not.
Assumption 2 (Underlying Distribution). (i) The conditional distribution of $Y$ given $(X, Q)$ has a continuously differentiable density function $f_{Y|X,Q}(y|x,q)$ with respect to $y$, whose derivative is denoted by $\dot f_{Y|X,Q}(y|x,q)$.

(ii) There are constants $C_1, C_2 > 0$ such that, for all $(y, x, q)$ in the support of $(Y, X, Q)$, $f_{Y|X,Q}(y|x,q) \le C_1$ and $|\dot f_{Y|X,Q}(y|x,q)| \le C_2$.
Conditions (i) and (ii) are standard assumptions for quantile regression models. To follow the notation in condition (iii), recall that $\alpha_J$ denotes the subvector of $\alpha$ whose indices are in $J(\alpha_0)$; the expressions $X_J(\tau)$, $X_{J(\beta_0)}$, $\alpha_{0J}$ and $\beta_{0J(\beta_0)}$ are understood similarly. Condition (iii) is a weak condition that imposes non-singularity of the Hessian matrix of the population objective function uniformly in a neighborhood of $\tau_0$ when $\delta_0 \neq 0$; it reduces to the usual non-singularity condition when $\delta_0 = 0$.

B.1 Compatibility Conditions
We now make an assumption that extends the well-known compatibility condition (see Bühlmann and van de Geer (2011), Chapter 6); in particular, the following condition is a uniform-in-$\tau$ version of the compatibility condition. Recall that, for a $2p$-dimensional vector $\alpha$, we use $\alpha_J$ and $\alpha_{J^c}$ to denote its subvectors formed by the indices in $J(\alpha_0)$ and $\{1, \dots, 2p\} \setminus J(\alpha_0)$, respectively.

Assumption 3 (Compatibility Condition). (i) When $\delta_0 \neq 0$, there is a neighborhood $\mathcal{T}_0 \subset \mathcal{T}$ of $\tau_0$ and a constant $\phi > 0$ such that, for all $\tau \in \mathcal{T}_0$ and all $\alpha \in \mathbb{R}^{2p}$ satisfying $|\alpha_{J^c}|_1 \le 3 |\alpha_J|_1$,
$$|\alpha_J|_1^2 \le \frac{s\, \alpha^T E[X(\tau) X(\tau)^T]\, \alpha}{\phi^2}.$$

Assumption 3 requires that the compatibility condition hold uniformly in $\tau$ over a neighbourhood of $\tau_0$ when $\delta_0 \neq 0$ and over the entire parameter space $\mathcal{T}$ when $\delta_0 = 0$. Note that this assumption is imposed on the population covariance matrix $E[X(\tau) X(\tau)^T]$; thus, a simple sufficient condition for Assumption 3 is that the smallest eigenvalue of $E[X(\tau) X(\tau)^T]$ is bounded away from zero uniformly in $\tau$. Even if $p > n$, the population covariance matrix can still be strictly positive definite when the sample covariance matrix is not.

B.2 Restricted Nonlinearity Conditions
In this subsection, we impose a restricted nonlinearity condition to deal with the quantile loss function. We extend condition D.4 of Belloni and Chernozhukov (2011) to accommodate the possible existence of an unknown threshold in our model; specifically, as with the compatibility condition, we use a uniform-in-$\tau$ version of the restricted nonlinearity condition.

Assumption 4 (Restricted Nonlinearity). (i) When $\delta_0 \neq 0$, there exists a constant $r^*_{QR} > 0$ such that the restricted nonlinearity bounds hold uniformly over $\mathcal{T}_0$.

Remark B.1. As pointed out by Belloni and Chernozhukov (2011), if $X^T c$ follows a log-concave distribution conditional on $Q$ for any nonzero $c$ (e.g., if the distribution of $X$ is multivariate normal), then Theorem 5.22 of Lovász and Vempala (2007) and the Hölder inequality imply the required bounds for all $\alpha \in \mathcal{A}$. We next describe the additional conditions on the distribution of $(X, Q)$.
Assumption 5 (Additional Conditions on the Distribution of $(X, Q)$). Assume $\delta_0 \neq 0$. In addition, there exists a neighborhood $\mathcal{T}_0 \subset \mathcal{T}$ of $\tau_0$ that satisfies the following.
(i) $Q$ has a density function $f_Q(\cdot)$ that is continuous and bounded away from zero on $\mathcal{T}_0$.
(ii) Let $\tilde X$ denote all the components of $X$ excluding $Q$, in case $Q$ is an element of $X$. The conditional distribution of $Q$ given $\tilde X$ has a density function $f_{Q|\tilde X}(q|\tilde x)$ that is bounded uniformly in both $q \in \mathcal{T}_0$ and $\tilde x$.
Condition (i) implies that $P\{|Q - \tau_0| < \varepsilon\} > 0$ for any $\varepsilon > 0$, and condition (ii) requires that the conditional density of $Q$ given $\tilde X$ be uniformly bounded. When $\tau_0$ is identified, we require $\delta_0$ to be considerably different from zero; this requirement is given in condition (iii). Note that condition (iii) is concerned with $E[(X^T\delta_0)^2 \mid Q = \tau]$, which is an important quantity for developing the asymptotic results when $\delta_0 \neq 0$, and that it is a local condition with respect to $\tau$, in the sense that it has to hold only in a neighborhood of $\tau_0$.
The following additional moment conditions are useful to derive our theoretical results.
Remark B.2. Condition (i) requires that $Q$ have non-negligible support on both sides of $\tau_0$. This condition can be viewed as a rank condition for the identification of $\alpha_0$. In the standard threshold model with a fixed dimension, our condition is trivially implied by the rank condition that both $E[XX^T 1\{Q \le \tau_0\}]$ and $E[XX^T 1\{Q > \tau_0\}]$ be positive definite (see, e.g., Chan (1993) or Hansen (2000)). If the rank condition fails, the regression coefficients may not be identified, which in turn affects the identification of the change point. In the high-dimensional setup, it is undesirable to impose the same rank condition due to the high dimensionality; instead, we replace it with condition (i). Condition (ii) requires the boundedness and a certain smoothness of the conditional expectation functions and prohibits degeneracy in either regime. The last two inequalities in condition (ii) are satisfied if they hold for all $\tau \in \mathcal{T}_0$ and for all $\beta$ satisfying $0 < E|X^T\beta| \le c$ for some small $c > 0$.

C Additional Theoretical Results
In this part of the appendix, we consider the identification of (α 0 , τ 0 ) in (2.1) and show that an improved rate of convergence is possible for the excess risk by taking the second and third steps of estimation.
Theorem C.1 (Identification). (i) Assume that $\delta_0 \neq 0$ and that the $\gamma$-th conditional quantile of $Y$ given $X$ and $Q$ is uniquely given by
$$X^T \beta_0 + X^T \delta_0 1\{Q > \tau_0\}. \qquad \text{(C.1)}$$

(ii) The distribution of $Q$ is absolutely continuous with respect to Lebesgue measure.

(iii) $\tau_0 \in \mathcal{T} \equiv [\underline\tau, \bar\tau]$, which is contained in the strict interior of the support of $Q$.
Theorem C.1 gives sufficient conditions under which $\alpha_0$ and $\tau_0$ are identified. Conditions (i)-(v) in Theorem C.1 are standard. The non-singularity condition (v) is uniform in $\tau \in \mathcal{T}$ and can be viewed as a natural extension of the usual rank condition for the linear model. Condition (vi) imposes that the model is well separated from the case in which there is no change point.
Proof of Theorem C.1. Since the conditional quantile function is uniquely given by (C.1), it suffices to show that $D(\alpha, \tau) \equiv X(\tau)^T \alpha - X(\tau_0)^T \alpha_0 = 0$ almost surely implies $(\alpha, \tau) = (\alpha_0, \tau_0)$. To begin with, write out $D(\alpha, \tau)$ assuming $\tau \le \tau_0$, and suppose that $D(\alpha, \tau)$ in (C.2) is zero a.s. Then it is also zero on the event $E$ defined in (C.3), which pins down the coefficients on the relevant regions and, combined with (C.2), yields $(\alpha, \tau) = (\alpha_0, \tau_0)$. Now consider the other case, $\tau > \tau_0$. Here, modifying the definition of $E$ in (C.3) accordingly and proceeding with arguments identical to those above gives the desired result.

C.2 Improved Risk Consistency
The following theorem shows that an improved rate of convergence for the excess risk is attainable by taking the second and third steps of estimation. Recall that $\omega_n \propto \sqrt{\log(p \vee n)/n}$.

Theorem C.2 (Improved Risk Consistency). Let Assumption 1 hold, and assume in addition that $|\check\tau - \tau_0| = O_P(n^{-1})$ when $\tau_0$ is identifiable. Then the excess risk of the third-step estimator is $O_P(\omega_n s)$.

The proof of this theorem is given in Appendix E.3. For the sake of not introducing additional assumptions in this section, Theorem C.2 assumes that $|\check\tau - \tau_0| = O_P(n^{-1})$ when $\tau_0$ is identifiable; the formal statement of that result is given by Theorem 4.3 in Section 4.

Remark C.1. As in Theorem 3.1, the risk consistency part of Theorem C.2 holds whether or not $\delta_0 = 0$. We obtain the improved rate of convergence in probability for the excess risk by combining the fact that our objective function is convex with respect to $\alpha$ given each $\tau$ with the second-step estimation results that (i) if $\delta_0 \neq 0$, then $\check\tau$ is within a shrinking local neighborhood of $\tau_0$, and (ii) if $\delta_0 = 0$, then $\tau$ does not affect the excess risk, in the sense that $R(\alpha_0, \tau) = 0$ for all $\tau \in \mathcal{T}$.

D Regularity Conditions on the General Loss Function
Let $Y$ be a scalar outcome variable and $X$ a $p$-dimensional vector of observed characteristics. Suppose there is an observable scalar variable $Q$ such that the conditional distribution of $Y$, or some feature of it given $X$, depends on whether $Q$ exceeds an unknown threshold, as in (D.1), where $\delta_0 = \theta_0 - \beta_0$. Let $\rho : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ be the loss function under consideration, whose analytical form depends on the specific model. Suppose the true parameters are defined as the minimizer of the expected loss, where $\mathcal{A}$ and $\mathcal{T}$ denote the parameter spaces for $(\beta_0, \delta_0)$ and $\tau_0$, respectively. Here $\beta$ represents the "baseline" parameters, while $\delta$ represents the structural change; $\tau$ is the change point at which the structural change occurs, if any. By construction, $\tau_0$ is not unique when $\delta_0 = 0$. For each $(\beta, \delta) \in \mathcal{A}$ and $\tau \in \mathcal{T}$, define the $2p \times 1$ vectors $X(\tau)$ and $\alpha$ (a reconstruction of these displays is sketched below). Then $X^T\beta + X^T\delta 1\{Q > \tau\} = X(\tau)^T\alpha$, and by letting $\alpha_0 \equiv (\beta_0^T, \delta_0^T)^T$, we can write (D.1) more compactly in terms of $X(\tau)^T\alpha_0$. In quantile regression models, for a given quantile $\gamma \in (0,1)$, recall the form of the check loss.

D.1 When $\delta_0 \neq 0$ and $\tau_0$ is identified

For a constant $\eta > 0$, define the radii $r_1(\eta)$ and $r_2(\eta)$, where $B(\beta_0, r)$ and $G(\theta_0, r)$ are defined in (B.3). Note that $r_1(\eta)$ and $r_2(\eta)$ are the maximal radii over which the excess risk can be bounded below by the quadratic loss on $\{Q \leq \tau_0\}$ and $\{Q > \tau_0\}$, respectively.
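For reference, the displays elided in the setup above take the following standard forms (a reconstruction consistent with the notation used throughout; the quantile check function is the standard one):
\[
X(\tau) \equiv \big(X^T,\; X^T 1\{Q > \tau\}\big)^T, \qquad \alpha \equiv (\beta^T, \delta^T)^T,
\]
and, for quantile regression at level $\gamma$,
\[
\rho(t_1, t_2) \equiv (t_1 - t_2)\big(\gamma - 1\{t_1 \leq t_2\}\big),
\]
so that $E[\rho(Y, X(\tau_0)^T\alpha)]$ is minimized when $X(\tau_0)^T\alpha$ equals the $\gamma$-th conditional quantile of $Y$.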
(iv) There is a constant $c_0 > 0$ such that for all $\tau \in \mathcal{T}_0$, the displayed bound holds.
We focus on a convex Lipschitz loss function, which is assumed in condition (i). It might be possible to weaken the convexity to a "restricted strong convexity" condition as in Loh and Wainwright (2013); for simplicity, we focus on the case of a convex loss, which is satisfied by quantile regression. However, unlike the M-estimation frameworks of Negahban et al. (2012) and Loh and Wainwright (2013), we do allow $\rho(t_1, t_2)$ to be non-differentiable, which admits the quantile regression model as a special case.

Condition (iii) requires that the excess risk be bounded below by a quadratic function locally when $\tau$ is fixed at $\tau_0$, while condition (iv) is the analogous condition when $\alpha$ is fixed at $\alpha_0$. Conditions (iii) and (iv), combined with the convexity of $\rho(Y, \cdot)$, help us derive the rates of convergence (in the $\ell_1$ norm) of the Lasso estimators of $(\alpha_0, \tau_0)$. Furthermore, these two conditions separate the requirements on $\alpha$ from those on $\tau$, making them easier to interpret and verify.
Remark D.1. Condition (iii) of Assumption 7 is similar to the restricted nonlinear impact (RNI) condition of Belloni and Chernozhukov (2011). One may consider an alternative formulation as in van de Geer (2008) and Bühlmann and van de Geer (2011, Chapter 6), known as the margin condition. But the margin condition needs to be adjusted to account for structural changes, as in condition (iv). It would be an interesting topic for future research to develop a general theory of high-dimensional M-estimation with an unknown structural change in sparsity under general margin conditions.
Remark D.2. Assumptions 7 (iv) and 5 (iii) together imply that for all $\tau \in \mathcal{T}_0$, there exists a constant $c_0 > 0$ such that the displayed bound holds.
Assumption 8. $E[\rho(Y, X(\tau)^T\alpha)]$ is three times continuously differentiable with respect to $\alpha$, and there are constants $c_1, c_2, L > 0$ and a neighborhood $\mathcal{T}_0$ of $\tau_0$ such that the following conditions hold for all large $n$ and all $\tau \in \mathcal{T}_0$: (i) there is $M_n > 0$, which may depend on the sample size $n$, such that the displayed moment bound holds; (ii) there is $r > 0$ such that for all $\beta \in B(\beta_0, r)$ and $\theta \in G(\theta_0, r)$, $\alpha = (\beta^T, \theta^T - \beta^T)^T$ satisfies the displayed bound; (iii) $\alpha_0$ is in the interior of the parameter space $\mathcal{A}$, and the displayed non-degeneracy bound holds. The score condition at the population level is expressed by $m(\tau_0, \alpha_0) = 0$, since $\alpha_0$ is in the interior of $\mathcal{A}$ by condition (iii); an explicit expression for the score in the quantile model is given below.
Verification of Assumption 7 (i). It is straightforward to show that the loss function for quantile regression is convex and satisfies the Lipschitz condition.
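In the quantile regression model, the population score referenced above can be written explicitly (a standard computation, assuming the interchange of differentiation and expectation is valid):
\[
m(\tau, \alpha) \;\equiv\; \frac{\partial}{\partial \alpha}\, E\big[\rho(Y, X(\tau)^T\alpha)\big] \;=\; E\Big[\big(F_{Y|X,Q}(X(\tau)^T\alpha \mid X, Q) - \gamma\big)\, X(\tau)\Big],
\]
which vanishes at $(\tau_0, \alpha_0)$ because $F_{Y|X,Q}(X(\tau_0)^T\alpha_0 \mid X, Q) = \gamma$ under the conditional quantile restriction.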

Verification of Assumption 7 (ii). By (B.3) of Belloni and Chernozhukov (2011), the displayed chain of inequalities holds, where the last inequality follows immediately from the fact that $F_{Y|X,Q}(\cdot \mid X, Q)$ is a CDF.

Verification of Assumption 7 (iii). Following arguments analogous to those used in (B.4) of Belloni and Chernozhukov (2011), the mean value expansion implies the displayed bound for some intermediate value $t$ between $0$ and $z$ (see the identity sketched below). By condition (ii) of Assumption 2, taking expectations on both sides yields the displayed inequality, where the last step follows from (D.5). To see why (D.5) holds, note that by (B.5), the displayed lower bound holds for any nonzero $\beta \in B(\beta_0, r^*_{QR})$, which proves (D.5) immediately. Thus, we have shown that Assumption 7 (iii) holds for $r_1(\eta)$ with $\eta^* = C_2/4$ and $r^* = r^*_{QR}$ defined in (B.5) of Assumption 2. The case of $r_2(\eta)$ is similar and hence is omitted.
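The mean value expansion invoked here rests on the following identity for the check loss (Knight's identity; stated as a sketch in the present notation, under the conditional quantile restriction):
\[
E\big[\rho(Y, X^T\beta) - \rho(Y, X^T\beta_0)\big] \;=\; E\int_0^{X^T(\beta - \beta_0)} \Big(F_{Y|X,Q}(X^T\beta_0 + t \mid X, Q) - F_{Y|X,Q}(X^T\beta_0 \mid X, Q)\Big)\, dt,
\]
and a second-order Taylor expansion of the inner CDF difference around $t = 0$ produces the quadratic term $\tfrac{1}{2} E\big[f_{Y|X,Q}(X^T\beta_0 \mid X, Q)\,(X^T(\beta - \beta_0))^2\big]$ plus a cubic remainder, which is exactly where condition (ii) of Assumption 2 enters.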
Verification of Assumption 7 (iv). We again start from (D.4), but with different choices of the relevant quantities. Then arguments similar to those used in verifying Assumptions 7 (ii)-(iii) yield the displayed bound for $\tau < \tau_0$, where $t$ is an intermediate value between $0$ and $z$. Thus, we obtain the desired bound. The case $\tau > \tau_0$ is similar.
Verification of Assumption 8. Note that the displayed expression for the score yields $m_j(\tau_0, \alpha_0) = 0$ for all $j \leq 2p$. For condition (i) of Assumption 8, the displayed bound holds for all $j \leq 2p$ and some constant $C$, where the last inequality follows from conditions (ii), (iii), and (v) of Assumption 1. Therefore, we have verified condition (i) of Assumption 8 with $M_n = CM_2K_2|\delta_0|_2$.
We now verify condition (ii) of Assumption 8. For all $j$ and all $\tau$ in a neighborhood of $\tau_0$, the displayed bound holds, which implies the result immediately in view of Assumption 1. Finally, it is straightforward to verify condition (iii) using Assumption 2 (iii).

D.2 When $\delta_0 = 0$
We now consider the case $\delta_0 = 0$. In this case, $\tau_0$ is not identifiable, and there is in fact no structural change in the sparsity. If $\alpha_0$ is in the interior of $\mathcal{A}$, then $m(\tau, \alpha_0) = 0$ for all $\tau \in \mathcal{T}$.
(v) $\alpha_0$ is in the interior of the parameter space $\mathcal{A}$, and there are constants $c_1$ and $c_2 > 0$ such that the displayed bounds hold.
As in Lemma D.1, we now establish that Assumption 9 is satisfied for quantile regression models when $\delta_0 = 0$.
Lemma D.2. Suppose that Assumptions 1 and 2 hold. Then Assumption 9 is satisfied.

D.2.1 Proof of Lemma D.2
Verification of Assumption 9 (i). This is the same as the verification of Assumption 7 (i).
Verification of Assumption 9 (ii). This can be verified exactly as in the verification of Assumption 7 (ii), with $\alpha_0 = \beta_0$ now.
Verification of Assumption 9 (iii). By arguments identical to those used to verify Assumption 7 (iii), the displayed bound holds, where the last inequality follows from (B.7). This proves the case of $\bar{r}_1(\eta)$. The case of $\bar{r}_2(\eta)$ is similar and hence is omitted.
Verification of Assumptions 9 (iv) and (v). These can be verified similarly to the verification of Assumption 8 in the proof of Lemma D.1. For all $j$ and $\tau \in \mathcal{T}$, the displayed bound holds, which implies condition (iv) of Assumption 9 in view of Assumption 1. It is also straightforward to verify condition (v) of Assumption 9 using Assumption 2 (iii).

E Proofs of Theorems
Throughout the proofs, we define, without loss of generality,
\[
\nu_n(\alpha_J, \tau) \;=\; \frac{1}{n}\sum_{i=1}^n \rho\big(Y_i, X_{iJ}(\tau)^T\alpha_J\big) - E\rho\big(Y, X_J(\tau)^T\alpha_J\big).
\]
In this section, we suppose that Assumptions 7 and 8 hold when $\delta_0 \neq 0$ and that Assumption 9 holds when $\delta_0 = 0$.

E.1 Useful Lemmas
For the positive constant $K_1$ in Assumption 1 (i), define the quantities displayed above. Let $\lceil x \rceil$ denote the smallest integer greater than or equal to a real number $x$. The following lemma bounds $\nu_n(\alpha, \tau)$.
Proof of (E.1): By the symmetrization theorem (see, for example, Theorem 14.3 of Bühlmann and van de Geer (2011)) and then by the contraction theorem (see, for example, Theorem 14.4 of Bühlmann and van de Geer (2011)), the displayed chain of inequalities holds, where inequality (1) follows from Jensen's inequality, inequality (2) comes from the fact that $X_{ij}(\tau)$ is a step function with jump points on $\mathcal{T} \cap \{Q_1, \ldots, Q_n\}$, and inequality (3) is by Bernstein's inequality for the exponential moment of an average (see, for example, Lemma 14.12 of Bühlmann and van de Geer (2011)), combined with the elementary inequalities $\exp(|x|) \leq \exp(x) + \exp(-x)$ and $\exp(\max_{1 \leq j \leq J} x_j) \leq \sum_{j=1}^J \exp(x_j)$. Then (E.1) follows.
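In schematic form, the first two steps of this argument are (a sketch with generic constants, up to centering of the loss at a reference point; $L$ is the Lipschitz constant of $\rho$ in its second argument):
\[
E\sup_{\alpha, \tau}\, |\nu_n(\alpha, \tau)| \;\leq\; 2\, E\sup_{\alpha, \tau}\, \Big|\frac{1}{n}\sum_{i=1}^n \epsilon_i\, \rho\big(Y_i, X_i(\tau)^T\alpha\big)\Big| \;\leq\; 4L\, E\sup_{\alpha, \tau}\, \Big|\frac{1}{n}\sum_{i=1}^n \epsilon_i\, X_i(\tau)^T\alpha\Big|,
\]
where $\epsilon_1, \ldots, \epsilon_n$ is a Rademacher sequence independent of the data; the supremum over $\tau$ is then reduced to a maximum over finitely many values, which is where the step-function structure behind inequality (2) is used.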
Proof of (E.2): Recall that $\epsilon_1, \ldots, \epsilon_n$ is a Rademacher sequence, independent of $\{Y_i, X_i, Q_i\}_{i \leq n}$. Note that the displayed chain of inequalities holds, where inequality (1) is by the symmetrization theorem, inequality (2) holds for some $k \equiv \lceil \log_2(m_{2n}/m_{1n}) \rceil$, and inequality (3) follows from the contraction theorem.
Next, arguments identical to those showing (E.4) yield the displayed bound uniformly in $\tau \in \mathcal{T}$. Then, as in the proof of (E.1), Bernstein's and Markov's inequalities imply the displayed conclusion, where the last equality follows by setting $L_2 = 8L$.
Proof of (E.3): As above, by the symmetrization and contraction theorems, the displayed bound holds for some constant $C_1 < \infty$, where the last inequality is due to Theorem 2.14.1 of van der Vaart and Wellner (1996), with $M_2$ from Assumption 1 (v) and $K_2$ from Assumption 1 (ii). Specifically, we apply the second inequality of that theorem to the class $\mathcal{F}$ displayed above. Note that $\mathcal{F}$ is a Vapnik-Chervonenkis (VC) class, which has a uniformly bounded entropy integral, so that $J(1, \mathcal{F})$ in their theorem is bounded, and that the $L_2$ norm of the envelope $|\epsilon_i X_i^T\delta_0|\, 1\{|Q_i - \tau_0| < \eta\}$ is proportional to the square root of the length of $\mathcal{T}_\eta$. This implies the last inequality, with $C_1$ equal to $\sqrt{2}$ times the entropy integral of the class $\mathcal{F}$.

E.3 Proof of Theorem C.2
Define the quantities displayed above, where $\mathcal{T}_n \subset \mathcal{T}$ will be specified below, and for each $\tau$ define $\bar\alpha(\tau)$ as in (E.9). It follows from the definition of $\bar\alpha(\tau)$ in (E.9) that (E.10) holds; note also (E.11). For each $\tau$, (E.10) and the convexity of the map displayed above yield (E.12), which in turn yields the displayed inequality. Furthermore, by the triangle inequality, (E.12) can be rewritten as the displayed bound, uniformly in $\tau \in \mathcal{T}_n$.
We can repeat the same arguments for $\hat\alpha(\tau)$ instead of $\bar\alpha(\tau)$, due to (E.11) and (E.14), to obtain (E.15) uniformly in $\tau \in \mathcal{T}_n$. It remains to show that there exists a set $\mathcal{T}_n$ such that $\hat\tau \in \mathcal{T}_n$ w.p.a.1 and the corresponding $M^* = O(s)$. We split the remaining part of the proof into two cases: $\delta_0 \neq 0$ and $\delta_0 = 0$.
Therefore, the desired result follows immediately.

E.4 Proof of Theorem 4.1
Remark E.1. We first briefly describe the logic behind the proof of Theorem 4.1. The consistency result follows from the risk consistency of $R(\hat\alpha, \hat\tau)$. Then, the identification conditions for $\alpha_0$ and $\tau_0$ (Assumptions 7 (ii)-(iv)), along with Assumption 6 (i), are used to show that the risk consistency implies the consistency of $\hat\tau$.
Step 1: All three terms on the right-hand side (RHS) of (E.19) are nonnegative. As a consequence, each of the three terms on the RHS of (E.19) is bounded by $R(\alpha, \tau)$.
Proof of Step 1.
Step 1 is implied by the displayed condition. To see this, note that the first two terms are nonnegative by simply multiplying out the displayed expressions. To show that the third term is nonnegative for all $\beta \in \mathbb{R}^p$ and $\tau > \tau_0$, set $\alpha = (\beta/2, \beta/2)$ in the displayed inequality, which yields the nonnegativity of the third term.
Step 2: Let $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. Prove the displayed claim.
Suppose that $R(\alpha, \tau) < \varepsilon$. Applying the triangle inequality, for all $\beta$ and $\tau > \tau_0$, we obtain (E.20). First, note that the first term on the RHS of (E.20) is the third term on the RHS of (E.19), and hence is bounded by $R(\alpha, \tau) < \varepsilon$.

E.5 Proof of Theorem 4.2
The proof consists of multiple steps. First, we obtain an intermediate convergence rate for $\hat\tau$, based on the consistency of the risk and that of $\hat\tau$. Second, we use the compatibility condition to obtain a tighter bound.
Step 2: The displayed bound holds. Proof of Step 2. Recall the basic inequality in (E.7). Now applying Lemma E.1 to $[\nu_n(\alpha_0, \hat\tau) - \nu_n(\hat\alpha, \hat\tau)]$ with $a_n$ and $b_n$ replaced by $a_n/2$ and $b_n/2$, we can rewrite the basic inequality as (E.26). Adding the penalty term on both sides of the resulting inequality and using the displayed fact, we obtain the claim. Therefore, we have proved Step 2.
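For orientation, the basic inequality has the usual $\ell_1$-penalized M-estimation form (a schematic version; the exact display is (E.7), with the weighting matrix $D$ and tuning parameter of Section 2):
\[
R_n(\hat\alpha, \hat\tau) + \kappa_n |D\hat\alpha|_1 \;\leq\; R_n(\alpha_0, \hat\tau) + \kappa_n |D\alpha_0|_1,
\]
which holds because $\hat\alpha$ minimizes the penalized criterion at $\hat\tau$; rearranging and controlling the centered empirical process $\nu_n$ via Lemma E.1 turns this into a bound on the excess risk plus a fraction of the penalty.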
We prove the remaining steps by considering the two cases displayed above. We first consider Case (ii).

Proof of Step 3. By $\kappa_n |D(\hat\alpha - \alpha_0)_J|_1 > c_\alpha$ and the basic inequality (E.25) in Step 2, the displayed bound holds, which enables us to apply the compatibility condition in Assumption 3.
Step 4: There is a constant $C_0 > 0$ such that the displayed bound holds.

Proof of Step 4. Note the decomposition in (E.31). We bound the three terms on its right-hand side. When $\tau > \tau_0$, there is a constant $C_1$ such that the displayed bound holds, where the last inequality is by Assumptions 1, 5 (ii), 5 (iii), and 6 (ii).
Hence, the first and third terms on the right-hand side of (E.31) are bounded by $6C_1\bar{c}_0(\delta_0)(\tau - \tau_0)$.

Proof of Step 5. Recall that $X_{ij}$ is the $j$th element of $X_i$, where $i \leq n$ and $j \leq p$. By Assumption 1 and Step 1, the displayed bound holds. By the mean value theorem, we obtain the displayed expansion. Here, recall that $\overline{\tau}$ is the right endpoint of $\mathcal{T}$ and $|J(\delta_0)|_0$ is the number of nonzero elements of $\delta_0$.
Furthermore, it is worth noting that the same theorem also implies that if the displayed condition holds for some sequence $r_n$ satisfying $r_n^2 \phi_n(r_n^{-1}) \leq \sqrt{n}$, then the displayed conclusion follows.
Then, note that we can set $r_n^{-2} = a_n s_n \log(np)$ with $s_n = 1$ and $a_n = \kappa_n s \log n$, due to Lemma E.2 and the rate of convergence $\hat\alpha - \alpha_0 = O_P(\kappa_n s)$ given by Theorem 4.2. Next, we apply a chaining argument to obtain the convergence rate of $\hat\tau$ by repeatedly verifying the condition $R_n^*(\hat\tau) \leq R_n^*(\tau_0) + O_P(r_n^{-2})$, with an iteratively improved rate $r_n$. Applying Theorem 3.4.1 of van der Vaart and Wellner (1996) with $r_n = (a_n \log(np))^{-1/2}$, we have $\hat\tau - \tau_0 = O_P(a_n \log(np)) = O_P(\kappa_n s \log n \log(np))$.
Next, we reset $s_n = \kappa_n s (\log n)^2 \log(np)$ and $a_n = \kappa_n s \log n$ to apply Lemma E.2 again, and then Theorem 3.4.1 of van der Vaart and Wellner (1996) with $r_n = (s_n a_n \log(np))^{-1/2}$; the displayed improved rate follows. In the next step, we set $r_n = \sqrt{n}$, since $r_n$ must also satisfy the constraint $r_n^2 \phi_n(r_n^{-1}) \leq \sqrt{n}$. Then, we conclude that $\hat\tau = \tau_0 + O_P(n^{-1})$. Furthermore, in view of Lemma E.2, $\hat\tau = \tau_0 + O_P(n^{-1})$ implies that the asymptotic distribution of $n(\hat\tau - \tau_0)$ is identical to that of $n(\tilde\tau - \tau_0)$, where $\tilde\tau$ denotes the corresponding minimizer with $\alpha = \alpha_0$: each is characterized by the minimizer of the weak limit of $n(R_n(\alpha, \tau_0 + tn^{-1}) - R_n(\alpha, \tau_0))$ with $\alpha = \hat\alpha$ and $\alpha = \alpha_0$, respectively, and the weak limits of the two processes coincide by Lemma E.2. Therefore, we have proved the first conclusion of the theorem. Lemma E.3 establishes the second conclusion.
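In summary, the chaining argument sharpens the rate in three passes (a schematic recap of the displays above):
\[
|\hat\tau - \tau_0| = O_P\big(\kappa_n s \log n \log(np)\big) \;\;\longrightarrow\;\; |\hat\tau - \tau_0| = O_P\big(s_n a_n \log(np)\big) \;\;\longrightarrow\;\; |\hat\tau - \tau_0| = O_P(n^{-1}),
\]
with each pass feeding the improved localization $s_n$ back into Lemma E.2 before reapplying Theorem 3.4.1 of van der Vaart and Wellner (1996), until the constraint $r_n^2 \phi_n(r_n^{-1}) \leq \sqrt{n}$ binds at $r_n = \sqrt{n}$.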
Lemma E.2. Suppose that $\alpha \in \mathcal{A}_n \equiv \{\alpha = (\beta^T, \delta^T)^T : |\alpha - \alpha_0|_1 \leq Ka_n\}$ and $\tau \in \mathcal{T}_n \equiv \{\tau : |\tau - \tau_0| \leq Ks_n\}$ for some $K < \infty$ and some sequences $a_n$ and $s_n$ as $n \to \infty$. Then the displayed conclusion holds.
Proof of Lemma E.2. Note the displayed decomposition. The Lipschitz property of $\rho$ yields the displayed bound, where the $\log(np)$ term comes from Bernstein's inequality and the $s_n$ term follows from the fact that
\[
E\Big[\frac{1}{n}\sum_{i=1}^n 1\{\tau_0 < Q_i \leq \tau\}\Big] \;=\; E\big[1\{\tau_0 < Q_i \leq \tau\}\big] \;\leq\; C \cdot K s_n
\]
due to the boundedness of the density of $Q_i$ around $\tau_0$. The other term, $D_{n2}(\alpha, \tau)$, can be bounded similarly. The case $\tau < \tau_0$ can be treated analogously, and hence the details are omitted.
Lemma E.3. We have that $n(\hat\tau - \tau_0)$ converges in distribution to the smallest minimizer of the compound Poisson process defined in Theorem 4.3.
Proof of Lemma E.3. The convergence rate of $\hat\tau$ is standard, as noted at the beginning of the proof of Theorem 4.3, and the details are therefore omitted. We present the characterization of the asymptotic distribution given the convergence rate $n$.
Thus, the asymptotic distribution of $n(\hat\tau - \tau_0)$ is characterized by the smallest minimizer of the weak limit of the process $M_n(\cdot)$ displayed above. The weak limit of this empirical process is well developed in the literature (see, e.g., Pons (2003), Kosorok and Song (2007), and Lee and Seo (2008)), and the argmax continuous mapping theorem of Seijo and Sen (2011b) yields the asymptotic distribution, namely the smallest minimizer of a compound Poisson process, as defined in Theorem 4.3.
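For readers unfamiliar with this limit, the compound Poisson process in question is of the generic form typical in threshold regression (a sketch only; the jump distributions and intensities specific to this model are those given in Theorem 4.3):
\[
M(t) \;=\;
\begin{cases}
\sum_{i=1}^{N_1(-t)} z_i^{(1)}, & t < 0, \\[4pt]
\sum_{i=1}^{N_2(t)} z_i^{(2)}, & t \geq 0,
\end{cases}
\]
where $N_1$ and $N_2$ are Poisson processes and $z_i^{(1)}, z_i^{(2)}$ are i.i.d. jump variables; $n(\hat\tau - \tau_0)$ then converges in distribution to the smallest minimizer of $M(\cdot)$.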

E.7 Proof of Theorem 4.4
Let $D \equiv D(\hat\tau)$. It follows from the definition of $\hat\alpha$ in (2.5) that the displayed basic inequality holds, from which we obtain (E.36). Now applying Lemma E.1 to $[\nu_n(\alpha_0, \hat\tau) - \nu_n(\hat\alpha, \hat\tau)]$, we rewrite the basic inequality in (E.36) as the displayed bound. As before, adding $\omega_n |D(\hat\alpha - \alpha_0)|_1$ to both sides of the inequality above and using the displayed fact, we proceed as in the proof of Theorem 4.2 and consider two cases: (i) $\omega_n |D(\hat\alpha - \alpha_0)_J|_1 \leq |R(\alpha_0, \hat\tau)|$; (ii) $\omega_n |D(\hat\alpha - \alpha_0)_J|_1 > |R(\alpha_0, \hat\tau)|$. We first consider Case (ii). Recall that $\|Z\|_2 = (EZ^2)^{1/2}$ for a random variable $Z$. It follows from the compatibility condition (Assumption 3) and the same arguments as in (E.28) that the displayed bound holds. As in Step 5 of the proof of Theorem 4.2, there is a constant $C_0 > 0$ such that the displayed bound holds w.p.a.1. Combining (E.37)-(E.39) with a sufficiently small $\bar{c}$ yields (E.40) for some finite constant $C > 0$. Since $|\hat\tau - \tau_0| = O_P(n^{-1})$ by Theorem 4.3, the desired results follow from (E.40) immediately. Now we consider Case (i). In this case, the displayed bound holds. As shown in the proof of Theorem C.2, we also have the displayed bound. Therefore, we obtain the desired results in Case (i) as well, by combining (E.42) with (E.41).
We first prove the following two lemmas.
Lemma E.4. Define $\bar{Q}_n$ as displayed. Suppose that $\tau \in \mathcal{T}_n$ if $\delta_0 \neq 0$; suppose that $s^4 \log s = o(n)$ and $\tau$ is any value in $\mathcal{T}$ if $\delta_0 = 0$. Then the displayed conclusion holds.
Proof of Lemma E.4. Let $k_n = \sqrt{s \log s / n}$. We first prove that for any $\epsilon > 0$ there is $C > 0$ such that, with probability at least $1 - \epsilon$, (E.45) holds. Once this is proved, then by the continuity of $\bar{Q}_n$, there is a local minimizer of $\bar{Q}_n(\alpha_J)$ inside $\{\alpha_J \in \mathbb{R}^s : |\alpha_{0J} - \alpha_J|_2 \leq C k_n\}$; due to the convexity of $\bar{Q}_n$, such a local minimizer is also global. We now prove (E.45). Write the displayed decomposition. Then for all $|\alpha_J - \alpha_{0J}|_2 = C k_n$, the displayed bound holds,
where the last inequality follows from $M_n n^{-1}\log n < \frac{1}{3}c_1 C k_n$ and $c_3 s^{3/2} C^2 k_n^2 < \frac{1}{3}c_1 C k_n$; these follow from the conditions $M_n^2 (\log n)^2/(s \log s) = o(n)$ and $s^4 \log s = o(n)$.
To analyze term (2), by the symmetrization and contraction theorems (see, for example, Theorems 14.3 and 14.4 of Bühlmann and van de Geer (2011)), there is a Rademacher sequence $\epsilon_1, \ldots, \epsilon_n$, independent of $\{Y_i, X_i, Q_i\}_{i \leq n}$, such that the displayed bound holds (note that when $\delta_0 = 0$, $\alpha_J = \beta_J$, which is free of $\tau$). This bound is dominated by the sum of two terms, $V_{1n} + V_{2n}$, due to the triangle inequality and the fact that $|\alpha_J - \alpha_{0J}|_1 \leq |\alpha_J - \alpha_{0J}|_2 \sqrt{s}$: first, when $\delta_0 = 0$, $V_{1n} \equiv 0$; second, when $\delta_0 \neq 0$ and $\tau_0$ is identifiable, we bound the maximum over $j$ by the summation and use the maximal inequality in Theorem 2.14.1 of van der Vaart and Wellner (1996), since the displayed class of transformations is suitable. Here the bound is uniform and determined by the $L_2$ norm of the envelope, which is proportional to the displayed quantity. Note that, due to Bernstein's moment inequality (Lemma 14.12 of Bühlmann and van de Geer (2011)), the displayed bound holds for some $C_2 > 0$. Therefore, the displayed conclusion follows, where the last inequality is due to $s^2 \log n / \log s = o(n)$. Therefore, conditioning on the event $\tau \in \mathcal{T}_n$ when $\delta_0 \neq 0$, or for $\tau \in \mathcal{T}$ when $\delta_0 = 0$, with probability at least $1 - \epsilon$, the displayed bound holds. In addition, note that $P(\max_{j \in J(\alpha_0)} |w_j| = 0) = 1$, so term (3) $= 0$ with probability approaching one. Hence the displayed bound holds, where the last inequality holds for $C > 15LC_2c_1$. By the continuity of $\bar{Q}_n$, there is a local minimizer of $\bar{Q}_n(\alpha_J)$ inside $\{\alpha_J \in \mathbb{R}^s : |\alpha_{0J} - \alpha_J|_2 \leq C k_n\}$, which is also a global minimizer due to the convexity.
On $\mathbb{R}^{2p}$, recall that $R_n(\tau, \alpha) = \frac{1}{n}\sum_{i=1}^n \rho(Y_i, X_i(\tau)^T\alpha)$. Without introducing confusion, we also write $\bar\alpha = (\bar\alpha_J, 0)$ for notational simplicity; this notation indicates that $\bar\alpha$ has zero entries on the indices outside the oracle index set $J(\alpha_0)$.
We prove the following lemma.
By the mean value theorem, there is an $h$ on the segment between $\alpha$ and $(\alpha_J, 0)$ such that the displayed expansion holds. Because $h$ is on the segment between $\alpha$ and $(\alpha_J, 0)$, we have $h \in \mathcal{H}$. So for all $j \notin J(\alpha_0)$, the displayed bound holds. We now argue that we can apply Assumption 8 (ii). Let $c_n \equiv \sqrt{s(\log s)/n} + r_n$.
Thus, uniformly over $\alpha \in \mathcal{H}$, the displayed bound on $R_n(\tau, \cdot)$ holds. Also, w.p.a.1, $w_j = 1$ and $D_j \geq \underline{D}$ for all $j \notin J(\alpha_0)$. Hence, with probability approaching one, $Q_n(\alpha_J, 0) - Q_n(\alpha)$ equals the displayed expression.
Proof of Theorem 4.5. The conditions in Lemmas E.4 and E.5 are expressed in terms of $M_n$. By Lemma D.1, we verify that in quantile regression models $M_n = Cs^{1/2}$ for some $C > 0$. Then all the conditions required in Lemmas E.4 and E.5 are satisfied under the conditions imposed in Theorem 4.5.
By Lemmas E.4 and E.5, w.p.a.1, for any $\alpha = (\alpha_J, \alpha_{J^c}) \in \mathcal{H}$, the displayed inequality holds. Hence $(\bar\alpha_J, 0)$ is a local minimizer of $S_n$, which is also a global minimizer due to the convexity.
This implies that, w.p.a.1, $\hat\alpha = (\hat\alpha_J, \hat\alpha_{J^c})$ satisfies $\hat\alpha_{J^c} = 0$ and $\hat\alpha_J = \bar\alpha_J$, so the displayed conclusion holds. Finally, the desired result follows by (E.48) and the fact that $R(\alpha_0, \hat\tau) \leq Cs|\hat\tau - \tau_0| = O_P(sn^{-1})$.

E.9 Proof of Theorem 4.6

Recall that, by Theorems 4.3 and 4.5, we have (E.46), and the set of regressors with nonzero coefficients is recovered w.p.a.1. Hence we can restrict ourselves to the oracle space $J(\alpha_0)$. In view of (E.46), define $r_n \equiv \sqrt{n^{-1} s \log s}$ and $s_n$, and let $\mathcal{A}_n$ and $\mathcal{T}_n$ be as displayed. The following lemma is useful to establish that $\alpha_0$ can be estimated as if $\tau_0$ were known.
Lemma E.6 (Asymptotic Equivalence). Assume that $\frac{\partial}{\partial \alpha} E[\rho(Y, X^T\alpha) \mid Q = t]$ exists for all $t$ in a neighborhood of $\tau_0$ and that all its elements are continuous and bounded. Suppose that $s^3(\log s)(\log n) = o(n)$. Then the displayed uniform bound holds over $\alpha_J \in \mathcal{A}_n$ and $\tau \in \mathcal{T}_n$.
This lemma implies that the asymptotic distribution of $\arg\min_{\alpha_J} R_n^*(\alpha_J, \hat\tau)$ can be characterized by $\alpha_J^* \equiv \arg\min_{\alpha_J} R_n^*(\alpha_J, \tau_0)$. It then follows immediately from the variable selection consistency that the asymptotic distribution of $\hat\alpha_J$ is equivalent to that of $\alpha_J^*$. Therefore, we have proved the theorem.
Proof of Lemma E.6. Noting the decomposition of the criterion difference (sketched below), to prove this lemma we consider empirical processes and apply the maximal inequality in Theorem 2.14.2 of van der Vaart and Wellner (1996).
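The decomposition implicit in this step splits the criterion difference into two centered empirical process terms and a drift term (a sketch consistent with the notation $G_{n1}$, $G_{n2}$, and $ED_n$ used below; the exact display is elided in the source):
\[
R_n^*(\alpha_J, \tau) - R_n^*(\alpha_J, \tau_0) \;=\; G_{n1}(\alpha_J, \tau) + G_{n2}(\alpha_J, \tau) + E D_n(\alpha, \tau),
\]
where $G_{n1}$ and $G_{n2}$ collect the centered contributions of observations with $Q_i$ between $\tau \wedge \tau_0$ and $\tau \vee \tau_0$ in each regime, and $ED_n$ is the corresponding population drift; each term is then shown to be $o_P(n^{-1})$ uniformly over $\mathcal{A}_n \times \mathcal{T}_n$.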
First, for $G_{n1}(\alpha_J, \tau)$, we consider the class of functions indexed by $(\beta_J, \tau)$ displayed above. Note that the Lipschitz property of $\rho$ yields the displayed bound. Thus, we let the envelope function be $F_n(X_{iJ}, Q_i) \equiv |X_{iJ}|_2 K r_n 1\{|Q_i - \tau_0| \leq K s_n\}$ and note that its $L_2$ norm is $O(\sqrt{s}\, r_n \sqrt{s_n})$.
To compute the bracketing integral, note that the $2\epsilon$ bracketing number of this class is bounded by the product of the $\epsilon$ bracketing numbers of its two constituent classes (see Kosorok (2008)), since both classes are bounded w.p.a.1 (note that, w.p.a.1, $|X_{iJ}|_2 K r_n < C < \infty$ for some constant $C$). Let $F_{n1}(X_{iJ}) \equiv |X_{iJ}|_2 K r_n$ and $l_n(X_{iJ}) \equiv |X_{iJ}|_2$. Note that, by Theorem 2.7.11 of van der Vaart and Wellner (1996), the Lipschitz property of $\rho$ implies the displayed bound on the bracketing number, which in turn implies that, for some constant $C$, the displayed bound holds, where the last inequality holds since an $\epsilon$-ball contains a hypercube with side length $\epsilon/\sqrt{s}$ in the $s$-dimensional Euclidean space. On the other hand, for the second class of functions $\mathcal{F}_{n2}$, with the envelope function $F_{n2}(Q_i) \equiv 1\{|Q_i - \tau_0| \leq K s_n\}$, the displayed bound holds for some constant $C$. Combining these results yields the displayed bound on the bracketing integral for all sufficiently large $n$. Thus, by the maximal inequality in Theorem 2.14.2 of van der Vaart and Wellner (1996), the displayed bound holds, where the last equality follows from the restriction that $s^3(\log s)(\log n) = o(n)$. Identical arguments apply to $G_{n2}(\alpha_J, \tau)$.
Turning to $ED_n(\alpha, \tau)$, note that, by the condition that $\frac{\partial}{\partial \alpha} E[\rho(Y, X^T\alpha) \mid Q = t]$ exists for all $t$ in a neighborhood of $\tau_0$ with all its elements continuous and bounded, the displayed expansion holds for some mean value $\tilde\beta_J$ between $\beta_J$ and $\beta_{0J}$, where the last equality follows from the restriction that $s^3(\log s) = o(n)$. Since the same holds for the other term in $ED_n$, we conclude that $\sup |ED_n(\alpha, \tau)| = o(n^{-1})$, as desired.

E.11 Proof of Theorem 5.1
If $\delta_0 = 0$, then $\tau_0$ is not identifiable. In this case, we decompose the excess risk as displayed above. We split the proof into three steps.
Proof of Step 1. As in Step 1 of the proof of Theorem 4.2, Assumption 9 (iii) implies the displayed bound. For any $r > 0$, note that $R(\hat\alpha, \hat\tau) = o_P(1)$ implies that the event $\{R(\hat\alpha, \hat\tau) < r^2\}$ holds w.p.a.1. Therefore, we have shown that $\hat\beta \in \bar{B}(\beta_0, r, \hat\tau)$. The other case can be proved similarly.

Proof. By Step 2, the displayed bound holds, which enables us to apply the compatibility condition in Assumption 3. Note that in the displayed chain, (1) is simply an identity, (2) follows from Assumption 9 (iii), and (3) is due to (E.49). Hence, the displayed conclusion follows, and $(\bar\alpha_J, 0)$ is a local minimizer of $S_n$, which is also a global minimizer due to the convexity.