Functional Data Analysis with Rough Sample Paths?

Functional data are typically modeled as sample paths of smooth stochastic processes in order to mitigate the fact that they are often observed discretely and noisily, occasionally irregularly and sparsely. The smoothness assumption is imposed to allow for the use of smoothing techniques that annihilate the noise. At the same time, imposing the smoothness assumption excludes a considerable range of stochastic processes, most notably diffusion processes. Under perfect observation of the sample paths, such processes would not need to be excluded from the realm of functional data analysis. In this paper, we introduce a careful modification of existing methods, dubbed the "reflected triangle estimator", and show that this allows for the functional data analysis of processes with nowhere differentiable sample paths, even when these are discretely and noisily observed, including under irregular and sparse designs. Our estimator matches the established rates of convergence for processes with smooth paths, and furthermore attains the same optimal rates as one would get under perfect observation. Thus, with reflected triangle estimation, the scope of applicability of much of the methodology developed for discretely/irregularly/noisily/sparsely sampled functional data is considerably extended. By way of simulation it is shown that the advantages furnished are reflected in practice, hinting at potential closer links with the field of diffusion inference.


Introduction
Functional Data Analysis (FDA) treats the problem of drawing statistical inferences pertaining to the law of a random element X valued in an appropriate function space H, based on n independent (or weakly dependent) realisations X_1, ..., X_n distributed as X. The ambient function space is usually taken to be H = L^2[0, 1]. This allows for a very wide range of random functions X. Technical considerations require the minimal assumption that X be almost surely continuous, namely in order to be able to make pointwise sense of X as a stochastic process {X(t) : [0, 1] → R} (see Hsing and Eubank [9], p. 175). But once this is done, the covariance operator of X is assured to be trace-class, and one can carry out the programme set out in classical monographs such as Bosq [1] or Ferraty [5].
These monographs, as well as a great number of theory/methods papers, assume that one can perfectly observe the complete sample path (see, e.g., Dauxois et al. [3], Hall and Hosseini-Nasab [7], Hörmann and Kokoszka [10, 11], Hall and Horowitz [6], Panaretos and Tavakoli [13, 14], Delaigle and Hall [4]). This "complete observation" framework only requires continuity, and encompasses processes that can be non-smooth, even nowhere differentiable, such as diffusion processes. When paths are only observed at finitely many locations, though, the path continuity assumption alone is not deemed enough, particularly when the sampled values are additionally corrupted by noise. It is supplemented by additional smoothness assumptions, which are then leveraged in order to non-parametrically estimate the mean and covariance function, as a first step towards further analysis. For instance, the approach popularised by Ramsay and Silverman [16] assumes that X possesses C^2 paths, and advocates pre-smoothing each sampled path to construct fully observed proxies. Yao et al. [18], on the other hand, assume a C^2 covariance which they smooth directly, an approach which adapts well to irregular and sparse sampling designs. Such assumptions enable the use of FDA techniques in the more realistic scenario of discrete/noisy observation, but limit its scope to processes with smooth paths.
Our purpose is to demonstrate that it is actually not necessary to assume twice (or even once) differentiability of the sample paths in order to carry out a functional data analysis with noisily sampled paths. In particular, we show that one can obtain the same uniform and almost sure rates of convergence for mean and covariance estimation even with nowhere differentiable paths, potentially irregularly/sparsely sampled and with measurement error contamination. Our approach is based on a conceptually simple modification of procedures based on smoothing the covariance, e.g. Yao et al. [18], Hall et al. [8] and Li and Hsing [12]. The key observation is that rough but continuous sample paths may well possess a covariance function that is smooth except on the diagonal. Prominent such examples include diffusion processes like Brownian motion, the Brownian bridge, or the Ornstein-Uhlenbeck process, to name but a few. So while the path-smoothing approach will fail, the covariance-smoothing approach can still succeed if suitably modified to avoid smoothing over (and thus smearing) the singularity along the diagonal.
While the modification is conceptually simple, its ramifications go beyond a weakening of assumptions. From the point of view of theory, they challenge the tenet that Functional Data Analysis relies on smoothness of the functional data. And, in doing so, they hint at potential closer links with the area of inference for diffusions and related processes, which does not so far share considerable overlap with functional data analysis. From the practical point of view, the modification is easy to implement and enjoys the same theoretical guarantees as the well-established covariance-smoothing approach. Thus, it can essentially "automatically" extend much of the FDA toolbox constructed with covariance smoothing at its foundation, to cover a considerably broader range of processes.
To the best of our knowledge, this is the first study to estimate the covariance structure of processes with rough sample paths in the FDA setting of irregular (perhaps sparse) and noise-corrupted measurements of multiple trajectories. In a distinct context, Poß et al. [15] aimed to detect covariance singularities, in the setting of functional regression with "points of impact". However, as they pointed out, their methodology does not bear on the problem of actually estimating the covariance.

Problem and Background
Let H = L^2[0, 1] be the Hilbert space of square-integrable real-valued functions on the unit interval [0, 1], equipped with the inner product ⟨f, g⟩_H = ∫_0^1 f(t)g(t) dt and the induced norm ‖·‖_H. Let X be a random element of H (formally, a measurable map between some probability space (Ω, A, P) and H) with almost surely continuous sample paths. Let µ(t) = E{X(t)} be its mean and C(•, •) = Cov{X(•), X(•)} its covariance function. The estimation of µ and C is a basic first step involved in most functional data analyses, including regression, prediction, testing, and classification. This is to be done based on n i.i.d. curves which are observed only at some (random) nodes, subject to measurement error contamination. Concretely, we observe

Y_ij = X_i(T_ij) + U_ij,  1 ≤ i ≤ n,  1 ≤ j ≤ r_n,  (2.1)

where {U_ij} forms an i.i.d. array of measurement errors with finite variance σ^2, and {T_ij} is an array of random design points, assumed without loss of generality to satisfy T_ik < T_ij for k < j. Finally, {r_n} is the number of measurements per sample path, allowed to take different values as n grows, accordingly yielding dense (r_n → ∞ as n → ∞) or sparse (r_n < R < ∞) settings. The elements of the arrays {X_i}, {T_ij} and {U_ij} are totally independent across all indices i and j.
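As an illustrative sketch of this sampling scheme, the following snippet (ours; the function name and the choice of Brownian motion for X are illustrative, not from the paper) generates n noisy, discretely observed paths:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_design(n=100, r=20, sigma=0.1):
    """Sketch of the observation scheme (2.1): Y_ij = X_i(T_ij) + U_ij, with
    sorted uniform design points T_ij, Brownian motion paths X_i (a rough
    example process), and i.i.d. noise U_ij with standard deviation sigma."""
    T = np.sort(rng.uniform(0.0, 1.0, size=(n, r)), axis=1)  # T_ik < T_ij for k < j
    gaps = np.diff(T, axis=1, prepend=0.0)                   # time increments
    # Brownian motion at the design points: cumulated Gaussian increments
    # whose variances equal the time gaps.
    X = np.cumsum(rng.normal(size=(n, r)) * np.sqrt(gaps), axis=1)
    U = rng.normal(0.0, sigma, size=(n, r))                  # measurement error
    return T, X + U

T, Y = simulate_design()
```

Each row of `Y` is one discretely and noisily observed rough trajectory; the latent paths X_i are never available to the statistician.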
Perhaps the first general approach focusses on the dense regime (r_n → ∞) and was popularised by Ramsay and Silverman [16]. It consists in smoothing each sample path individually, i.e. applying a smoother to each of the n scatterplots {(T_ij, Y_ij)}_{j=1}^{r_n}. Typically spline or local polynomial smoothing is applied, yielding n smoothed proxy curves X̃_i in lieu of the true latent X_i(t). The covariance of the proxy curves {X̃_i} is then used as a smooth estimator of the true underlying covariance operator. Hall et al. [8] assert that this proxy covariance enjoys the same convergence rate as that of the empirical estimator obtained from fully observed curves {X_i} with no noise, under sufficiently dense observation.
Yao et al. [18] examine a similar problem, but focus on the case where r_n is finite (the sparse case). In such a setting, one cannot smooth each individual sample path. Therefore, differently from the approach of Ramsay and Silverman [16], they opt to estimate the covariance directly by smoothing the "raw covariance" 2D scatterplot {((T_ij, T_ik), Y_ij Y_ik)}_{1≤k,j≤r_n, i≤n}. To circumvent identifiability issues near the diagonal (due to the noise contamination), they only retain scatterplot points with non-coincident time stamps, {(T_ij, T_ik) | i = 1, 2, ..., n, k ≠ j}, and apply a local polynomial smoother on [0, 1]^2 to obtain an estimator. Though the procedure was motivated by the sparse case (r_n < R < ∞), it can just as well be applied to the dense case (r_n → ∞), and Li and Hsing [12] later obtained a complete description of the convergence rates of such a procedure regardless of the sampling regime, making the dependence on r_n explicit.
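A minimal sketch (ours, assuming centred data for simplicity) of assembling this off-diagonal raw covariance scatterplot:

```python
import numpy as np

def raw_covariance_points(T, Y, mu=None):
    """Off-diagonal raw covariance scatterplot in the spirit of Yao et al.:
    points ((T_ij, T_ik), (Y_ij - mu(T_ij)) (Y_ik - mu(T_ik))) with k != j.
    `mu` is a mean function; by default the data are assumed centred."""
    n, r = T.shape
    if mu is None:
        mu = lambda t: np.zeros_like(t)   # centred-data assumption for the sketch
    jj, kk = np.meshgrid(np.arange(r), np.arange(r), indexing="ij")
    off = jj != kk                        # drop the noise-inflated diagonal k = j
    Yc = Y - mu(T)
    s = T[:, jj[off]].ravel()             # first time stamp of each pair
    t = T[:, kk[off]].ravel()             # second time stamp of each pair
    z = (Yc[:, jj[off]] * Yc[:, kk[off]]).ravel()  # raw covariance "responses"
    return s, t, z
```

Each of the n·r_n·(r_n − 1) points is then fed to a bivariate smoother over [0, 1]^2.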
Neither the path-smoothing approach (Ramsay and Silverman [16]) nor the covariance-smoothing approach (Yao et al. [18]) will apply in cases where the sample paths of X are non-smooth, however:

• An essential requirement for the procedure of Ramsay and Silverman [16] is that the sample paths be of class C^2 almost surely. Therefore, there is no hope of using this procedure with paths of low regularity, for example paths that are continuous but may be nowhere differentiable.

• The procedure of Yao et al. [18] requires the covariance to be of class C^2 over [0, 1]^2, and thus the procedure does not apply to continuous processes that do not have twice (mean-square) differentiable sample paths. This excludes diffusion processes, among others, whose covariance fails to be differentiable along the diagonal.
The damage incurred by "roughness" of the sample paths to the path-smoothing approach seems difficult to repair (and in any case this approach only covers the dense sampling regime). Nevertheless, we show in the next section that a careful modification of the covariance-smoothing approach yields methodology that, on the one hand, is much more broadly applicable, including to nowhere differentiable processes like diffusion processes; and, on the other hand, attains the same rates of convergence as established by Li and Hsing [12], with the same explicit dependence on r_n (hence covering both sparse and dense regimes).

Motivation
The main observation behind our approach is that, while non-smooth processes may fail to have a covariance that is C^2 on [0, 1]^2, they may very well still possess a covariance that is smooth when restricted to the (lower) triangle Δ = {(s, t) ∈ [0, 1]^2 : 0 ≤ t ≤ s ≤ 1}, and, of course, this restriction determines the overall covariance completely. To see this, consider the case of standard Brownian motion on [0, 1], whose sample paths are nowhere differentiable almost surely (and not even differentiable in mean square). The corresponding covariance is C(s, t) = min{s, t}, and is non-differentiable on the diagonal s = t. When restricted to Δ, the covariance becomes C(s, t) = t and is infinitely differentiable (even real analytic). Similar observations can be made for the standard Brownian bridge (C(s, t) = t − st on Δ), the Ornstein-Uhlenbeck process (C(s, t) = e^{−β(t+s)} (e^{2βt} − 1)/(2β) on Δ, where β > 0 is the drift), geometric Brownian motion (C(s, t) = exp{(s + t)/2} (exp{t} − 1) on Δ), and many more examples. This class of covariance functions also includes isotropic covariance functions, where C(s, t) is a function of |s − t|, which in general admit lower-degree regularity on the diagonal segment; see panel (a) of Figure 1. These are all instances of a very general setting with a covariance function of the form C(s, t) = g(min{s, t}, max{s, t}) for some smooth function g. When this g is at least of class C^2, or equivalently when C|_Δ ∈ C^2(Δ) (see condition C(2)), we can view C as a function with symmetric components on Δ and [0, 1]^2 \ Δ, and:

1. first apply a smoother only to indices from the subset {(T_ij, T_ik) | i = 1, ..., n, k < j} to estimate C restricted to Δ; in other words, the restriction k < j means that we smooth only the scatterplot defined over the "triangle";

2. then reflect the estimate on Δ with respect to the diagonal in order to estimate C over all of [0, 1]^2.
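The two steps above can be sketched in code. The snippet below is our own minimal illustration: a Gaussian local-constant (Nadaraya-Watson) smoother stands in for the local quadratic fit defined in the next subsection, and Brownian motion supplies a rough test process:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy Brownian motion at sorted uniform design points.
n, r, sigma = 200, 10, 0.1
T = np.sort(rng.uniform(0, 1, (n, r)), axis=1)
dW = rng.normal(size=(n, r)) * np.sqrt(np.diff(T, axis=1, prepend=0.0))
Y = np.cumsum(dW, axis=1) + rng.normal(0, sigma, (n, r))

def reflected_triangle(T, Y, grid, h):
    """Step 1: smooth the cross-products with k < j only (the open triangle
    t < s).  Step 2: reflect across the diagonal.  A Gaussian local-constant
    smoother stands in for the paper's local quadratic fit."""
    kk, jj = np.triu_indices(T.shape[1], k=1)        # k < j, hence T_ik < T_ij
    s_pts = T[:, jj].ravel()                         # abscissae in the triangle
    t_pts = T[:, kk].ravel()
    z = (Y[:, jj] * Y[:, kk]).ravel()                # raw second-moment values
    S, Tt = np.meshgrid(grid, grid, indexing="ij")
    s_ev, t_ev = np.maximum(S, Tt), np.minimum(S, Tt)  # reflection step
    w = np.exp(-((s_ev[..., None] - s_pts) ** 2 +
                 (t_ev[..., None] - t_pts) ** 2) / (2 * h * h))
    return (w * z).sum(-1) / w.sum(-1)

grid = np.linspace(0.05, 0.95, 19)
G_hat = reflected_triangle(T, Y, grid, h=0.1)
```

The reflection makes the estimate symmetric by construction, and no weight ever straddles the diagonal, so the kink of min{s, t} is not smeared.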
This approach entirely avoids the irregularity along the diagonal segment. Steps 1 and 2 suggest the term "reflected triangle estimator" for our approach. By contrast, the traditional "smoothed square estimator" of Yao et al. [18] smooths the set {(T_ij, T_ik) | i = 1, ..., n, k ≠ j}. In doing so it ignores the diagonal itself (which helps with denoising), but still "reaches over" it, thus smearing the diagonal and producing an estimate that is C^1 on the diagonal (and everywhere), even asymptotically. This bias will especially impact the induced eigenvalue estimates, as these are intrinsically linked to the area surrounding the diagonal. The effect becomes more transparent if one also considers the fact that, in practice, we only have access to some discretized matrix form of the underlying covariance function.
The examples in Figure 1 illustrate the situation in two cases. In particular, Example 1 below shows that even in case (b), where we have a flat (nowhere strictly convex) surface on Δ, this phenomenon is unavoidable.
The precise definition of our estimator is provided in the next subsection. Subsection 3.3 shows that it inherits the same (optimal) rates as one would have in the smooth case.

The Reflected Triangle Estimator
Recall model (2.1), and write X_ij for X_i(T_ij). For the sake of simplicity, we drop the index of r_n and use r instead, but it is understood that r can vary with n. Using the classical decomposition C(s, t) = G(s, t) − µ(s)µ(t), with G(s, t) = E{X(s)X(t)} the second moment function, we define the estimator Ĉ(s, t) = Ĝ(s, t) − μ̂(s)μ̂(t), where μ̂(•) and its limiting behaviour can be obtained by employing well-established results in the literature (e.g. in the simulation studies, as well as in Corollary 1 below, we choose to use the linear kernel smoothing method proposed in Li and Hsing [12]).
As for the second moment function G(•, •), the main object of interest, we propose Ĝ(s, t) = â_0(s, t) on Δ, where â_0 is a local quadratic smoother of the restriction of G on Δ, defined via the weighted least squares problem (3.2), with K_{H_G}(•) a kernel with (positive semi-definite) bandwidth matrix H_G dependent on n. We highlight that the summation ranges only over 1 ≤ k < j ≤ r, i.e. only cross-product observations belonging to the interior of Δ contribute to the estimate of G on Δ. The estimate of G on [0, 1]^2 \ Δ is defined by reflection, as in Equation (3.1). Equivalently, equation (3.2) assigns weight zero to all cross products Y_ij Y_ik such that k ≥ j. Obviously, in the absence of measurement error we could also include observations on the diagonal and replace the summation set {1 ≤ k < j ≤ r} by {1 ≤ k ≤ j ≤ r}. See Figure 2.
Despite its unconventional definition, our estimator can still be expressed in closed form, and thus implemented with ease. Define the following quantities, where vech(M) is the vector obtained by column-wise vectorization of the matrix M, followed by elimination of its diagonal as well as its upper-triangular elements. The vector J, whose dimension is clear from the context, has all elements equal to one. Finally, we define the vector e_1, whose dimension is clear from context, to be the vector with leading element equal to one and zeroes elsewhere, and where K_{H_G}(T − (s, t)) should be understood element-wise.
In this notation, the local least squares problem (3.2) admits a closed-form solution; see Ruppert and Wand [17] for more details. In the remainder of the paper, we assume H_G to be diagonal, H_G = h_G^2 I_2, so that a single bandwidth h_G applies in both directions. The univariate kernel W(•) depends in general on n, and can be defined in one of the forms W_n (equation (3.4)) or W̃_n (equation (3.5)), where {σ_n}_n is a sequence of positive numbers tending to zero sufficiently fast that W_n(1+) = O(h_G^4). Section 5 explains this particular choice of kernels and W(•), and discusses other choices.
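As a hedged illustration of such a closed-form weighted least squares solution, the sketch below (ours; a local-linear design and Gaussian weights stand in for the paper's exact local quadratic specification and kernel W) evaluates e_1^T (X^T W X)^{-1} X^T W z at a single point:

```python
import numpy as np

def local_poly_at(s, t, s_pts, t_pts, z, h):
    """Weighted least squares at (s, t): a_hat = (X'WX)^{-1} X'W z, with the
    level estimate e_1' a_hat (the intercept).  Local-linear design and a
    Gaussian kernel are illustrative stand-ins, not the paper's exact choice."""
    ds, dt = s_pts - s, t_pts - t
    w = np.exp(-(ds ** 2 + dt ** 2) / (2 * h * h))   # kernel weights K_{H_G}
    X = np.column_stack([np.ones_like(ds), ds, dt])  # local-linear design matrix
    sw = np.sqrt(w)
    # Solving the sqrt-weighted least squares problem is numerically equivalent
    # to forming (X'WX)^{-1} X'W z explicitly, but better conditioned.
    a, *_ = np.linalg.lstsq(X * sw[:, None], z * sw, rcond=None)
    return a[0]                                      # e_1^T a_hat
```

A convenient sanity check is that the fit reproduces any affine surface exactly, regardless of the weights.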

Asymptotic Theory
We will now establish the almost sure uniform convergence rates of the estimator Ĝ(•, •) appearing in (3.3). Our conditions can be stated as follows:

C(0) There exists some positive number M_T for which we have 0 < M_T^{-1} ≤ f_T(t) ≤ M_T < ∞ for all t ∈ [0, 1], where f_T denotes the common density of the design points {T_ij}.

C(2) The second-order partial derivatives of C on the (lower) triangle Δ = {(s, t) ∈ [0, 1]^2 : 0 ≤ t ≤ s ≤ 1} exist and are bounded, i.e. C|_Δ ∈ C^2(Δ); differentiability at boundary points is defined via one-sided limits.
Assumptions C(0) and C(1) are standard. Assumption C(2) relaxes the usual smoothness assumption on the covariance, to accommodate non-smooth paths. Our main result on the almost sure convergence rate of our estimator is as follows:

Theorem 1. Under Conditions C(0)-C(2), for α ∈ (4, ∞), the bound (3.6) holds with probability 1.

The following corollary states sufficient conditions such that, under a dense sampling scheme, the right-hand side of (3.6) will be dominated by (log n/n)^{1/2}. This similarly happens for the discrepancy between μ̂(•) and µ(•), where μ̂ is obtained by an application of the locally weighted least squares regression method with the same univariate kernel W(•) as above and bandwidth parameter h_µ; see Theorem 3.1 of Li and Hsing [12]. It thus follows that, under dense observation, the rate of our covariance estimator matches the optimal rate under complete observation (see Corollary 4.1 of Bosq [1]), bridging the gap with the "fully observed" theory that does not assume smoothness (as described in the introduction).
In particular, we then have, with probability 1, a uniform rate of order (log n/n)^{1/2}.

Simulations
To illustrate the finite-sample performance of our method, we run a small simulation study on two examples, one involving a nowhere differentiable process, and one involving a C^2 process. In either case, we apply our "triangle" method as well as the classical "square" smoothing method (Yao et al. [18] and Li and Hsing [12], for example), and compare the results. Though our method is applicable in both examples (since it rests on weaker assumptions), the classical covariance-smoothing method only applies to the smooth example. Naturally, we expect our method to perform better in the rough example, and worse in the smooth example (since it does not take full advantage of the smoothness). The interesting point we observe, however, is that our method performs considerably better in the rough case, and comparably well in the C^2 case. See Figures 6 and 10, which depict the true underlying covariance functions as well as the estimates resulting from one of the simulation runs under each of the methods discussed above.
Example 1: First, we consider the example of a standard Brownian motion, with covariance of the form C(s, t) = min{s, t}, 0 ≤ s, t ≤ 1. This is of class C^∞ on Δ, but fails to be C^1 on the unit square. We take n = 100 and r ∈ {5, 20, 50}, and run 100 replications in each case. The results are summarised in Figures 3a-5c; to allow for a compact presentation, each figure uses blue to indicate the target value, gray or black for the values corresponding to the triangular method, and red for the square method. The proposed method considerably outperforms the existing classical approach in estimating the covariance kernel, as well as its spectrum. Upon closer inspection, we saw that the effect was particularly pronounced near the diagonal, as is reasonable to expect. This is reflected in the estimated eigenvalues: the triangular method performs better in estimating the leading eigenvalues, and substantially better for eigenvalues of larger order; see Figures 4a-4c. Figure 5 similarly shows that the triangular method performs well in estimating all the leading eigenfunctions, even for as few as r = 5 sampled points per curve. By contrast, the square method generally encounters considerable difficulties with eigenfunctions other than the leading one, and requires considerably higher values of r in order to capture higher-order eigenfunctions.
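For reference, the spectrum that Example 1 asks the methods to recover is available in closed form: the Karhunen-Loève eigenvalues of Brownian motion are λ_k = ((k − 1/2)π)^{-2}. A quick discretized check (our own, not part of the simulation study):

```python
import numpy as np

# Discretize C(s, t) = min(s, t) at midpoints and compare the spectrum of the
# resulting integral-operator approximation with the closed-form eigenvalues
# lambda_k = ((k - 1/2) * pi)^(-2).
m = 400
pts = (np.arange(m) + 0.5) / m               # midpoint grid on [0, 1]
C = np.minimum.outer(pts, pts)               # covariance matrix min(s_i, s_j)
evals = np.linalg.eigvalsh(C / m)[::-1]      # operator eigenvalues, descending
kl = 1.0 / (((np.arange(1, 4) - 0.5) * np.pi) ** 2)  # first three, closed form
```

The sum of all discretized eigenvalues also recovers the trace ∫_0^1 C(t, t) dt = 1/2, a useful consistency check.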
Example 2: Next, we consider a process X(s) = µ(s) + Z_1 g_1(s) + Z_2 g_2(s) + Z_3 g_3(s), where µ(s) = 5(s − 0.6)^2, Z_j ~ iid N(0, 0.2), g_1(s) = 1 is a constant function, g_2(s) = sin(2πs), and g_3 is chosen to be (a multiple of) the four-fold convolution of the constant function 1 over [0, 1/4]. The function g_3(s) is constructed to have a shape similar to that of cos(2πs). Consequently, this simulation scenario yields a setting similar to that of Simulation 1 in Li and Hsing [12], with the difference that, clearly, the covariance kernel here is of class C^2 but not C^3 on Δ (or the unit square). The point of this modification is to see the net effect of going from a C^0 process to a C^2 process (rather than from C^0 to C^∞, had we picked cos(2πs) instead of g_3(s)). We take n = 200, r ∈ {5, 20, 50}, and run 100 replications in each case. The simulation results are summarised in Figures 7a-9c. We use the same colour coding as before: blue for the target, grey/black for the triangular method, and red for the square method. Interestingly, the triangular method performs slightly better than the square method when it comes to the covariance as a whole (Figure 7a). The two approaches perform very similarly in estimating the leading eigenfunctions (Figure 9). When it comes to the eigenvalues, we see an interesting pattern: both methods perform comparably on the nonzero eigenvalues (up to order 3). The square method does a clearly better job at capturing the fact that the remaining eigenvalues are zero (except when going to higher orders, where both methods start having trouble). See Figures 8a-8c.

Proofs

Simple calculations show that Ĝ(•, •) admits a closed-form representation, with M_p, p = 1, 2, 3, corresponding to the elements of the row matrix appearing in (5.1). Likewise, μ̂(•) admits an analogous representation. We will make use of these expressions in our proofs.
Proof of Theorem 1. We drop the index n for the sake of simplicity. According to (3.3), and observing that

S̃_{0,0}(s, t) = S_{0,0}(s, t) − G(s, t)A_{0,0} − h_G G^{(1,0)}(s, t)A_{1,0} − h_G G^{(0,1)}(s, t)A_{0,1} =: S_{0,0}(s, t) − Λ_{0,0}(s, t),

we now investigate the different terms appearing in (5.4) separately. First, consider S̃_{0,0}. The expressions A_1 − A_4, representing the variance term, can be written in the general form

=: Z̄_{1,1}(s, t) + Z̄_{1,0}(s, t) + Z̄_{0,1}(s, t) + Z̄_{0,0}(s, t),

where each Z̄ has mean zero. The term A_5 corresponds to the bias term, for which we use a Taylor series expansion to obtain the desired almost sure uniform bound. For (5.5), observe (5.10). For (5.6) (and similarly (5.7)), we have (5.12). For (5.8), since E Z̄_{0,0}(s, t) = 0 for all 0 ≤ t ≤ s ≤ 1, we have (5.15). The remaining argument for (5.8) is similar to the proof of Lemma 8.2.5 of Hsing and Eubank [9]. In more detail, for the summation term appearing in (5.15): according to condition C(1), and choosing Q_n appropriately relative to log n, the bound holds almost surely uniformly. For B_2, we first define a suitable truncation and then apply Bennett's concentration inequality (see [2]) to complete the proof of this part. We obtain a uniform upper bound for the variances and the covariances Cov(Z_{i j_1 k_1}(s, t), Z_{i j_2 k_2}(s, t)). Choosing η large enough, we conclude summability of (5.17). This result, combined with the Borel-Cantelli lemma, completes the proof of this part. In other words, we conclude that there exists a subset Ω_0 ⊂ Ω of full probability measure such that, for each ω ∈ Ω_0, there exists n_0 = n_0(ω) with the bound (5.18) holding for all n ≥ n_0. We now turn to the bias term A_5 and investigate its convergence. Observe that it is O(h_G^2), a.s. uniformly on 0 ≤ t ≤ s ≤ 1.
(5.19) The last expression is a consequence of the smoothness condition C(2). Relations (5.18) and (5.19) complete the proof for the term S_{0,0} appearing in decomposition (5.4). The proofs for the two other terms, S_{1,0} and S_{0,1}, are similar, except that for W†(•)W†(•), appearing in (5.9) and (5.11), one may use W†(•)Ξ†_p(•), p = 1, 2, 3, instead. One may also easily modify the arguments for (5.13) and (5.14). The results for the terms M_1, M_2, M_3, and D can be obtained by a slight modification of the above argument, setting Z_{ijk} = 1 in relation (5.8).

Choice of Kernel Function
In the formulation of our methods and asymptotic theory, we recommended the use of two specific kernel functions whose form evolves with n, namely W_n(u) (equation (3.4)) and W̃_n(u) (equation (3.5)). These differ from the more standard choice of a smooth, compactly supported kernel whose form is fixed with respect to n. The reason for this is technical: with such a fixed choice, there is a positive probability that the estimators μ̂ and Ĝ may fail to exist on a set of positive Lebesgue measure, regardless of sample size. To see this, consider the mean estimator, and a standard design where the {T_ij} are independent and uniform on [0, 1]. In such a case, the support of the mean is sampled at nr_n independent and uniform locations, and writing T_(1) = min_{i≤n, j≤r_n} T_ij, we have P{T_(1) > h} = (1 − h)^{nr_n} > 0, for all n ≥ 1. On this event, the estimator is undefined on an open set near the left boundary. This is not just a boundary issue: it can also occur in the "interior" of the domain of the mean function, i.e. it can be shown that P{∪_{j=2}^{nr_n} {T_(j) − T_(j−1) > 2h}} > 0, where {T_(j)}_{1≤j≤nr_n} are the order statistics of {T_ij}. To circumvent this problem, we choose a sequence such that W_n(u) = 0 only for n and u ∈ R outside an appropriate, n-dependent range.
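The failure probability above is elementary to evaluate, and easy to confirm by simulation; the following sketch (ours; the function name is illustrative) does both:

```python
import numpy as np

def p_undefined_left(n, r, h):
    """P{T_(1) > h} = (1 - h)^(n r): the probability that no design point
    falls within h of the left boundary -- the event on which a smoother
    with fixed bandwidth h is undefined there.  Positive for every n,
    which is what motivates the n-dependent kernels W_n."""
    return (1.0 - h) ** (n * r)

# Monte Carlo confirmation (our own illustration, not from the paper):
rng = np.random.default_rng(1)
n, r, h = 20, 5, 0.02
mins = rng.uniform(0, 1, (20000, n * r)).min(axis=1)  # 20000 replicated designs
emp = np.mean(mins > h)                               # empirical frequency
```

Even with n r = 100 design points, the event has non-negligible probability, so the problem does not vanish at practical sample sizes.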

Fig 2: Illustration on a single Brownian motion path: (a) The sample path in blue, five sampled values in black, and the corresponding noise-corrupted observations as red asterisks. (b) The cross products of all noisy observations. Only observations in blue are used in the definition of the triangular estimator. Observations in red are excluded to annihilate the measurement error, and observations in black are excluded from smoothing to avoid smearing the diagonal.

Fig 10: (a): Covariance function of Example 2. (b): Ĉ, based on one of the simulation runs, applying the proposed reflected triangle approach. (c): Ĉ, based on the same simulation run as part (b), applying the squared domain approach.

C(3) The Fourier transforms of the kernel sequence W_n are integrable with bounded integrals, i.e. ∫ |W†_n(u)| du < M < ∞.

C(4) Defining Ξ_{n,p}(u) = W_n(u) u^p, the Fourier transforms of Ξ_{n,p} are integrable with bounded integrals, i.e. ∫ |Ξ†_{n,p}(u)| du < M < ∞, for p = 1, 2, 3.

C(5) The kernel functions W_n(•) are symmetric around zero and monotone non-increasing on [1, +∞).

C(6) The functions Ξ_{n,p}(•) are non-increasing on [1, ∞), and the limiting objects Ξ_{n,p}(1+) := lim_{u↓1} Ξ_{n,p}(u) (= W_n(1+)) tend to zero as n tends to infinity, for p = 1, 2, 3.

Here we use g†(•) to denote the Fourier transform g†(ω) = (1/2π) ∫ e^{−iuω} g(u) du of a generic function g(•), and i to denote the imaginary unit. Kernels satisfying C(3)-C(6) will still yield Theorem 1, but with an additional term involving W_n(1+) (scaled by a negative power of h_G) on the right-hand side of (3.6). To remove this additional term, one needs to ensure that it is dominated by the remaining terms, and this requires specifying the kernel more concretely. Our special choices of W_n(u) (equation (3.4)) and W̃_n(u) (equation (3.5)) do just that: these are kernels that satisfy C(3)-C(6) while also ensuring that the extra term is dominated by the remaining terms.