Trends? Complicated answers to a simple question

ABSTRACT Trend significance of serially correlated time series is addressed once more. Most conventional techniques to "pre-whiten" a series prior to calculating trends rely on the assumption of autoregressive residual noise, AR(1). Monthly recordings of 40 water level stations in Germany are investigated, revealing strong memory up to lag 2. A new scheme (PW(p)) is introduced that extends pre-whitening to AR(p) with p > 1. It performs well on surrogate series with prescribed trend and memory. For seven series the estimated trends are unrealistically off, raising doubts about the validity of the basic assumption of short-memory noise. These series are characterized by a hockey stick pattern, from which any pre-whitening produces trends that are anything but plausible. The pattern also reveals that pre-whitening is not invariant under time reversal. Regardless of the validity of the noise model, these special cases serve as a warning against using pre-whitening in general.


Introduction
One tends to think that the estimation of trends from time series is, albeit crucial, a relatively simple statistical exercise. Very often it comes down to whether a found trend is random (i.e. zero) or not. Often, however, the series are too short to establish this reliably. This fact is aggravated by the presence of autocorrelation, which works to reduce the effective length of a series and thus increases the variance of the estimators and with it the rate of type I errors; the result is that more trends are falsely declared as nonzero (Lettenmaier 1976). From early on there was awareness of the problem of autocorrelation, and recipes to overcome it were soon developed, mostly in the field of econometrics (Cochrane and Orcutt 1949, Prais and Winsten 1954, Hildreth and Lu 1960). They were usually based on what was later called "pre-whitening," a process that can wipe out residual noise memory (syn. autocorrelation) if it comes from a low-order autoregressive (AR) process. A somewhat simpler and more direct alternative to pre-whitening is variance correction, which works by directly adjusting the degrees of freedom of an autocorrelated series and the corresponding significance levels (Lettenmaier 1976, Hamed and Rao 1998, Yue and Wang 2004). Thorough reviews are presented by Khaliq et al. (2009) and, more recently, Serinaldi and Kilsby (2016).
To briefly lay out the overall problem, heuristic estimates from simple Monte Carlo experiments are displayed in Fig. 1, demonstrating the influence of a linear trend β on estimates of autocorrelation ρ and vice versa; details are given in Appendices A-C. For the former, the estimated ρ from the series without a deterministic trend correctly converges with increasing series length to its true value of 0.7, whereas the estimates with a trend converge to ρ̂ = 1. Conversely, for samples of AR(1) series with growing ρ and no deterministic trend, the critical (95%) β-levels clearly increase with ρ, resulting in an excess of the corresponding type I errors over the nominal value of 5%.
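The mutual contamination of trend and autocorrelation estimates can be reproduced in a few lines. The following sketch is a minimal illustration of the principle only, not the Appendix A-C setup; series length, trend size, sample size, and seed are arbitrary choices. It simulates AR(1) noise with ρ = 0.7 and shows how an added linear trend pushes the estimated lag-1 autocorrelation towards 1:

```python
import numpy as np

def ar1(n, rho, rng):
    """Generate an AR(1) series e_t = rho*e_{t-1} + eps_t."""
    eps = rng.standard_normal(n)
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = rho * e[t - 1] + eps[t]
    return e

def lag1_autocorr(y):
    """Sample lag-1 autocorrelation."""
    y = y - y.mean()
    return np.dot(y[1:], y[:-1]) / np.dot(y, y)

rng = np.random.default_rng(42)
n, rho, beta = 2000, 0.7, 0.01
t = np.arange(n)

est_plain, est_trend = [], []
for _ in range(200):
    e = ar1(n, rho, rng)
    est_plain.append(lag1_autocorr(e))             # noise only
    est_trend.append(lag1_autocorr(e + beta * t))  # same noise plus a trend

# mean of est_plain is close to the true rho = 0.7;
# mean of est_trend is pushed towards 1 by the deterministic trend
```

The effect works in one direction only here; the reverse effect of ρ on the β significance levels requires the critical-value simulation of Fig. 1.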
Although the above econometric recipes were readily available by 1980 or earlier, they seem to have escaped other research communities. For climate research, Kulkarni and von Storch (1995) (KS) suggested a "cure" for autocorrelation by estimating the trend from the pre-whitened series. While the cure was able to reduce, as intended, type I errors back to the nominal significance level, the trend estimator itself shrinks a real trend and inflates the type II error rate of corresponding tests (Wang 2002). By applying the "cure" to both sides of the trend equation, as explained below, Wang and Swail (2001) avoid the shrinking of trends and obtain more realistic type II error rates.
An entirely different approach was taken by Yue et al. (2002) (Y): in response to the KS issues they propose to keep the originally estimated trend β, and instead let only the trend residuals undergo pre-whitening (after which they are added back to the trend). This method, along with a scaled variant, has found the widest application by far, probably because an R package (zyp) exists that became very popular. According to the package documentation, it handles the Yue et al. method together with the method of Wang and Swail (2001). Unfortunately, an undocumented variant of Y had long served as the default (pers. communication A. Werner, co-author of zyp); it has since been updated to use the Wang and Swail method instead; more details are given in Appendix D. The Y method itself has been criticized repeatedly (Bayazit and Önöz 2004, Zhang and Zwiers 2004, Bürger 2017, 2022, Collaud Coen et al. 2020): if only the residuals are shrunk, the signal-to-noise ratio is bound to be inflated and, again, so is the type I error rate. Using Monte Carlo experiments with stationary synthetic series (Önöz 2007, Collaud Coen et al. 2020), the debate was somewhat settled in the sense that cases with limited memory, for which autoregressive processes of order 1 (AR(1)) are appropriate, could be handled well enough using the Wang and Swail (2001) method and some of its variants. Ironically, that method turns out to be just a special case of the Cochrane-Orcutt (CO) procedure established in econometrics some decades earlier, which apparently has never been reported although it follows easily from the basic equations. The CO approach is often referred to as feasible generalized least squares, since for β it employs a generalized least squares procedure with an unknown but independently estimated AR parameter, ρ̂ (see Section 2).
A different line of development emerged by merging the ρ- and β-regressors directly into one regression problem. For example, Zhang et al. (2000) regress a series directly against time and a time-lagged copy of itself; using the same approach, Hamed (2009) additionally considers AR estimation bias and fractional noise, and Klaus et al. (2015) use autoregressive integrated moving-average models for the noise. These approaches are applied in situations with ρ ≪ 1, for which the approximate normality of the ordinary least squares (OLS) estimates for ρ may be acceptable, and collinearity between the ρ- and β-regressors (see Ariens et al. 2022) may likewise be negligible. One must be aware, however, that these approaches do not fit a linear trend line to a given series, as does the classical OLS estimate (see below).
This study introduces a pre-whitening scheme for the general (p ≥ 1) case of AR(p) noise that builds on the feasible generalized least squares approach. It is verified against a set of 40 water level stations from Northern Germany that are characterized by very strong autocorrelation, and compared to the classical scheme of order 1. Verification is done using the original as well as surrogate series with prescribed parameters, the latter being estimated ad hoc from the observations.
A group of particularly problematic station levels is identified that have the characteristic appearance of a hockey stick (not to be confused with the millennial hockey stick temperature pattern of Mann et al. (1999); the term simply describes a specific shape and applies to deterministic and stochastic series alike), which raises the question of nonstationarity and long memory. And, finally, two issues related to that shape are identified that call into question the use of pre-whitening in general.

Methods and data
To formulate the problem precisely, the assumption is that the time series y = {y_t} under investigation can be expressed as a deterministic linear trend plus an error term,

y = Xβ + e, (1)

where X = [1 x], with x = {x_t} representing a time function (a counter, a date; intercepts are thus included), and e = {e_t} some stationary noise process. The simplest solution is obtained if the error e is a white noise process. In that case, for any observed time series realization X, y, the OLS estimate

β̂ = X⁺y, (2)

with "+" denoting the pseudoinverse, is the best linear unbiased estimator and the maximum likelihood solution with a known distribution. A weaker assumption is that e is an autoregressive process of order p ≥ 1, AR(p), for which

e_t = φ_1 e_{t−1} + … + φ_p e_{t−p} + ε_t (3)

holds, with AR coefficients φ = (φ_i), i ≤ p, and a white noise process ε = {ε_t}. At this point it is convenient to introduce the "backshift" operator, B, which shifts a time series v one time
step backwards: B v_t = v_{t−1}. Equation (3) can now be reformulated as

Δ_φ e = ε, with Δ_φ = 1 − φ_1 B − … − φ_p B^p. (4)

The operator Δ_φ is, like B, a linear operator with which arithmetic operations (sum, product) can be performed as with any matrix. Accordingly, if the autoregressive error process e can be assumed to be of order p with known coefficients φ, one can apply the operator to Equation (1),

Δ_φ y = Δ_φ X β + Δ_φ e, (5)

and with Equation (4) one obtains

Δ_φ y = Δ_φ X β + ε. (6)

Equation (6) is called the pre-whitening equation (or model) of order p, denoted PW(p). The resulting reduction of the sample size by p can be circumvented for the case p = 1, in which case the result is nothing other than the generalized least squares solution (Prais and Winsten 1954). Apart from that, the equation is equivalent to Equation (1) and yields an OLS solution β about which statistical inferences such as trend significance (β ≠ 0) become possible. Here and in the following, trends are estimated using OLS, whose solution is given by Equation (2). Other estimates, such as the robust, non-parametric Theil-Sen method (Theil 1950), give overall similar results and are not considered here.

For the moment, let us consider the case p = 1 in Equation (6). This is the form originally proposed by CO in 1949, and by using x_t = t it is equivalent to Equation (A10) of Wang and Swail (2001). The following must be observed: CO and Wang and Swail (2001) apply pre-whitening to both sides of the trend equation (i.e. Equation (1)), so that the pre-whitening equation appears as a logical consequence of the trend equation, with a unique β as slope for both the y and the Δ_φ y series. In contrast, KS, Zhang et al. (2000), Hamed (2009) and many others, by adding the lagged series as an extra regressor, create an entirely new equation for pre-whitening (equivalent to restricting pre-whitening to the left-hand side of the trend equation). Accordingly, the resulting "trend" is not meant as a linear fit to the original series, a fact that must be stressed when it is used in practice.
For Δ_φ, the AR(1) coefficient φ is obtained from the Yule-Walker equations as φ = ρ, with ρ the lag-1 autocorrelation. But for real-world applications the autocorrelation ρ is unknown and must be estimated from the data. Mainly three procedures exist to deal with this case: estimated (feasible) generalized least squares, nonlinear optimization, and maximum likelihood (ML) (Kotz et al. 2005, p. 308). The CO procedure belongs to the first group and proceeds by first estimating ρ from the untransformed data, followed by the OLS estimation of β from Equation (6); these steps are iterated until sufficient convergence is achieved. Of the second type is the Hildreth and Lu (HL) method: by inserting the solution β = (Δ_ρ X)⁺ Δ_ρ y back into Equation (6), one may formulate another optimization for the AR parameter ρ by minimizing the squared residual ε²(ρ) = ||Δ_ρ y − Δ_ρ X β||², with (ρ, β = (Δ_ρ X)⁺ Δ_ρ y) giving the HL solution. And, thirdly, the ML solution is obtained from HL by minimizing instead the scaled term (1 − ρ²)^(−1/n) ε²(ρ). The procedure proposed by Yue et al. (2002) (Y) is altogether different: it does not use pre-whitening at all for the trend itself; only the trend residuals undergo pre-whitening, after which they are added back to the original trend. This procedure, along with CO, HL, and ML, produces consistent estimates of ρ and β (i.e. estimates converging to the true values with increasing series length) if the error structure (see Equation (3)) is known (see Figs S2 and S3). If it is not, the situation is very different, and the optimization algorithms no longer safely converge to identical solutions. The reason lies in the different ways in which the error function is minimized: while CO alternates between the two independent optimizations for autoregression (ρ) and trend (β), HL/ML optimize both ρ and β uniformly on the trend.
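The CO iteration can be sketched in a few generic lines (a schematic illustration of the procedure, not the implementation used in this study; function names and the synthetic sanity check are ad hoc). It alternates between a Yule-Walker-type estimate of ρ from the trend residuals and the OLS fit of the pre-whitened Equation (6):

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares via the pseudoinverse, as in Equation (2)."""
    return np.linalg.pinv(X) @ y

def cochrane_orcutt(t, y, tol=1e-8, max_iter=100):
    """Alternate between estimating rho from the trend residuals and
    estimating the trend from the pre-whitened PW(1) equation."""
    X = np.column_stack([np.ones_like(t, dtype=float), t])
    b = ols(X, y)
    rho = 0.0
    for _ in range(max_iter):
        e = y - X @ b
        rho_new = np.dot(e[1:], e[:-1]) / np.dot(e, e)
        # pre-whiten both sides: v_t - rho*v_{t-1}
        Xw = X[1:] - rho_new * X[:-1]
        yw = y[1:] - rho_new * y[:-1]
        b = ols(Xw, yw)
        if abs(rho_new - rho) < tol:
            rho = rho_new
            break
        rho = rho_new
    return rho, b[1]  # AR(1) coefficient and trend slope

# sanity check on a synthetic AR(1)-plus-trend series
rng = np.random.default_rng(0)
n, true_rho, true_beta = 3000, 0.7, 0.05
e = np.zeros(n)
for i in range(1, n):
    e[i] = true_rho * e[i - 1] + rng.standard_normal()
t = np.arange(n, dtype=float)
y = 2.0 + true_beta * t + e
rho_hat, beta_hat = cochrane_orcutt(t, y)
```

Because both sides of the trend equation are transformed, the slope returned here is the slope of the original series, unlike the KS-type formulations discussed above.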
Getting back to the general case p ≥ 1, the PW(p) solution β may be found by employing one of CO, Y, HL, or ML. For example, the HL solution for the general PW(p) problem is obtained as the optimizer of ε²(φ) = ||Δ_φ y − Δ_φ X β||², with β = (Δ_φ X)⁺ Δ_φ y. As mentioned earlier, to obtain Equation (6), both sides of Equation (1), i.e. the dependent and the independent variables, are pre-whitened. Otherwise, the trend parameters β in Equations (1) and (6) are not per se identical, which applies to KS and other estimators, and which also creates a low bias, as briefly sketched below and in Appendix E. In my short comment (Bürger 2022) the predictor was incorrectly pre-whitened, so that that version (called "vS" there) did not reflect the original KS approach; a second error in the comment, regarding the use of the backshift operator, will be addressed below (a corrigendum is in preparation with an intended link to this study). Table 1 gives a summary of the compared methods.

For a group of water level observations the potential benefits and pitfalls of using the four mentioned pre-whitening recipes - CO, HL, ML, and KS - shall be explored, relative to applying no pre-whitening at all (equivalent to method Y for trend size). The data are monthly readings from 23 lakes and 17 groundwater (GW) wells in Northern Germany between 1985 and 2013 (336 months). They were previously analysed by Lischeid et al. (2021), who declare 11 out of the 40 series to exhibit a significant trend. Here, their analysis is repeated and their results can be replicated, provided that the scaled variant of Y is used and not the original Y as reported. As mentioned, this variant was until recently the default of the zyp package (it resembles Wang and Swail (2001) but lacks a decisive rescaling that makes the latter equivalent to CO), and as such it was in fact used by Lischeid et al. (pers. comm.).
Trend significance is generally determined using the non-parametric Mann-Kendall test at the 5% level, under the premise that the basic model of Equation (1) holds for the pre-whitened series.
Before analysing the data and trends themselves, Monte Carlo experiments with synthetic data are conducted. For each station a trend is estimated ad hoc using the OLS estimator, and the coefficients of an AR(p), p ≤ 2, process are fitted to the detrended data using the Yule-Walker equations. With these coefficients a sample of 1000 AR(p) series, p = 1, 2, of the same length is generated and added back to the original trend, thus forming station surrogates that have pre-defined linear trends and noise characteristics. From these surrogates a trend is estimated using the five pre-whitening methods, and the resulting trend distribution is compared against the pre-defined trend. To check estimator consistency, the experiments are repeated with a series length of 10 000 and a sample size of 10.
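The surrogate construction can be sketched as follows (a minimal illustration with hypothetical helper names; the study's actual fits are per station and per method). It chains an OLS detrending, a Yule-Walker AR(2) fit, and a re-synthesis with the prescribed trend, and closes the loop by checking that a surrogate returns coefficients close to the prescribed ones:

```python
import numpy as np

def detrend(y):
    """Remove the OLS line; return (fitted trend, residuals)."""
    t = np.arange(len(y), dtype=float)
    X = np.column_stack([np.ones_like(t), t])
    trend = X @ (np.linalg.pinv(X) @ y)
    return trend, y - trend

def yule_walker_ar2(e):
    """AR(2) coefficients from the first two sample autocorrelations."""
    e = e - e.mean()
    c0 = np.dot(e, e)
    r1 = np.dot(e[1:], e[:-1]) / c0
    r2 = np.dot(e[2:], e[:-2]) / c0
    phi1 = r1 * (1.0 - r2) / (1.0 - r1**2)
    phi2 = (r2 - r1**2) / (1.0 - r1**2)
    return phi1, phi2

def ar2_surrogate(trend, phi1, phi2, sigma, rng):
    """AR(2) noise with the given coefficients, added to the trend."""
    n = len(trend)
    e = np.zeros(n)
    for t in range(2, n):
        e[t] = phi1 * e[t - 1] + phi2 * e[t - 2] + sigma * rng.standard_normal()
    return trend + e

# round trip on a synthetic "station" (long series for a tight check)
rng = np.random.default_rng(2)
trend = 5.0 + 0.001 * np.arange(20000)
y = ar2_surrogate(trend, 0.5, 0.3, 1.0, rng)
_, resid = detrend(y)
phi1, phi2 = yule_walker_ar2(resid)
```

For the station-length series (336 months) the same round trip is of course noisier, which is what the surrogate ensembles quantify.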
After dealing with PW(1), the model PW(2) is explored. While the generalization of the HL model to orders p ≥ 1 is straightforward, this does not hold for the ML approach. For the extra scaling required by ML an ad hoc approach is used, with the corresponding error term being approximated as (1 − ||φ||²)^(−1/n) ε²(φ).
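For p = 2, the HL optimization can be sketched as a simple grid search over the AR(2) stationarity triangle, minimizing the residual variance of the pre-whitened trend equation. This brute-force grid merely stands in for a proper nonlinear optimizer; grid resolution and the synthetic test series are arbitrary choices:

```python
import numpy as np

def pw_residual_var(y, phi):
    """Residual variance of the PW(p) trend equation (Equation (6))
    for AR coefficients phi = (phi_1, ..., phi_p)."""
    p = len(phi)
    n = len(y)
    t = np.arange(n, dtype=float)
    X = np.column_stack([np.ones(n), t])
    Xw = X[p:].copy()
    yw = y[p:].copy()
    for i, ph in enumerate(phi, start=1):
        Xw = Xw - ph * X[p - i:n - i]   # Delta_phi applied to X
        yw = yw - ph * y[p - i:n - i]   # Delta_phi applied to y
    b = np.linalg.pinv(Xw) @ yw
    r = yw - Xw @ b
    return np.dot(r, r) / len(r)

def hl_pw2(y, ngrid=41):
    """Hildreth-Lu for PW(2): grid search over (phi_1, phi_2)."""
    grid = np.linspace(-0.98, 0.98, ngrid)
    best = (np.inf, 0.0, 0.0)
    for p1 in grid:
        for p2 in grid:
            # AR(2) stationarity triangle
            if p2 + p1 < 1 and p2 - p1 < 1 and abs(p2) < 1:
                c = pw_residual_var(y, (p1, p2))
                if c < best[0]:
                    best = (c, p1, p2)
    return best[1], best[2]

# recover prescribed AR(2) coefficients from a trended series
rng = np.random.default_rng(4)
n = 2000
e = np.zeros(n)
for t in range(2, n):
    e[t] = 0.5 * e[t - 1] + 0.3 * e[t - 2] + rng.standard_normal()
y = 1.0 + 0.01 * np.arange(n) + e
phi1, phi2 = hl_pw2(y)
```

The returned coefficients are accurate only up to the grid spacing; a real implementation would refine the grid or switch to a derivative-based optimizer near the minimum.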

Autocorrelation
Most of the monthly water level variations reveal a fairly strong seasonal cycle, as shown in Fig. 2 for the station Carwitzer See, and it represents a considerable portion of the total variance. The remaining variations are, apart from a possible trend, governed by longer, multiannual fluctuations whose origin is (for now) unknown and which are treated here as stochastic. To demonstrate the effect of such deterministic variations on the estimation of stochastic parameters, Fig. 3 displays the process memory across multiple lags by way of the partial autocorrelation function, which measures the autocorrelation that a given lag produces in addition to all previous lags. The figure shows, again for Carwitzer See, the corresponding values up to lag 20, using the detrended series of the original and de-seasonalised water levels. For lag = 1, the autocorrelation is near 1 for both series, and for the original levels there is, except for lag = 3, significant memory up to lag = 7. For the de-seasonalised levels the only significant autocorrelations occur at lag ≤ 2. This result extends to all stations, which is why from now on the de-seasonalised series are used. Figure 4 shows the partial autocorrelations at lags 1-3 for all detrended station series. For lag = 1, essentially all values are very near 1; merely the station Schwielochsee is smaller. Even for lag = 2 most of the values remain significantly nonzero (with only Bergfeld (GW) being positive), and just a handful of them are near zero. At lag = 3, except for a few stations, all autocorrelations are near zero. Based on the autocorrelation results, an AR process of order 2 is justified to fit most of the level data (unless AR is an unsuitable error model in the first place); the corresponding coefficients are shown in Fig. S1 (Supplementary material).
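For reference, the partial autocorrelation function can be computed from the sample autocorrelations via the Durbin-Levinson recursion. The sketch below is a generic textbook implementation, not tied to the station data; it reproduces the AR(1) signature of a single surviving partial autocorrelation at lag 1:

```python
import numpy as np

def pacf(y, nlags):
    """Partial autocorrelations (lags 1..nlags) via the
    Durbin-Levinson recursion on the sample autocorrelations."""
    y = y - y.mean()
    n = len(y)
    r = np.array([np.dot(y[k:], y[:n - k]) / np.dot(y, y)
                  for k in range(nlags + 1)])
    phi = np.zeros((nlags + 1, nlags + 1))
    pac = np.zeros(nlags)
    phi[1, 1] = r[1]
    pac[0] = r[1]
    for k in range(2, nlags + 1):
        # phi[k, k] is the partial autocorrelation at lag k
        num = r[k] - np.dot(phi[k - 1, 1:k], r[k - 1:0:-1])
        den = 1.0 - np.dot(phi[k - 1, 1:k], r[1:k])
        phi[k, k] = num / den
        phi[k, 1:k] = phi[k - 1, 1:k] - phi[k, k] * phi[k - 1, k - 1:0:-1]
        pac[k - 1] = phi[k, k]
    return pac

# AR(1) with rho = 0.8: only the lag-1 partial autocorrelation survives
rng = np.random.default_rng(5)
n = 5000
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + rng.standard_normal()
pac = pacf(e, 3)
```

An AR(2) process would instead leave two nonzero partial autocorrelations, which is the diagnostic used for the station series in Fig. 4.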
One notes that these AR(2) coefficients are quite similar to the first two partial autocorrelations (according to Yule-Walker theory this is not surprising).

The distribution of the estimated β̂ is shown in Fig. 6 for Lake Müritz using the AR(1)-surrogates; the AR(2)-surrogates give a similar picture. While the β estimates are essentially unbiased for both types of surrogates, this does not extend to the ρ estimates, which, as revealed by Fig. S4, have a small negative bias, except for KS. For the AR(2)-surrogates the bias is milder and the ρ-spread smaller, a fact that holds essentially for all stations (not shown). The reduced ρ-spread is possibly related to the influence of the mostly negative partial autocorrelations at lag = 2 (see Fig. 4). Checking the PW(2) methods against the AR(2)-surrogates, Fig. S5 shows small biases for KS, CO, and Y but not for HL/ML. Based on the results for β, the analysis of KS is skipped from now on and the focus lies solely on the four methods CO, Y, HL, and ML. That they all achieved comparable results for the surrogates simply reflects the fact that the underlying noise model was correctly specified and the methods were able to discover the corresponding parameters. When turning now to the case of observed water levels, this of course no longer holds: one has to assume that a linear trend exists and that the residuals follow a certain noise model.

Water levels with PW(1)
To begin, the estimated trends for most stations are shown in Fig. 7; seven stations with problematic estimates are skipped for now. There are roughly equal numbers of upward and downward trends, with most of the groundwater trends being negative. While for the majority of stations the trend estimates are mutually consistent, for some stations they are not, in part with large discrepancies. For Grosser Glietzensee, for example, the CO, HL, and ML estimates are positive and the Y estimate is negative; for Klein Trebbow (GW) the Y trend is near zero and the other trends are strongly positive. Note that the Y method almost always produces significant trends, unlike the other methods, which show few but fairly uniform significant trends, such as for Stechlin, P15 (GW), or P26 (GW). The seven stations with problematic estimates are Poratz, Peetschsee, Fürstener See, Krummer See, Hechtsee, and P12/P42 (GW); here the HL/ML optimization did not converge at all to a realistic value, as discussed further below.

Langer See Weisdin is one of the moderate examples of trend discrepancy, which nevertheless appears quite eye-catching when displayed as a time series (see Fig. 8). Superimposed on rather smooth inter-annual variations is a visible negative trend that is well captured by the non-pre-whitening OLS estimate, as provided by the Y method; the pre-whitening methods, however, deliver a trend that is roughly half of the OLS trend. Returning now to the problem of divergence, Fig. 9 demonstrates the striking discrepancy of the trend estimates. The observed fluctuations for Lake Peetschsee follow a negative long-term trend until around AD 2010, after which the levels start to rise again quite strongly. The visual appearance of a long-term negative trend is confirmed by the non-pre-whitening OLS estimate provided by the Y method; all other methods, however, estimate an obviously nonsensical positive trend.
As will soon become clear, it is the special pattern around AD 2010 that has a crucial impact on all pre-whitening trend estimates. Figure 10 may explain some of the implausible trend estimates of Fig. 9, and why the HL/ML optimization can fail in the seven mentioned cases. It shows the cost function of the trend regression for the station Poratz (GW), for the observations as well as one random surrogate. While the surrogate cost reveals a clear optimum near the true value of ρ = 0.967, the observation cost monotonically decreases as ρ → 1, without ever attaining an optimum < 1. For the pair (ρ, β) this creates a runaway optimization leading to ρ → 1, β → ∞ or, in other words, divergence, with trend estimates that may end up orders of magnitude away from the non-pre-whitening estimate (provided by Y). Getting back to the partial autocorrelations of Fig. 4, one notices abnormal values for several of the seven problematic stations. For most stations the very high lag = 1 values are accompanied by strong negative values at lag = 2; not so for the seven stations, whose autocorrelations at lag = 2 are only weakly negative or even positive, adding to the overall memory of the process.
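The runaway behaviour can be reproduced with a stylized hockey stick (a synthetic illustration, not the Poratz data; segment lengths and values are ad hoc). The PW(1) cost, evaluated on a grid of fixed ρ values, keeps decreasing towards ρ = 1 for the hockey stick, whereas a linear-trend-plus-AR(1) series produces a clear interior minimum near the true ρ:

```python
import numpy as np

def pw1_cost(y, rho):
    """Residual variance of the PW(1) trend equation at a fixed rho."""
    n = len(y)
    t = np.arange(n, dtype=float)
    X = np.column_stack([np.ones(n), t])
    Xw = X[1:] - rho * X[:-1]
    yw = y[1:] - rho * y[:-1]
    b = np.linalg.pinv(Xw) @ yw
    r = yw - Xw @ b
    return np.dot(r, r) / len(r)

grid = np.linspace(0.0, 0.999, 50)

# (a) deterministic hockey stick: long slow decline, short steep recovery
hockey = np.concatenate([np.linspace(10.0, 8.0, 300),
                         np.linspace(8.0, 10.0, 36)])
cost_hockey = [pw1_cost(hockey, r) for r in grid]
# the minimum sits at the grid boundary rho -> 1: runaway

# (b) linear trend plus AR(1) noise (rho = 0.7): interior minimum
rng = np.random.default_rng(6)
n = 2000
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.standard_normal()
y = 1.0 + 0.01 * np.arange(n) + e
cost_ar1 = [pw1_cost(y, r) for r in grid]
```

Any HL/ML-type optimizer applied to case (a) will therefore be driven to the boundary ρ → 1, mirroring the divergence observed for the seven problematic stations.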

Water levels with PW(2)
The above results (Figs 3, 4, and S1 in the Supplementary material) clearly point to the invalidity of the PW(1) pre-whitening scheme for most of the series, which means that a linear model with AR(1) errors is not applicable. The study proceeds now by applying the PW(2) scheme and repeating the analyses of Figs 7-10. For Langer See Weisdin, Fig. 11 shows that the trend estimates are now much closer, with the pre-whitened trends being uniformly slightly less negative than the OLS estimate. And for Lake Peetschsee, as demonstrated by Fig. 12, the estimated trends for CO, HL, and ML are no longer out of proportion; but they are, nevertheless, still unrealistically positive (of almost the same magnitude as the negative Y trend).
Altogether, Fig. 13 shows that trend estimation now converges for all but three GW stations (Poratz, P42, and Bergfeld). And as in the example of Langer See Weisdin (see Fig. 11), for many of them, such as Wolletz (GW), P10 (GW), P15 (GW), P26 (GW), or Plötzensee, the trends become more consistent between the methods. Most of the significant trends are positive. In particular, for CO, only two out of the 40 trends are significantly negative as compared to 13 being positive; and for HL/ML there are four significantly negative and 10 significantly positive trends.

PW(2) vs. PW(1)
To compare the reliability and efficiency of the two pre-whitening schemes (PW(1) and PW(2)), the AR(1)- and AR(2)-surrogates of the Monte Carlo experiments are again employed. The analysis is confined to the CO method because the HL/ML estimates are, again, unstable for some stations. Figure 6 exemplified for the CO method and the station Müritz that the β estimate was unbiased for both surrogates. This extends in fact to all stations, and likewise to the PW(2) scheme (see Fig. S6 for the AR(1)-surrogates; the AR(2) case is identical and not shown). With respect to estimation efficiency, the two schemes were similar for the AR(1)-surrogates, but the AR(2)-surrogates were more efficiently estimated using the PW(2) scheme, especially for the very inefficient estimates for Hechtsee, Sprockfitz, or P12 (GW), as shown in Fig. S7. Estimates of ρ for PW(1) and of (φ1, φ2) for PW(2) are not directly comparable, so they are skipped here.
The overall improvement from increasing the pre-whitening order towards the ultimate goal of uncorrelated residuals is demonstrated in Fig. 14. It shows the lag-1 autocorrelation of the residual errors from method CO for both pre-whitening schemes PW(1) and PW(2). For the former, except for the Bruchmühle GW stations and one or two others, none of the residuals are white, often with autocorrelations well above 0.4. The PW(2) residuals, in contrast, are mostly consistent with white noise. (Significance results for Y are unreliable and therefore not reported. Using the unverified scaled Y variant for PW(1), the 11 significant trends - four negative and seven positive - reported by Lischeid et al. (2021) can be reproduced here, provided that the series are not de-seasonalised.)
Let us now return to the crucial issue of the stations for which no realistic trend was estimated (none of them had white residual noise). Recall that two different, potentially conflicting optimizations need to be performed: one for finding the best AR coefficients φ and one for finding the best trend coefficient β. The CO method alternates between the two, while HL/ML ignore the AR condition entirely and minimize the trend cost alone. Similar to showing the cost function (for trend) from PW(1) in Fig. 10, Fig. 15 plots the corresponding function of Equation (6) for PW(2); because that is a two-dimensional function, it is plotted over a line that connects the two different solutions from CO and HL. It shows, first, that HL is now at a clear minimum, unlike for PW(1), but also that CO is not. The discrepancy between CO and HL may be seen as a measure of the influence of the AR condition on CO.

Long memory and a hockey stick pattern
A striking impression of how far the estimated trend can deviate from the original non-pre-whitening estimate (provided by Y) was seen in Fig. 9 for Peetschsee. The observed series slowly decreases from 1985 to 2010 and then quickly rises back to the initial levels, a figure that resembles a hockey stick. The linear OLS estimate (given by method Y) reflects what would be a familiar trend fit. The HL/ML estimates, however, appear overly negative, and the CO trend, being even strongly positive, is anything but plausible.
Several station levels show this pattern of a long, slow decrease until around AD 2010, followed by a strong increase, and this hockey stick behaviour essentially coincides with the stations that show problematic trend estimates; they are displayed in Fig. 16 in a normalized way. A number of questions arise. First, one should acknowledge that under the main assumption of AR(p) noise (see Equation (1)), the synchronous occurrence of a pattern like that in the figure is close to impossible. Unless the pattern is created externally by a so far unknown force - a candidate being the monthly weather - the noise model is wrong. Many of the series, most notably the seven series of Fig. 16, may indeed not even have stationary trend residuals, contrary to what the partial autocorrelations of Fig. 4 suggest. Numerous attempts exist, mostly from the field of econometrics, to distinguish stationary from nonstationary behaviour (so-called unit-root tests; see Kennedy 2008, p. 19.5). All of them suffer from poor statistical power, however, especially for short and autocorrelated series as is the case here, so this route is not pursued further.

Figure 11. Monthly water levels for Peetschsee and trend estimates using the same methods as in Fig. 7, but for the PW(2) model.

Perhaps a more flexible approach is to consider "physical-looking" but entirely random processes. Unlike what is known as short memory, such as AR(p), such processes are characterized by long memory or long-range dependence; hockey stick fluctuations resembling Fig. 16 are not atypical of long memory, which may explain why it was first "discovered" in hydrological time series (Hurst 1951, Mandelbrot and Wallis 1968). Recognizing their potential to be mistaken for deterministic (physical, economic) long-term processes has considerably expanded the range of null hypotheses in corresponding statistical tests (Cohn and Lins 2005, Serinaldi et al. 2018).
Long memory is characterized by the Hurst parameter, H, being greater than 0.5, that number being the unique value for short-memory processes. Not surprisingly, however, unless the time series are very long, estimation of H is positively biased, and more so in the presence of autocorrelation (Wallis and Matalas 1970). This behaviour is displayed remarkably clearly in Fig. 17, which shows H estimates for the 40 water level series. For all stations H > 0.5, and the seven stations of Fig. 16 are among those with the largest H. However, doing the same with the short-memory surrogates yields very similar results; consequently, only for a few stations do the level estimates turn out to be significantly greater than their surrogate counterparts (based on the 95% confidence level from 10 000 runs). Unfortunately, the available data end in 2013, and based on this limited length no reliable judgement about the validity of any of the noise models can be made.

Figure 14. Autocorrelations of pre-whitened residuals for method CO, using PW(1) (blue) and PW(2) (green), with the confidence level for zero (grey).

Figure 15. Trend residual variance ("cost") for the station Poratz (GW) (black) using PW(2), that is, a function of the AR(2) coefficients (φ1, φ2), where φ2 = l(φ1) is parameterized as a line passing through the HL (φ1 = 0) and CO (φ1 = 1) solutions for the station.

Whatever the nature of the hockey stick may be, it has fairly drastic consequences for pre-whitening. A closer inspection of the original definition of PW(1), Δ_ρ y_t = y_t − ρ y_{t−1}, reveals that, in fact, because all values y are positive and 0 ≤ ρ < 1, negative trend contributions are damped relative to standard differencing and positive ones amplified; a similar argument applies to PW(2), and the effect grows as ρ approaches 1. As a result, the overly positive contributions after AD 2010 dominate the estimated trend for the whole pre-whitened series, thus explaining all of the implausible trend estimates seen above, and likely also the related divergence problems.
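The sign flip can be checked directly with a stylized hockey stick and a ρ fixed by hand (a synthetic illustration, not the station data): the raw OLS slope of the series is negative, while the slope estimated from the PW(1) equation with ρ near 1 turns positive, because the short steep rise dominates the pre-whitened series:

```python
import numpy as np

def ols_slope(y):
    """Plain OLS slope against a time counter."""
    t = np.arange(len(y), dtype=float)
    t = t - t.mean()
    return np.dot(t, y - y.mean()) / np.dot(t, t)

def pw1_slope(y, rho):
    """Trend slope from the PW(1) equation, pre-whitening both sides."""
    n = len(y)
    t = np.arange(n, dtype=float)
    X = np.column_stack([np.ones(n), t])
    Xw = X[1:] - rho * X[:-1]
    yw = y[1:] - rho * y[:-1]
    return (np.linalg.pinv(Xw) @ yw)[1]

# positive-valued hockey stick: 300 months of slow decline,
# 36 months of steep recovery
y = np.concatenate([np.linspace(10.0, 8.0, 300),
                    np.linspace(8.0, 10.0, 36)])

slope_raw = ols_slope(y)        # negative: the long decline dominates
slope_pw = pw1_slope(y, 0.99)   # positive: the steep rise dominates
```

The decomposition Δ_ρ y_t = (y_t − y_{t−1}) + (1 − ρ) y_{t−1} makes the mechanism explicit: the (1 − ρ) y_{t−1} term adds a positive offset everywhere, so segments with positive increments gain weight relative to the long, gently negative segment.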

Time asymmetry of pre-whitening
The temporal asymmetry that lies in the unbalanced effects of negative and positive trend contributions is also unveiled by an interesting twist of Bürger (2022). For that short comment, the backshift operator was incorrectly implemented as a foreshift operator, calculating Δ_ρ y_t = y_t − ρ y_{t+1} instead. This reversed the hockey stick effect and led to more consistent trends overall. Using the example of Peetschsee, the opposite effects are demonstrated in the final Fig. 18, which shows the PW(1)-estimated trends using the backshift and the foreshift operator. The incorrect implementation reveals, in reverse, the somewhat unsettling fact that pre-whitening is not generally invariant under time reversal. That invariance describes the familiar intuition that if a trended series is reversed in time, both series and trend are reversed - that is, they are given by a reflection at a vertical line through the centre of the series; a series with a growing trend would thus become one with a shrinking trend and vice versa. This symmetry holds for OLS-estimated trends if the errors are white; it is violated for pre-whitened trend estimates and errors of higher order such as AR(p), p ≥ 1.
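The failure of time-reversal invariance can be demonstrated with the same stylized hockey stick and a fixed ρ (again a synthetic illustration): the OLS slopes of the series and of its time-reversed copy are exact negatives of each other, whereas the PW(1) slopes are not; here both come out positive:

```python
import numpy as np

def ols_slope(y):
    """Plain OLS slope against a time counter."""
    t = np.arange(len(y), dtype=float)
    t = t - t.mean()
    return np.dot(t, y - y.mean()) / np.dot(t, t)

def pw1_slope(y, rho):
    """Trend slope from the PW(1) equation, pre-whitening both sides."""
    n = len(y)
    t = np.arange(n, dtype=float)
    X = np.column_stack([np.ones(n), t])
    Xw = X[1:] - rho * X[:-1]
    yw = y[1:] - rho * y[:-1]
    return (np.linalg.pinv(Xw) @ yw)[1]

y = np.concatenate([np.linspace(10.0, 8.0, 300),
                    np.linspace(8.0, 10.0, 36)])
y_rev = y[::-1].copy()

# OLS is invariant under time reversal: the slope simply flips sign
o_fwd, o_rev = ols_slope(y), ols_slope(y_rev)

# PW(1) is not: both directions yield a positive trend here
s_fwd, s_rev = pw1_slope(y, 0.99), pw1_slope(y_rev, 0.99)
```

The reversed series, which starts with the steep decline, would intuitively deserve a trend of opposite sign; that pre-whitening nevertheless assigns a positive trend in both directions is the asymmetry described above.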

Conclusions
Four pre-whitening solutions - CO, Y, HL, and ML - were applied to tackle the problem of memory (autocorrelation) in trend estimation. They were originally developed for order-1 pre-whitening (PW(1)) problems; in this study they were extended quite straightforwardly to the general case of order p ≥ 1.
The pre-whitening solutions were applied to a set of 40 monthly water level series (lakes and groundwater) from Northern Germany. The levels show very strong memory, going beyond AR(1) and in a few cases perhaps even beyond AR(2). Using ad hoc estimates of AR and trend parameters, a large ensemble of surrogate series was generated for each series, and the estimates of the prescribed trends were found to be reliable for both PW(1) and PW(2); estimator efficiency was generally better for PW(2). For the observations themselves, the majority of estimates turned out to be reasonable on the one hand and consistent between the four schemes on the other, the more so for PW(2). The sign of these trends was nevertheless not uniform across the stations (some trends decreasing, some increasing). It was found that PW(2) was sufficient for most level series, and no significant improvement is achieved by further increasing the pre-whitening order beyond p = 2. Generally, if the noise is AR(p), then PW(p) is the candidate of choice for pre-whitening.
Of the four methods, the method of Cochrane and Orcutt (1949) (CO), which is equivalent to Wang and Swail (2001) (see also Zhang and Zwiers 2004), turns out to be the most convincing, for two reasons: first, by applying pre-whitening to both sides of the trend equation it provides the logical connection between the original trend model and its pre-whitened version. And, second, the optimization iteratively and separately respects the autoregressive and trend requirements for finding the coefficients ρ and β.
Warnings have been raised repeatedly against other forms of pre-whitening, most notably against the procedure Y of Yue et al. (2002). In spite of that, Y is still the most applied procedure for trend pre-whitening, at least in hydrology; one may suspect this is because a very popular R package, zyp, exists (with 71 000 downloads at the time of writing). The method is not based on a noise model like CO or HL/ML and yields too many false positive trend detections. Fortunately, the package has recently been updated to use the Wang and Swail method as default.
For seven level series, all "plagued" by very strong autocorrelation, neither PW(1) nor any PW(p), p > 1, gave satisfactory results, with some estimates being strikingly off the non-pre-whitened (OLS) estimate. It was determined that a strong rise of the levels after AD 2010, giving the series the shape of a hockey stick, was responsible for the estimation failure. The reason is found in the basic equation of order-1 pre-whitening, y_t − ρ y_{t−1}: for generally positive values (such as the lake levels), pre-whitening has the effect that negative partial trends are damped and positive ones amplified, so that for a hockey stick the latter dominate the former. And this is just a drastic example of how pre-whitening generally entails the violation of a rather fundamental principle of function fitting, namely invariance under time reversal.
Unforeseen effects like this should serve as a warning that pre-whitening may in fact yield misleading and implausible results, and this holds regardless of the presence of long memory or nonstationarity, or even a hockey stick. If trend significance is of interest, other, more direct and simpler approaches such as variance correction methods are to be preferred. But then the questions of short memory and stationarity must be settled. And all these questions hinge, in turn, on the condition that the series under investigation are long enough relative to their autocorrelation. At least the seven problematic water level series are not.

Figure 18. Like Fig. 9 but for CO only, using normal pre-whitening PW(1) (solid) and the non-standard mode using a foreshift operator (dashed line).