Modeling Extreme Events: Time-Varying Extreme Tail Shape

Abstract We propose a dynamic semiparametric framework to study time variation in tail parameters. The framework builds on the Generalized Pareto Distribution (GPD) for modeling peaks over thresholds as in Extreme Value Theory, but casts the model in a conditional framework to allow for time-variation in the tail parameters. We establish parameter regions for stationarity and ergodicity and for the existence of (unconditional) moments and consider conditions for consistency and asymptotic normality of the maximum likelihood estimator for the deterministic parameters in the model. Two empirical datasets illustrate the usefulness of the approach: daily U.S. equity returns, and 15-min euro area sovereign bond yield changes.


Introduction
This article proposes a dynamic semiparametric framework to study time variation in tail fatness for long univariate time series.The new method builds on ideas from Extreme Value Theory (EVT) and uses a conditional Generalized Pareto Distribution (GPD) with time-varying parameters to approximate the tail beyond a given threshold.The GPD is an appropriate tail approximation for most heavy-tailed densities used in financial economics, econometrics, and actuarial sciences; see, for example, Embrechts, Klüppelberg, and Mikosch (1997), Coles (2001), McNeil, Frey, and Embrechts (2010, chap. 7), and Rocco (2014).As a result, the GPD plays a central role in the study of extremes, comparable to the role of the normal distribution when studying observations in the center of the distribution.Our framework allows for studying time-variation in the incidence of such extremes.
The time-varying tail shape and tail scale parameters in our model are driven by the score of the GPD density; see Creal, Koopman, and Lucas (2013) and Harvey (2013).As a result, the model is observation-driven in the terminology of Cox (1981) and its time-varying parameters are perfectly predictable one step ahead.In addition, the log-likelihood function is known in closed form and allows for parameter estimation and inference via standard maximum likelihood methods.Our results show that our model is able to recover the time-varying tail shape and tail scale parameters well in both simulated and empirical data.In addition, the model recovers time-variation in EVT-based market risk measures such as Value-at-Risk (VaR) and Expected Shortfall (ES).This is the case even if the model is misspecified CONTACT André Lucas a.lucas@vu.nlVrije Universiteit Amsterdam, De Boelelaan 1105, 1081 HV Amsterdam, The Netherlands.Supplementary materials for this article are available online.Please go to www.tandfonline.com/UBES.
or the GPD approximation is not exact.The latter is particularly important in our finite sample setting, where the limiting EVT result of the GPD can only hold approximately given the choice of a finite exceedance threshold in any particular dataset.
We illustrate our modeling framework using two applications: U.S. equity log-returns, and changes in euro area sovereign bond yields.1Each dataset consists of two time series that we model separately.We first consider daily log-returns for a stock index (S&P500) and an individual stock (IBM).The S&P500 log-returns range from July 3, 1962 to December 31, 2020, while the IBM stock returns range from January 2, 1926 to December 31, 2020.Focusing on the left tail, and controlling for potential time-variation in the EVT thresholds, we find that both GPD parameters vary significantly over time.The tail shape varies between approximately 0.05 and 0.25 for the S&P 500, and between 0.05 and 0.35 for IBM.These values imply a maximum moment of order 1/0.25 = 4 to 1/0.05 = 20 for the S&P 500, and 3 to 20 for IBM.Confidence bands around the filtered time-varying tail shape parameters suggest that the tail shape parameter is almost always far from the thin-tailed setting.
Our second illustration demonstrates how additional control variables can be included to capture time-variation in tail shapes and tail scales.Specifically, we study changes in Italian and Portuguese five-year benchmark bond yields, sampled at a 15-min frequency, during the extremely adverse euro area sovereign debt crisis between 2010 and 2012.Again, we find that both GPD parameters vary significantly over time.The tail shape varies between 0.1 and 0.4 for Italy and between 0.05 and 0.6 for Portugal, implying moment ranges of 2.5-10 and 1.7-20, respectively, and thus incidences of extreme fat tails.We also find that part of the variation can be explained by including central bank bond purchases as an additional covariate, and provide a way to translate the estimated impact coefficients into their economically-interpretable impact on market risk measures such as VaR.
Our article is closely related to a growing strand of literature on modeling time-variation in EVT tail parameters.Several papers propose methodology to study time variation in the tail index.Davidson and Smith (1990), Coles (2001, chap. 5.3), and Wang and Tsai (2009), among others, also index the GPD tail parameters with time subscripts and equip them with a parameterized structure.Our approach is different in that their "tail index regression" approach requires conditioning variables that explain (all of) the tail variation.Such variables are not always available.By contrast, our "filtering approach" does not require such conditioning variables, and is arguably better suited for the real-time monitoring of extreme equity or bond market risks.Second, Quintos, Fan, andPhillips (2001), Einmahl, de Haan, andZhou (2016), Hoga (2017), and Lin and Kao (2018) derive formal tests for a structural break in the tail index.A number of subsequent studies applied such tests to financial time series data.Werner and Upper (2004) identify a break in the tail behavior of high-frequency German Bund future returns.Galbraith and Zernov (2004) argues that certain regulatory changes in U.S. equity markets have altered the tail index dynamics of equities returns.de Haan and Zhou (2021) propose a nonparametric approach to estimating the extreme value index locally.Our article adds to the literature by proposing a framework that allows us to study both the tail shape and tail scale dynamics directly in a semiparametric way.Explanatory covariates can be included in the dynamics of both parameters, and likelihood ratio tests are available to test economically relevant hypotheses.Finally, unlike Patton, Ziegel, and Chen (2019), our tail VaR and ES dynamics explicitly account for fat tail shape beyond a threshold as emerging from EVT.The dynamics based on the score for the GPD contain weights for extreme observations.Such weights are absent in the elicitable score functions of Patton, Ziegel, and Chen (2019).The resulting dynamics in our model are more robust, particularly for the ES.
Whereas de Haan and Zhou (2021) take a nonparametric perspective, the methodological part of this article is closest to Massacci (2017), who also proposes a dynamic parametric model for the GPD parameters.Our framework is different from Massacci (2017) in that we specify both GPD parameters as functions of their respective scores.Massacci (2017), by contrast, uses only the score of the first (tail index) parameter to drive both parameters.This is not optimal in the sense of Blasques, Koopman, and Lucas (2015), who require that each time-varying parameter is associated to its own score.Our article further differs from Massacci (2017) in that we suggest time-varying EVT thresholds to locate the boundary between the center of the distribution and its tail.This implies that we do not need to assume that the time series at hand has no volatility clustering, nor that we need to pre-filter for such volatility clustering.Absence of conditional heteroscedasticity would be hard to defend for the financial data considered in this article.Finally, we differ from Massacci (2017) in that we discuss inference on both deterministic and timevarying parameters by considering the asymptotic properties of the maximum likelihood estimator, provide sufficient conditions for the stationarity and ergodicity and the existence of moments of the factor process and observations, explain how to introduce additional conditioning variables into the model and assess their usefulness in economic terms, and provide Monte Carlo evidence on the model's performance in a range of challenging settings.
We proceed as follows.Section 2 presents our statistical model, the asymptotic statistical properties of the model and the maximum likelihood estimator.Section 3 discusses simulation results.Section 4 illustrates the model using U.S. equity data and euro area sovereign yields data.Section 5 concludes.Derivations and more results are in the web appendix.

Conditional EVT Framework
Consider a univariate time series y t , t = 1, . . ., T, where T denotes the number of observations.This section then describes our model for y t with time-varying tail shape and scale.We assume y t is generated by a conditional probability density function (pdf) g(y t | F t−1 ), where F t−1 = {y t−1 , y t−2 , . . ., y 1 } denotes the information set containing all past data.By keeping the conditional density of y t in its current general form, we stay close to the semiparametric nature of an extreme value-based approach and make no modeling assumptions about the center of the distribution.Alternatively, we could specify a parametric distribution for y t with for instance a time-varying conditional location μ t and scale σ t .The μ t and σ t could then be used to pre-filter the raw y t .We do not pursue such an approach here.First, modeling the center of the conditional distribution leads away from the focus on the tails only, which is the key aspect in an EVT-based approach.Second, designing an additional model for time-varying location and scale would create another layer of complexity to the model, with accompanying model risk and parameter uncertainty.We therefore keep the general form of g(y t | F t−1 ) and focus on its tail using a dynamic extension of arguments from extreme value theory (EVT), similar to Patton's (2006) extension of copula theory to the dynamic, observation driven setting.
We assume g(y t | F t−1 ) has heavy tails with time-varying tail index α t > 0. A familiar example is when g(y t | F t−1 ) is a univariate Student's t distribution with ν t = α t degrees of freedom.Other examples include the Pareto, inverse gamma, log-gamma, log-logistic, F, Fréchet, and Burr distribution with one or more time-varying shape parameters (for details and further discussion see e.g., Johnson, Kotz, and Balakrishnan 1994;Embrechts, Klüppelberg, and Mikosch 1997;McNeil, Frey, and Embrechts 2010, chap. 7.3).Rather than modeling the (dynamic) tail shape of an arbitrarily chosen parametric family of distributions, we appeal to well-known results from the EVT literature: the conditional cumulative distribution function (cdf) G(y t | F t−1 ) of y t can under very general conditions be approximated by ))P(x t ; δ t , ξ t ) with x t = y t − τ t for sufficiently high thresholds τ t ∈ R + .More precisely, we have lim for parameters ξ t = α −1 t and δ t , both possibly depending on τ t , and with Y t denoting the random variable corresponding to the realization y t .Here, P(x t ; δ t , ξ t ) denotes the cdf of the Generalized Pareto Distribution (GPD), with cdf and pdf given by , e.g., McNeil, Frey, and Embrechts 2010).The quantity x t = y t − τ t > 0 is the so-called peak-overthreshold (POT), or exceedance, of heavy-tailed data y t over a pre-determined threshold τ t , and δ t > 0 and ξ t > 0 are the tail scale and tail shape parameter of the GPD, respectively.Most continuous distributions used in statistics and the actuarial sciences lie in the Maximum Domain of Attraction (MDA) of the GPD (see McNeil, Frey, and Embrechts 2010, chap. 7.1), meaning that they allow for the above tail shape approximation.By focusing on the tail area directly using EVT arguments, we avoid having to make more ad-hoc assumptions on the parametric form of the tail.The result in (1) is a limiting result.In any finite sample, the threshold τ t has to be set to a specific, finite value, such that the GPD approximation will be inexact and the distribution is in that sense misspecified.This will also be the case in our setting.The score-driven updates that we define later on for ξ t and δ t , however, still ensure that the expected Kullback-Leibler divergence between the approximate GPD model and the true, unknown conditional distribution is improved on average at every step for sufficiently small steps, even if the GPD model is (partially) misspecified; see Blasques, Koopman, and Lucas (2015).
The choice of the threshold τ t is subject to a well-known bias-variance tradeoff; see, for instance, McNeil and Frey (2000).In theory, the GPD tail approximation only becomes exact for τ t → +∞.A high threshold, however, also implies a smaller number of exceedances y t > τ t , and more estimation error for the parameters of the GPD.Common choices for τ t from the literature are the unconditional 90%, 95%, and 99% empirical quantiles of y t ; see Chavez-Demoulin, Embrechts, and Sardy (2014).In our setting, such choices are less useful as τ t varies over time.As we explain later, we use the approach of Patton, Ziegel, and Chen (2019) to set τ t dynamically in line with the data.

Time-Varying Parameters
A key step in (1)-( 2) is that we use the conditional probabilities based on the information set F t−1 .As a result, the tail parameters ξ t and δ t become time-varying.To capture this timevariation, we model (ξ t , δ t ) using the score-driven dynamics introduced by Creal, Koopman, and Lucas (2013) and Harvey (2013).In our time series setting, this implies that both δ t and ξ t are measurable with respect to F t−1 .We ensure positivity of δ t and ξ t by using an (element-wise) exponential link function The transition dynamics for f t are given by where vector ω = ω(θ) and matrices depend on the deterministic parameter vector θ , which needs to be estimated.The scaling matrix S t may depend both on θ , f t , and F t−1 .Effectively, the recursion (4) updates f t at every point in time via a scaled steepest ascent step to improve the expected fit to the GPD; see Creal, Koopman, and Lucas (2013) and Blasques, Koopman, and Lucas (2015).The score of (2) required in ( 4) is given by where ln(•) denotes the natural logarithm; see Web Appendix A.1 for a derivation.We take A i and B j as diagonal matrices.
Following Creal et al. (2014) we select the square-root inverse conditional Fisher information of the conditional observation density to scale (5), that is, S t = L t , with L t the choleski decomposition of the inverse conditional Fisher information matrix Compared to so-called inverse information matrix scaling, the current scaling matrix has the advantage that the conditional variance of the scaled score s t is the unit matrix, that is, E[s t s t | F t−1 ] = I 2 .This gives the parameters A i a more natural interpretation, similar to the standard deviations of the state innovations in a nonlinear state-space model.For the GPD, we have see Web Appendix A.2 for a derivation.Combining terms yields the scaled score Though the first element of the scaled score in (7) seems unstable for ξ t near zero, the expression actually has a finite left limit equal to lim The first element (left panel) and second element (right panel) of s t in ( 7) is plotted against x t for different values of ξ t and δ t .
Figure 1 plots the two elements of ( 7) as a function of x t for different values of ξ t and δ t .The behavior of the scaled score is intuitive: large x t s imply that both ξ t and δ t are adjusted upwards and tails thus become fatter.The adjustments are largest when the current tail index ξ t and tail scale δ t are low.It is precisely in such cases that observing a large x t is unlikely.If it occurs nonetheless, the parameters are (strongly) adjusted to account better for similar effects in the future.
The news impact curves become increasingly concave for lower values of ξ t .For such parameters, large x t values are already more likely due to the fat-tailed nature of the GPD itself.As a result, the parameters need to be adjusted less if a large x t actually materializes.This resembles the well-known robustness properties of score-driven updates in the context of time-varying volatility modeling; see Creal, Koopman, and Lucas (2013) and Harvey (2013).It also distinguishes our current set-up sharply from an approach directly based on quantile functions; see Patton, Ziegel, and Chen (2019) and Catania and Luati (in press), in particular for risk measures such as ES.In Patton, Ziegel, and Chen (2019), ES reacts linearly rather than concavely to the VaR exceedance.This can result in noisy or unstable ES estimates.Figure 1 illustrates that the score-driven approach is less susceptible to such instabilities and can therefore result in more interpretable parameter paths.
We also note that small realizations of x t imply downward adjustments of both ξ t and δ t , up to the point where x t becomes very small.For very small x t > 0, ξ t is adjusted upward: observations near the center of a fat-tailed distribution signal increased peakedness (=leptokurtosis) and thus higher ξ t ; compare Lucas and Zhang (2016) for a similar effect in the Student's t setting.
When there is no tail observation, that is x t = y t − τ t ≤ 0, the new observation carries no information about ξ t and δ t ; see (McNeil, Frey, and Embrechts 2010, Chapter 7).In such cases we set the score to zero and continue to use (4) to update f t .Long consecutive stretches of zero scores could potentially lead to mean-reverting paths for f t , and thus (ξ t , δ t ), with only discrete "jumps" when new observations x t > 0 arrive following such stretches.2If so, smoothing the scaled score (7) can help to spread out the impact of the new information in x t .Smoothing the scaled scores can also help when additional conditioning variables z t are available at every t = 1, . . ., T; see ( 9) and ( 11) below.Lagged values of the scaled score can be taken into account via an exponentially-weighted moving average specification where λ ∈ [0, 1) is an additional parameter to be estimated or to be fixed ex-ante; see Web Appendix A.3 for the derivation.This alters the values of (p, q) = (1, 1) in (4) to (p, q) = (2, 1).While s t is most often zero, st is not.Clearly, (4) is a special case of (8) for λ = 0.The smoothing approach in (8) is similar to the approach in Patton (2006), who uses up to 10 lags of the driver (in our case the score) to smooth the dynamics of the time-varying parameter.
The transition equation for f t can be extended further if additional conditioning variables are available by respecifying (8) as where all explanatory variables are stacked into the column vector z t , and C is a conformable matrix of impact coefficients that needs to be estimated.We illustrate this in our second application in Section 4.

Time-Varying Thresholds
For the dynamic thresholds τ t , we use the specification suggested by Patton, Ziegel, and Chen (2019), where ω τ ≡ (1 − b τ ) • qκ , qκ is the (observed) unconditional κ-quantile of y t , a τ > 0 and 0 < b τ < 1 are two parameters to be estimated, and (1 − κ) is a sufficiently small tail probability corresponding to the dynamic quantile τ t , such as, for example, 10% or 5%.We initialize τ t at τ 1 = qκ .The recursive specification ( 10) is driven by a zero mean innovation process since responds to quantile exceedances in an intuitive way: the next quantile value τ t+1 receives a positive shock of a τ κ if y t > τ t , that is, if the previous quantile was exceeded, and a negative shock of −a τ (1 − κ) otherwise.For 0 < b τ < 1, the empirical unconditional quantile qκ serves as a long-term attractor for (10).The transition equation for τ t can also easily be extended to include exogenous variables z t as in for a suitable column vector of coefficients c τ .

Interpretation of Time-Varying Parameters
It is important to briefly comment on parameter interpretability.The tail shape parameter ξ t can always be interpreted as observation y t 's inverse conditional tail index α −1 t .By contrast, the estimated tail scale parameter δ t need not have a straightforward interpretation in terms of y t 's conditional scale σ t .For example, assume that y t has a GPD conditional distribution with timevarying tail shape parameter α −1 t and scale σ t .Web Appendix B.1 demonstrates that the conditional tail probability then also has a GPD shape (exactly, not only approximately).The tail shape parameter is the same as that of the center: ξ t = α −1 t .However, the tail scale parameter δ t is very different from the scale parameter σ t that applies to the center, in particular δ t = σ t + α −1 t • τ t .As a result, δ t increases with the threshold τ t , varies positively with the tail shape parameter ξ t , and, importantly, should not be expected to provide a good estimate of the scale parameter σ t that applies to the center of the distribution of y t .A similar result can be derived if y t were Student's t-distributed with scale σ t and degrees of freedom parameter α t when the tail probability P[Y t ≤ y t + τ t | Y t > τ t , F t−1 ] only has an approximate GPD shape; see Web Appendix B.2.We return to this issue in our simulation Section 3, where we consider pseudo-true values for ξ t and δ t to benchmark how well the model performs in terms of tracking an unknown data generating process.

Stationarity and Moments
The score-driven dynamics for ξ t and δ t are highly nonlinear.Still, the structure of the model allows us to obtain sufficient conditions for the stationarity and ergodicity (SE) of f t and x t , and for the existence of unconditional moments of f t .To this end, we look at our statistical model in ( 4)-( 8) as a data generating process (DGP).
Given the bivariate structure of our time-varying parameter model, the process f t t∈Z can be viewed as a stochastic recurrence equation (SRE) of the form where : R 2 × R + × → R 2 is a Borel measurable function, and θ 0 ∈ is the true parameter vector contained in the parameter space ⊂ R 6 .We make the following two assumptions.
Assumption 1. Assume that x t is drawn from the GPD distribution defined in (2), such that the random variable t := ξ −1 t ln (1 + ξ t x t /δ t ) is an independent and identically distributed noise term with unit exponential distribution, t iid ∼ Exp(1).
Assumption 2. For some integer r ≥ 1, let where M = √ trace (M M) is the Frobenius norm of a realvalued matrix M ∈ R m×n , and where ˙ t f t ; θ 0 := B+A ∂s t ∂s t with Assumption A1 considers the model as the data generating process.By inverting the GPD cdf in (2), we obtain that t = ξ −1 t ln (1 + ξ t x t /δ t ) = − ln(1 − u t ) ∼ Exp(1) for a standard uniformly distributed u t .Assumption A1 requires that these uniform random variables constitute an iid process.Assumption A2 requires contraction properties of the bivariate stochastic recurrence equation.We note that we use the general form of the r-fold contractions, that is, r iterations of the transition equation of f t .In univariate models, sharp contraction conditions (that ensure stationarity and ergodicity of the model) can usually be found by assuming that r = 1; see, for example, Blasques et al. (2022).In multivariate systems, however, the contraction condition with r = 1 is often violated, resulting in very small or uninteresting stationarity regions.Working with the more general condition is therefore important.We also show this in Figure 2. The idea behind A2 is that, when working with a multivariate system, one can still ensure the existence of an SE solution if the system becomes contractive eventually, that is, after a sufficiently large number of r iterations.The contraction condition in A2 results in a meaningful (nondegenerate) SE region, because the supremum of ˙ t f t ; θ 0 over f has no degeneracies.
Using A1 and A2, we can verify the conditions of (Bougerol 1993, Theorem 3.1).This allows us to show that a unique SE solution exists for the bivariate score-driven process {f t } t∈Z , and for the data {x t } t∈Z as generated by the DGP in Section 2.1.We summarize this in the following theorem.
Theorem 1.Consider the model as defined by ( 2) and (4).Then under A1 and A2, the SRE in (12) admits a unique stationary and − − → denotes exponentially fast and almost sure convergence (Straumann and Mikosch 2006).In addition, the process {x t } t∈Z generated by the model evaluated at θ 0 is stationary and ergodic.
Web Appendix C.1 presents the proof of Theorem 1. Next to the SE properties of the f t process, we can also establish the existence of unconditional moments.This can be useful for proving the existence of moments of the log-likelihood function and its derivatives.We note that the contraction condition stated in A2 by its own is insufficient to ensure bounded unconditional moments of f t in the DGP.Requiring the existence of moments typically makes the admissible parameter space smaller.We also note that the existence of unconditional moments of f t does not imply the same (un)conditional moments for x t .For instance, even if f t has a finite 4th order moment, x t may not if ξ t can reach levels higher than 1/4.We have the following result.
Theorem 2. Consider the model as defined by ( 2) and (4), and let A1 and A2 be true.If in addition for some p ≥ 1, then the unique stationary and ergodic solution { ft } t∈Z to the SRE in ( 12) satisfies E[ ft p ] < ∞.
Web Appendix C.2 provides the proof of Theorem 2. Again, Theorem 2 makes use of r-fold contraction conditions rather than the standard r = 1 case.
To give some insight into the size of the SE and moments regions, we compute them numerically in Figure 2. We let two (out of the six) parameters in vary at a time.Computing the regions is far from trivial.It requires numerically solving a maximization problem embedded inside an integration problem.As a result, the maximization problem has to be solved for every value of the integration variable.Details are provided in Web Appendix D.
Figure 2 clearly illustrates the importance of multiple unfoldings of the SRE.If only r = 1 iteration is used, the SE and finite-moments regions are typically small, or even empty; see for instance the small darkest blue region in panel 2(a) for r = 1, or the fact that panel 2(d) only shows contracting behavior for r ≥ 17 iterations given the empirical estimates.For r = 1, empirical estimates typically lie outside the SE region.However, by iterating the SRE forward, the SE and finite-moments regions grow considerably to the extent that they also encompass the empirical estimates.For instance, the model may not be SE when evaluated at the empirical estimates for r = 1.For r = 20, 40, or even larger, however, the model's SE region at the empirical estimates increases such that even fourth order moments of f t exist.Given the S&P500 data (on which Figure 2 is based) are at a daily frequency, we conclude that a time-varying parameter model may not have guaranteed contraction properties at the daily frequency, but may still have such contraction properties at a monthly (r = 22) or lower frequency.The numerical results in the figure stress that analytical conditions for SE, as often encountered in the literature for r = 1, may be quite uninteresting if not accompanied by a numerical check for the size of the resulting parameter space.

Parameter Estimation
Observation-driven time series models such as (1)-( 11) have the appealing feature that the log-likelihood is known in closed form.For a given set of time series observations x t = 1{y t > τ t }•(y t −τ t ) for t = 1, . . ., T, the vector of unknown parameters θ can then be estimated by maximizing the log-likelihood function of the GPD with respect to θ .The average log-likelihood function is given by where T * = T t=1 1{x t > 0} is the number of POT values in the sample.Maximization of ( 16) can be carried out using a conveniently chosen quasi-Newton optimization method.
To establish consistency and asymptotic normality of the maximum likelihood estimator (MLE), we define the MLE θ as We establish these asymptotic properties by showing that the conditions for Lemma 1 in Jensen and Rahbek (2004) are satisfied.The main point to note is that the first three derivatives of the log-likelihood function (and of f t ) with respect to θ are stationary and ergodic stochastic recurrence equations that satisfy specific moment conditions.We make the following additional assumption on the parameter space .
Finally, we impose conditions on the score vector ∇ t , as defined in (5), and its derivatives up to third-order.Let ∇ t (θ ), ∇ 2 t (θ ) and ∇ 3 t (θ ) denote the score vector and first two derivatives with respect to f ξ t (θ ) and f δ t (θ ), evaluated at some θ ∈ V(θ 0 ), where V(θ 0 ) denotes a small neighborhood of the true parameter vector θ 0 .By imposing the conditions in Assumption 4, we can rule out explosive behavior of the first three derivatives of the log-likelihood function with respect to θ .Similar boundedness assumptions have been used in for instance Hetland, Pedersen, and Rahbek (in press) or Hafner and Wang (in press).Effectively, this further reduces the size of the parameter space (see also Blasques et al. 2022).Web Appendix E.1 provides the explicit analytic expressions for each of these derivatives.
Assumption 4. The score vector ∇ t (θ ), and its derivatives ∇ 2 t (θ ), and ∇ 3 t (θ ) are p-dominated uniformly in V(θ 0 ) and t by dominating functions D 1,t , D 2,t , and D 3,t , respectively, such that sup where D 1,t , D 2,t , and D 3,t are p-integrable uniformly in t for p > 0, that is The following theorem establishes the asymptotic properties of our MLE.
Theorem 3. Consider the model as defined by ( 2) and ( 4).Let Assumptions 1-3 hold true.Then θ P − → θ 0 as T * → ∞.If, in addition, Assumption 4 holds true, as well as the contraction condition (15) in Theorem 2 for p = 4, then where I denotes the Fisher information matrix evaluated at the true parameter vector θ 0 .
The proof of Theorem 3, together with detailed derivations of the derivatives up to the third-order of the log-likelihood function L(θ |F T ), and of the (bivariate) score-driven process {f t (θ )} t∈Z , for each θ ∈ can be found in Web Appendix E.2.
All our results are conditional on the time-varying thresholds τ t .The parameters a τ and b τ for τ t in (10) cannot be estimated using ( 16).Another objective function is needed in this case.We suggest using the average quantile regression check function of (Koenker 2005, chap. 3).The optimization problem can be formulated as min where ρ κ (u t ) = u t (κ − 1{u t < 0}), and τ t evolves as in (10).See also Engle and Manganelli (2004) and Catania and Luati (in press) for the use of this objective function in a different dynamic context.In practice, we estimate all thresholds τ t via (18) before maximizing (16).

Confidence Bands for Tail Shape and Tail Scale
Given the maximum likelihood estimate θ , confidence (or standard error) bands around ft = f t ( θ) allow us to visualize the impact of estimation uncertainty.Quantifying the uncertainty of the estimated parameter paths is important, as classical EVT estimators of time-invariant tail shape parameters can have sizeable standard errors; see for example, Hill (1975) and Huisman et al. (2001).Web Appendix F explains how simulationbased and in-sample analytic confidence bands around ft can be obtained.These bands are conditional on the estimated timevarying thresholds τt , and do not incorporate the associated estimation uncertainty of the thresholds.

Market Risk Measures
Market risk measurement is a major application of EVT methods in practice; see for example, McNeil, Frey, and Embrechts (2010).
The GPD approximation ( 1)-( 2) yields useful closed-form estimators of the VaR and ES for high upper quantiles γ > G(τ t | F t−1 ); see McNeil and Frey (2000) and Rocco (2014).We can estimate the 1 − γ tail probability of y t based on the GPD cdf for Consequently, paths (1) considers the special case of timeinvariant tail shape and scale parameters.Naturally, we would want our dynamic framework to cover constant parameters as a special case.Paths (2) allows the tail shape to vary considerably between 0.2 and 0.8, while keeping the scale (volatility) of the data constant.Paths (3) stipulates that both parameters vary over time.Finally, paths (4) considers the case of synchronized variation in both parameters.Next, we consider three ways to construct the thresholds τ t .First, we use the true time-varying 95%-quantile based on our knowledge of the true density and of α t and σ t .This constitutes an infeasible best benchmark.Second, we construct τ t as the 95%-quantile of the expanding window of data up to time t, that is τ t = Q 0.95 1:t {y 1 , . . ., y t } .Finally, we use the recursive specification (10), initialized at the full-sample quantile τ 1 = Q 0.95 1:T .
Our main evaluation metric for evaluating model performance is the root mean squared error RMSE = where ξ s t is the estimated tail shape parameter in simulation s, ξ s t is the corresponding (pseudo-)true tail shape, s = 1, . . ., S denotes the simulation run, and t = 1, . . ., T is the number of observations in each draw.The RMSE for the tail scale parameter δ t is obtained analogously.The pseudo-true values ξ s t and δs t are obtained by numerically minimizing the Kullback-Leibler divergence between the GPD and the data generating process beyond the true time-varying 95% quantile τ t .As the true conditional density is known at all times in a simulation setting, these pseudo-true benchmarks are easily computed numerically for each s and t.Particularly the GPD tail scale parameter δt may have very different dynamics from σ t , as it combines dynamics in α t and σ t via the EVT limiting expression in (1); see also Section 2.1.4.

Simulation Results
For the first set of DGPs, we are particularly interested in two issues: first, what is the effect of increasing misspecification by  4), in rows).Thresholds τ t , τt , and τ * t denote, respectively, (i) the infeasible true time-varying threshold, (ii) the empirical quantile associated with an expanding window of observations y 1 , . . ., y t , and (iii) the estimated conditional quantile using (18) and a suitably calibrated a τ = 0.25 to speed up the computations.We consider 100 simulations for each DGP, and a time series of 25,000 observations in each simulation.Model performance is measured by the RMSE from the true ξt and δt in each draw.series data is here generated as y t ∼t(0, σ t , α t ), where α −1 t 0.5 + 0.3 sin(4π t/T) and σ t = 1 + 0.5 sin(16π t/T).This is Path 3 in Section 3.1.Pseudo-true parameter are reported in solid red.The four panels report estimates of ξ t , δ t , VaR t , and ES t , respectively.Median filtered values are plotted in solid black.The first two panels also indicate the lower 5% and upper 95% quantiles of the estimates (black dots).The time-varying threshold τt is estimated based on the recursive specification (10) in conjunction with the objective function (18).moving from a GPD to a Student's t density in the data generating process of y t , and second, how accurately can we recover high-confidence market risk measures when the conditional GPD density is only approximately correct.
Table 1 presents the corresponding results.It reports RMSE statistics for tail shape ξt and tail scale δt .Figure 3  We focus on three main findings.First, all models seem to work well in recovering the true underlying ξ t and δ t dynamics.The median estimates in Figure 3 tend to be close to their (pseudo)-true values.The full results in in Web Appendix H.1 confirm this.Even the highly nonlinear patters of δ t are recovered well.The model also captures the peaks of ξ t , which correspond to the episodes with the fattest tails.The model needs some time to recognize that the extreme tail has become more benign, that is, that ξ t has gone down.The good fit is corroborated by Table 1.We also note that both estimation methods for τ t only loose about 10% RMSE for ξ t and δ t compared to the use of the true (infeasible) τ t .
Second, when comparing the results for ξ t and δ t based on the recursive estimate τt and the dynamic estimate τ * t of Patton, Ziegel, and Chen (2019), Table 1 shows differences are mostly small and insignificant.If there is no time-variation (Path 1), the estimates of δt based on a recursive τt fare slightly better (as expected).The converse is true for if the true parameters vary over time (Paths 2-4).
the statements above (6).Parameters b ξ and b δ are estimated to be close to one for both series, implying that shocks to each time-varying parameter die out only slowly.A numerical check reveals that the deterministic parameters a ξ and b ξ lie within the SE region implied by the sufficient conditions of Theorems 1 and 2. A diagnostic check in Web Appendix J suggests that the bivariate filter (4) is also invertible at these estimates.

Tail Parameter Estimates
Figure 4 presents the raw log-returns (top panels) along with filtered estimates of ξ t and δ t (middle and bottom panels).The filtered tail shape varies between approximately 0.05 and 0.25 for the S&P 500 index, and between approximately 0.05 and 0.35 for IBM.The filtered tail scales vary roughly between 0.2 and 2.0 for the S&P 500 and IBM.The confidence bands around each filtered parameter suggest that both are reasonably precisely estimated, and that the tail shape parameter is often far from zero.The reported confidence bands are conditional on the estimated thresholds τt .
The filtered estimates of ξ t and δ t suggest that one-off, extremely negative returns affect the filtered tail shape more than the filtered tail scale.Longer-lasting crises, by contrast, appear to affect the tail scale more than the tail shape.For example, the stock market crash on October 19, 1987, also known as "Black Monday, " considerably increases ξt but not δt .Similarly, the 2010 "flash-crash" on May 6, 2010 increases ξt more than δt .By contrast, the global financial crisis between 2008 and 2009, and the Covid-19 pandemic recession in early 2020, both temporarily increase δt while leaving ξt less affected.
Figure 4 also plots estimates of VaR t and ES t at a 99% confidence level.We checked that indeed 1.0% of the log-returns lie beyond VaR t in the case of the S&P 500 index (0.9% for IBM stock).The average value of −y t conditional on it exceeding its VaR is −3.54% for the S&P 500, and −4.85% for IBM.These values are approximately in line with the time series average of ES t at −3.12% for the S&P 500, and −4.59% for IBM, respectively.

Changes in Sovereign Yields
In our second application we illustrate the inclusion of explanatory variables in the dynamics of ξ t and δ t as in (9).To do so, we study whether there was any tail risk impact of central bank asset purchases on changes in Italian (IT) and Portuguese (PT) fiveyear bond yields between 2010 and 2012.Both Italy and Portugal were in the "epicenter" of the existential euro area sovereign debt crisis at that time; see for example, Eser andSchwaab (2016), andGhysels et al. (2017).Italy is an example for a large euro area country that was affected by the crisis relatively late (in mid-2011), and that benefited from Eurosystem bond purchases only during a relatively short period of time, between August 2011 and March 2012.Portugal, by contrast, is an example for a smaller euro area country that was affected relatively early (already in 2010), and that benefited from Eurosystem bond purchases more uniformly over time, between May 2010 and March 2012.
Eurosystem bond purchases undertaken during the sovereign debt crisis predominantly targeted the 1-to 10-year maturity bracket, with the 5-year maturity approximately in the middle of that spectrum; see for example Eser and Schwaab (2016).
We consider the impact on five-year sovereign benchmark bond yields for this reason.The bond yields y t are sampled at the 15-min frequency, between 9 a.m. and 5 p.m., and are obtained from Bloomberg.We do not consider changes in yield, that the first 15-min interval covers a.m. to 9:15 a.m.Our sample ranges from January 04, 2010 to December 31, 2012.This yields 32 intra-daily observations per trading day, with T ≈ 3 × 260 × 32 ≈ 25,000 per country.
Finally, we construct time data z of country-specific bond purchases at the high (15-min) Observations z t contain all sovereign bond purchases at par (nominal) between t − 1 and t for the respective country, not only purchases of the five-year benchmark bond.Disaggregated data on Eurosystem SMP purchases at a high-frequency are still confidential at the time of writing.At the end of our sample, the Eurosystem held e99.0bn in Italian sovereign bonds and e21.6 bn in Portuguese bonds; see the ECB Annual Report 2013.We including these as an additional conditioning variable in (9) to see whether they mitigated extreme tail behavior or not.

Deterministic Parameter Estimates
We continue to rely on the time-variation in the thresholds τ t to control for time-variation in any parameters describing the center of the distribution.The thresholds now evolve according to (11), also taking account of the SMP purchases z t .We choose (1 − κ) = 10%, and initialize τ 1 = q0.9, the 90% empirical quantile of y t .
Analyzing changes in the tail shape and tail scale parameters for high-frequency data is challenging given the high persistence of at such frequencies.We therefore introduce two simplifications.First, we fix the smoothing parameter λ at 0.05 1/32 ≈ 0.911, such that 95% of the smoothing materializes within one day (i.e., 32 15-min intervals).As there are only loglikelihood contribution for the GPD for x t = z t − τ t > 0, it is hard to identify λ empirically.Second, to keep the model in line with the theory from Section 2.2, we fix b ξ and b δ to a value close to the unit boundary, as otherwise they are estimated indistinguishably different from 1.We choose (b ξ ) 32 = (b δ ) 32 = 0.9997, which implies a similar persistence level as (IBM) equity returns at a daily (= 8h = 32 quarters) frequency.Fixing λ, b ξ , or b δ at other reasonable values has little effect on the empirical findings.
Columns three to six of Table 2 present our estimates of the deterministic parameters for the model ( 1)-( 11).Columns three and five refer to a baseline model without central bank purchases.Table 2 reports bootstrapped standard errors for the deterministic parameters using the procedure outlined in Web Appendix K.
The estimates of a τ > 0 and b τ < 1 suggest that the thresholds are time-varying and mean-reverting.Parameter c τ is Top panels: daily log-returns for the S&P500 index (left) and IBM common stock (right).Middle and bottom panels: filtered tail shape (ξ t , middle) and tail scale t parameters.The thresholds τ t are reported at a 90% confidence level; Value-at-Risk (VaR) and Expected Shortfall (ES) are plotted at an extreme 99% confidence level (top panels).The thresholds τ t , VaR, and ES are mirrored at the horizontal axis to correspond to log-returns (instead of percentage losses).The estimation samples range from July 3, 1962 to December 31, 2020 for the S&P500 index, and from January 2, 1926 to December 31, 2020 for the IBM stock.The reported samples range from July 3, 1962 to December 31, 2020.estimated to be negative for Portuguese bonds, and close to zero for Italian bonds.Parameters a ξ and a δ suggest pronounced and statistically significant time series variation in both the tail shape ξ t and tail scale δ t parameters, both of which are captured by our time-varying parameter model.The impact parameters c ξ and c δ of bond purchases on tail shape and scale are estimated to be negative in both cases.The loglikelihood increases by 21.9 points for IT, and 37.0 points for PT.A comparison of model selection criteria (AIC, BIC) further supports the inclusion of central bank asset purchases as a useful covariate to explain each time series' extreme tail dynamics.

Tail Parameter Estimates and VaR Impact
Figure 5 plots the corresponding filtered estimates for timevarying tail shape ξ t and tail scale δ t .Time series variation is present and pronounced in both parameters.The filtered tail shape varies between approximately 0.1 and 0.4 for Italian yields, and between 0.05 and 0.6 for Portuguese yields.The filtered tail scale varies between approximately 1 and 10.0 for Italian yields, and between approximately 1 and 40 for Portuguese yields.The standard error bands around each time-varying parameter suggest that both parameters are estimated reasonably precisely, and that the tail shape is often far from the Gumbel case of ξ t = 0.As was clear from Table 2, the coefficients c ξ and c δ measure the impact of bond purchases on the tail behavior of yield changes.As these parameters are difficult to interpret by themselves, we show in Web Appendix L how they can be translated into an impact on VaR via their link to τ t , δ t , and ξ t .For Italy, we obtain a total VaR impact of 0.0 (tail threshold) +0.8 (tail scale) − 5.9 (tail shape) = −5.1 bps for a e1 bn Eurosystem intervention.For Portugal, we obtain a larger impact of −0.1 (tail threshold) − 172.5 (tail scale) − 176.4 (tail shape) = −349.0bps.These point estimates are of course subject to substantial estimation uncertainty; see Table 2.The 95% confidence intervals for VaR impact can be bootstrapped along with the parameters.They equal [−17.6, 11.6] for  for PT.The stronger impact for Portugal than for Italy is likely due to a e1 bn intervention constituting a larger share of the overall market.

Conclusion
We introduced a semiparametric modeling framework to study time variation in tail parameters for long univariate time series.To this end we modeled the time variation in the shape and scale parameters of the Generalized Pareto Distribution, which approximates the tail of most heavy-tailed densities used in econometrics and the actuarial sciences.We discussed the handling of nontail time series observations, inference on deterministic and time-varying parameters, and how to relate tail variation to observed covariates if such variables are available.We also established conditions for stationarity and ergodicity of the model and conditions for consistency and asymptotic normality of the maximum likelihood estimator.The model therefore complements and extends recent work based on different methodologies, such as the nonparametric approach to tail index variation of de Haan and Zhou (2021), the time-varying quantile (and ES) approaches of Patton, Ziegel, and Chen (2019) and Catania and Luati (in press), and the parametric approach of Massacci (2017).We applied the model to study variation in the left tail of U.S. equity log-returns, and in the right tail of changes in euro area sovereign bond yields at a high frequency.In the latter case we also studied the impact of Eurosystem bond purchases, concluding that these had a beneficial impact on tail parameters, leaning against the risk of extremely adverse market outcomes.Evidently, our model for time-varying tail parameters is focussed on capturing marginal features.In many applications it may also be of interest to study the time-varying nature of joint extremes; see for example, Castro-Camilo, de Carvalho, and Wadsworth (2018), Escobar-Bach et al. (2018), and Mhalla, de Carvalho, and Chavez-Demoulin (2019).In terms of our first illustration, for example, one could wonder whether extremely negative log-returns for the S&P 500 and IBM stock were more dependent at certain points in time.We leave such research for future work (see also Lucas, Schwaab, and Zhang 2014;Oh and Patton 2018;Hautsch and Herrera 2020).

Figure 2 .
Figure 2. Stationarity and moments regions.Top row: the combinations of a ξ and b ξ that satisfy A2 (left panel), and (15) for p = 2 (middle panel) and p = 4 (right panel), for an increasing number of iterations r ≥ 1. Bottom row: similar, but for combinations of a δ and b δ instead.Other deterministic parameters for each panel are fixed at their empirical estimates for S&P500 returns; see Section 4.1.
provides a representative example of the simulation outcomes where we compare median estimated parameter paths for ξt , δt , VaR 0.99 , and ES 0.99 to their (pseudo)-true values.Full results are found in Web Appendix H.1.The true parameters in Figure 3 follow Path 3 from Section 3.1, and time-varying thresholds are estimated based on the recursive specification (10) and objective function (18).

Figure 4 .
Figure 4. tail parameters for S&P500 (left) and IBM (right) log-returns.Top panels: daily log-returns for the S&P500 index (left) and IBM common stock (right).Middle and bottom panels: filtered tail shape (ξ t , middle) and tail scale t parameters.The thresholds τ t are reported at a 90% confidence level; Value-at-Risk (VaR) and Expected Shortfall (ES) are plotted at an extreme 99% confidence level (top panels).The thresholds τ t , VaR, and ES are mirrored at the horizontal axis to correspond to log-returns (instead of percentage losses).The estimation samples range from July 3, 1962 to December 31, 2020 for the S&P500 index, and from January 2, 1926 to December 31, 2020 for the IBM stock.The reported samples range from July 3, 1962 to December 31, 2020.

Figure 5 .
Figure 5.The tail shape and tail scale estimates.Top row: Five-year sovereign benchmark bond yields for Italy (IT, left column) and Portugal (PT, right column) 2010 and 2012.Middle row: filtered tail shape (ξ t ) parameter.Bottom row: filtered tail scale (δ t ) parameter.Standard error bands are simulated at a 95% confidence level.

Table 1 .
RMSE outcomes for DGP1.Root mean squared error (RMSE) statistics for two different distributions (GPD and t, in columns) and for four different parameter paths for tail shape ξ t and tail scale δ t (paths (1)-(

Table 2 .
Parameter estimates.Parameter estimates for the dynamic tail shape model.The second and third columns refer to the first application (equity log-returns of the S&P500 index and IBM stock).The estimation samples range from July 3, 1962 to December 31, 2020 for the S&P500 index, and from January 2, 1926 to December 31, 2020 for IBM stock.The remaining columns refer to the second application (changes in sovereign yields).Columns labeled IT 5y and PT 5y refer to yield changes for Italy and Portuguese five-year benchmark bonds sampled at the 15-min frequency.The estimation samples range from January 4, 2010, 9 a.m., to December 31, 2012, 5 p.m.For the second application, b ξ and b δ are estimated indistinguishably different from, but still slightly below one.To be in line with theory, we fix the coefficients slightly below 1 at 0.9997 a for a = 1/32, such that at a daily (8 hr = 32 quarters) the coefficient is comparable to the highest IBM coefficient, b ξ .Standard error estimates are in round brackets, p-values are in square brackets.Standard errors and p-values are based on a sandwich covariance matrix estimator for the first application, and on a bootstrap procedure for the second application.