Fast and Flexible Bayesian Inference in Time-varying Parameter Regression Models

In this paper, we write the time-varying parameter (TVP) regression model involving K explanatory variables and T observations as a constant coefficient regression model with KT explanatory variables. In contrast with much of the existing literature which assumes coefficients to evolve according to a random walk, a hierarchical mixture model on the TVPs is introduced. The resulting model closely mimics a random coefficients specification which groups the TVPs into several regimes. These flexible mixtures allow for TVPs that feature a small, moderate or large number of structural breaks. We develop computationally efficient Bayesian econometric methods based on the singular value decomposition of the KT regressors. In artificial data, we find our methods to be accurate and much faster than standard approaches in terms of computation time. In an empirical exercise involving inflation forecasting using a large number of predictors, we find our models to forecast better than alternative approaches and document different patterns of parameter change than are found with approaches which assume random walk evolution of parameters.


Introduction
Time-varying parameter (TVP) regressions and Vector Autoregressions (VARs) have shown their usefulness in a range of applications in macroeconomics (e.g., Cogley and Sargent, 2005; Primiceri, 2005; D'Agostino et al., 2013). Particularly when the number of explanatory variables is large, Bayesian methods are typically used since prior information can be essential in overcoming over-parameterization concerns. These priors are often hierarchical and ensure parsimony by automatically shrinking coefficients. Examples include Belmonte et al. (2014), Kalli and Griffin (2014), Bitto and Frühwirth-Schnatter (2019) and Huber et al. (2021). Approaches such as these have two characteristics that we highlight so as to motivate the contributions of our paper. First, they use Markov Chain Monte Carlo (MCMC) methods which can be computationally demanding and are unable to scale up to the truly large data sets that macroeconomists now work with. Second, the regression coefficients in these TVP models are assumed to follow random walk or autoregressive (AR) processes. In this paper, we develop a new approach which is computationally efficient and scalable. Furthermore, it allows for more flexible patterns of time variation in the regression coefficients.
We achieve the computational gains by writing the TVP regression as a static regression with a particular, high dimensional, set of regressors. Using the singular value decomposition (SVD) of this set of regressors along with conditionally conjugate priors yields a computationally fast algorithm which scales well in high dimensions. One key feature of this approach is that no approximations are involved. This contrasts with other computationally-fast approaches to TVP regression which achieve computational gains by using approximate methods such as variational Bayes (Koop and Korobilis, 2018), message passing (Korobilis, 2021) or expectation maximization (Rockova and McAlinn, 2021).
Our computational approach avoids large-scale matrix operations altogether and exploits the fact that most of the matrices involved are (block) diagonal. In large dimensional contexts, this allows fast MCMC-based inference and thus enables the researcher to compute highly nonlinear functions of the time-varying regression coefficients while taking parameter uncertainty into account. Compared to estimation approaches based on forward-filtering backward-sampling (FFBS, see Carter and Kohn, 1994; Frühwirth-Schnatter, 1994) algorithms, the computational burden is light. In particular, we show that it rises (almost) linearly in the number of covariates.
For quarterly macroeconomic datasets that feature a few hundred observations, this allows us to estimate and forecast, exploiting all available information without using dimension reduction techniques such as principal components.
Computational tractability is one concern in high dimensional TVP regressions. The curse of dimensionality associated with estimating large dimensional TVP regressions is another.
To solve over-parameterization issues and achieve a high degree of flexibility in the type of coefficient change, we use a sparse finite mixture representation (see Malsiner-Walli et al., 2016) for the time-varying coefficients. This introduces shrinkage on the amount of time variation by pooling different time periods into a (potentially) small number of clusters. We also use shrinkage priors which allow for the detection of how many clusters are necessary. Shrinkage towards the cluster means is then introduced by specifying appropriate conjugate priors on the regression coefficients. At a general level, this model is closely related to random coefficient models commonly used in microeconometrics (see, e.g., Allenby et al., 1998; Lenk and DeSarbo, 2000). We propose three different choices for this prior. The first of these is based on Zellner's g-prior (Zellner, 1986). The second is based on the Minnesota prior (Doan et al., 1984; Litterman, 1986) and the final one is a ridge-type prior (see, e.g., Griffin and Brown, 2013). As opposed to a standard TVP regression which assumes that the states evolve smoothly over time, our model allows for abrupt changes (which might only happen occasionally) in the coefficients.
This resembles the behavior of regime switching models (see, e.g., Hamilton, 1989; Frühwirth-Schnatter, 2001). Compared to those, our approach has two additional advantages: it remains agnostic on the precise law of motion of the coefficients, and it endogenously finds the number of regimes.

We investigate the performance of our methods using two applications. Based on synthetic data, we first illustrate computational gains as K and T become large. We then proceed to show that our approach effectively recovers key properties of the data generating process. In a real-data application, we model US inflation dynamics. Our framework provides new insights on how the relationship between unemployment and inflation evolves over time. Moreover, in an extensive forecasting exercise we show that our proposed set of models performs well relative to a wide range of competing models. Specifically, we find that our model yields precise point and density forecasts for one-step-ahead and four-step-ahead predictions. Improvements in forecast accuracy are especially pronounced during recessionary episodes.
The remainder of the paper is structured as follows. Section 2 introduces the static representation of the TVP regression model while Section 3 shows how the SVD can be used to speed up computation. Section 4 provides an extensive discussion of our prior setup. The model is then applied to synthetic data in Section 5 and real data in Section 6. Finally, the last section summarizes and concludes the paper and the Online Appendix provides additional details on computation and further empirical findings.

A Static Representation of the TVP Model
Let {y_t}_{t=1}^T denote a scalar response variable that is described by a TVP regression given by

y_t = x_t'β_t + ε_t,  ε_t ∼ N(0, σ²),   (1)

where x_t is a K-dimensional vector of regressors, β_t is a set of K time-varying regression coefficients and σ² is the error variance. For now, we assume homoskedastic errors, but will relax this assumption later in the paper.
The TVP regression can be written as a static regression model as follows:

y = Zβ + ε,  ε ∼ N(0_T, σ² I_T),   (2)

where y = (y_1, ..., y_T)', β is a KT-dimensional vector of coefficients and Z is a T × KT design matrix whose t-th row is z_t' = (φ_t1', ..., φ_tT'), with each φ_ts being a K-dimensional vector. Equation 2 implies that the dynamic regression model in Equation 1 can be cast in the form of a standard linear regression model with KT predictors stored in the T × KT-dimensional design matrix Z. Notice that the rank of Z is equal to T and inverting Z'Z is not possible. We stress that, at this stage, we are agnostic on the evolution of β_t over time. A common assumption in the literature is that the latent states evolve according to a random walk. Such behavior can be achieved by setting φ_ts = x_t for all s ≤ t (and φ_ts = 0 otherwise), implying a lower triangular matrix Z. If, instead, we set φ_tt = x_t and φ_ts = 0 for s ≠ t, for all t, we obtain a block-diagonal matrix Z which, in combination with a Gaussian prior on β, would imply a white noise state equation.
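To make the static representation concrete, the following sketch constructs Z for both cases using hypothetical simulated data; all names and dimensions are illustrative and not taken from the paper.

```r
# Sketch: build the static-form design matrix Z from a T x K regressor matrix X.
# Hypothetical simulated data; not the authors' code.
set.seed(1)
Tobs <- 50; K <- 3
X <- matrix(rnorm(Tobs * K), nrow = Tobs)                 # rows are x_t'

# Block-diagonal Z: x_t enters only the t-th block (white noise state equation)
Z_block <- matrix(0, nrow = Tobs, ncol = K * Tobs)
for (t in 1:Tobs) Z_block[t, ((t - 1) * K + 1):(t * K)] <- X[t, ]

# Lower triangular Z: x_t enters blocks 1,...,t (random walk state equation, so the
# elements of beta are the shocks and beta_t is their cumulative sum up to time t)
Z_lower <- matrix(0, nrow = Tobs, ncol = K * Tobs)
for (t in 1:Tobs) for (s in 1:t) Z_lower[t, ((s - 1) * K + 1):(s * K)] <- X[t, ]

qr(Z_block)$rank                                          # equals Tobs, i.e. rank(Z) = T
```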
The researcher may want to investigate whether any explanatory variable has a time-varying, constant or a zero coefficient. In such a case, it proves convenient to work with a different parameterization of the model which decomposes β_t into a time-invariant (γ) and a time-varying part (β̃_t):

y = Xγ + Zβ̃ + ε,   (3)

with X = (x_1, ..., x_T)' denoting a T × K matrix of stacked covariates and β_t = γ + β̃_t. This specification is in the spirit of approaches that distinguish between constant and time-varying coefficients (see, e.g., Giordani and Kohn, 2008), and allows the model to decide the points in time when it is necessary to allow for parameter change.
In the theoretical discussion which follows, we will focus on the time-varying part of the regression model, Zβ̃, since sampling from the conditional posterior of γ (under a Gaussian shrinkage prior and conditional on Zβ̃) is straightforward. In principle, any shrinkage prior can be introduced on γ.
In our empirical work, we use a hierarchical Normal-Gamma prior: each element γ_j of γ (j = 1, ..., K) receives a Gaussian prior with local scaling parameter τ_j, the local scalings receive Gamma priors governed by ϑ, and a global shrinkage parameter receives a Gamma prior with hyperparameters a_0 and a_1. We set ϑ = 0.1 and a_0 = a_1 = 0.01 and use MCMC methods to learn about the posterior for these parameters. The relevant posterior conditionals are given in Griffin and Brown (2010) and Section A of the Online Appendix. (In the case of a lower triangular Z, the β̃_t's can be interpreted as the shocks to the latent states, with the actual value of the TVPs at time t given by ∑_{s=1}^{t} β̃_s.)
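For illustration, the sketch below draws γ from its Gaussian conditional posterior given Zβ̃, using a simple N(0, τI) prior as a stand-in for the hierarchical Normal-Gamma prior; all quantities are hypothetical and the sketch continues the simulated objects from above.

```r
# Sketch: draw gamma from its Gaussian conditional posterior given Z %*% beta_tilde,
# under a simple N(0, tau * I) prior (stand-in for the hierarchical Normal-Gamma prior).
tau <- 1; sigma2 <- 1
y_obs <- rnorm(Tobs)                          # placeholder response
Zb    <- rnorm(Tobs)                          # placeholder for Z %*% beta_tilde
y_til <- y_obs - Zb                           # condition on the time-varying part
V_gam <- solve(crossprod(X) / sigma2 + diag(1 / tau, K))
m_gam <- V_gam %*% crossprod(X, y_til) / sigma2
gamma_draw <- as.numeric(m_gam + t(chol(V_gam)) %*% rnorm(K))
```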

The Homoskedastic Case
In static regressions with huge numbers of explanatory variables, there are several methods for ensuring parsimony that involve compressing the data. Traditionally principal components or factor methods have been used (see Stock and Watson, 2011). Random compression methods have also been used with TVP models (see Koop et al., 2019).
The SVD of our matrix of explanatory variables, Z, is

Z = UΛV',

whereby U and V are orthogonal matrices and Λ denotes a diagonal matrix with the singular values of Z, collected in the vector λ, as diagonal elements.
The usefulness and theoretical soundness of the SVD for compressing regressions is demonstrated in Trippe et al. (2019). They use it as an approximate method: in a case with K regressors, they retain only the part of the SVD corresponding to the largest M singular values, where M < K, which renders their approach approximate.
In our case, we can exploit the fact that rank(Z) = T (with T ≪ KT) and utilize the SVD of Z as in Trippe et al. (2019). However, we do not truncate the SVD by keeping only the M largest singular values; instead we use all T of them. Since the rank of Z is T (≪ KT), our approach yields an exact low rank structure, implying no loss of information through the SVD.
Thus, using the SVD we can exactly recover the big matrix Z. The reason for using the SVD instead of Z is that we can exploit several convenient properties of the SVD that speed up computation. To be specific, if we use a Gaussian prior, this leads to a computationally particularly convenient expression of the posterior distribution ofβ which avoids complicated matrix manipulations such as inversion and the Cholesky decomposition of high-dimensional matrices. Hence, computation is fast.
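The exactness of the low-rank structure is easy to verify numerically. Continuing the simulated Z_block from the sketch above, the thin SVD reproduces Z up to machine precision:

```r
# Sketch: the thin SVD of Z (rank T << KT) recovers Z exactly -- no information is lost.
sv <- svd(Z_block)                     # for a T x KT matrix, R returns the thin SVD
U <- sv$u; lambda <- sv$d; V <- sv$v   # U: T x T, lambda: length T, V: KT x T
Z_re <- U %*% diag(lambda) %*% t(V)
max(abs(Z_re - Z_block))               # numerically zero: exact low-rank structure
```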
We assume a conjugate prior of the form

β̃ | σ² ∼ N(b_0, σ² D_0),

with D_0 = I_T ⊗ Ψ being a KT-dimensional diagonal prior variance-covariance matrix, where I_T denotes a T-dimensional identity matrix and Ψ a K × K-dimensional diagonal matrix that contains covariate-specific shrinkage parameters on its main diagonal. Our prior will be hierarchical so that Ψ will depend on other prior hyperparameters θ to be defined later.
Using textbook results for the Gaussian linear regression model with a conjugate prior (conditional on the time-invariant coefficients γ), the posterior is

β̃ | σ², y ∼ N(µ_β̃, σ² V_β̃),  with V_β̃ = (Z'Z + D_0^{-1})^{-1} and µ_β̃ = V_β̃ (Z'(y − Xγ) + D_0^{-1} b_0).

In conventional regression contexts, the computational bottleneck is typically the KT × KT matrix V_β̃. However, with our SVD regression, Trippe et al. (2019) show this to take the form

V_β̃ = D_0 − D_0 V Ξ V' D_0,  with Ξ = (diag(λ ⊙ λ)^{-1} + V' D_0 V)^{-1},   (5)

with ⊙ denoting element-wise multiplication. Crucially, the T × T matrix V' D_0 V is diagonal for a block-diagonal Z, so that Ξ is diagonal as well. If Z is lower triangular and we set D_0 = ξ I_KT (i.e., assume a ridge-type prior), the matrix Ξ again reduces to a diagonal matrix. The main computational hurdle boils down to computing V Ξ V', but, for a block-diagonal Z, this is a sparse matrix and efficient algorithms can be used. In case we use a lower triangular Z coupled with a ridge-prior, computation can be sped up enormously by noting that V' D_0 V = ξ I_T, so that Ξ is available in closed form. The resulting computation time, conditional on a fixed T, rises approximately linearly in K because most of the matrices involved are (block) diagonal and sparse. The key feature of our algorithm is that we entirely avoid inverting a full matrix. The only inversion involved is that of Ξ, which can be carried out in O(T) steps.
To efficiently simulate β̃ ∼ N(µ_β̃, σ² V_β̃) using Equation 5, we exploit Algorithm 3 proposed in Cong et al. (2017). In the first step, this algorithm samples a ∼ N(0_TK, D_0) and b ∼ N(0_T, diag(λ ⊙ λ)^{-1}). In the second step, a valid draw of β̃ is obtained by computing

β̃ = µ_β̃ + σ (a − D_0 V Ξ (V'a − b)).

Step 2 is trivial since Ξ is diagonal for a block-diagonal Z and also for a lower triangular Z with a ridge-prior. Hence, sampling of β̃ is fast and scalable to large dimensions.
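As a concrete illustration of the algebra above, the following sketch computes the posterior moments and one draw of β̃ for the simulated objects from the previous sketches, assuming b_0 = 0, a ridge prior D_0 = ξI and a fixed σ (all hypothetical choices). It is a sketch of the computations, not the implementation used in the paper.

```r
# Sketch: exact posterior moments and one draw of beta_tilde via the SVD form of the
# posterior covariance, assuming b_0 = 0, a ridge prior D0 = xi * I and fixed sigma.
xi    <- 0.1                                   # ridge prior scaling (hypothetical)
sigma <- 1                                     # error standard deviation (fixed here)
gam   <- rep(0, K)                             # condition on the time-invariant part
y_h   <- rnorm(Tobs)                           # placeholder response
ystar <- y_h - X %*% gam

xi_diag <- 1 / (1 / lambda^2 + xi)             # diagonal of Xi = (Lambda^-2 + V'D0V)^-1
Zty     <- V %*% (lambda * crossprod(U, ystar))                        # Z'ystar = V Lambda U'ystar
mu_post <- xi * Zty - xi * V %*% (xi_diag * (xi * crossprod(V, Zty)))  # V_beta %*% Z'ystar

# two-step sampler: a ~ N(0, D0), b ~ N(0, diag(lambda * lambda)^-1)
a <- sqrt(xi) * rnorm(K * Tobs)
b <- rnorm(Tobs) / lambda
corr_term <- xi * (V %*% (xi_diag * (crossprod(V, a) - b)))            # D0 V Xi (V'a - b)
beta_draw <- as.numeric(mu_post) + sigma * (a - as.numeric(corr_term))
```

Note that only T-dimensional diagonal quantities and matrix-vector products involving V appear, which is what makes the draw cheap relative to working with the KT × KT posterior covariance directly.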
In this sub-section, we have described computationally efficient methods for doing Bayesian estimation in the homoskedastic Gaussian linear regression model when the number of explanatory variables is large. They can be used in any Big Data regression model, but here we are using them in the context of our TVP regression model written in static form as in Equation 3.
These methods involve transforming the matrix of explanatory variables using the SVD. If the matrices of prior hyperparameters, b_0 and D_0, were known and if homoskedasticity were a reasonable assumption, then textbook conjugate-prior results for Bayesian inference in the Gaussian linear regression model would be all that is required. Analytical results are available for this case and there would be no need for MCMC methods. This is the case covered by Trippe et al. (2019). However, in macroeconomic data sets, homoskedasticity is often not a reasonable assumption. And it is unlikely that the researcher would be able to make sensible choices for b_0 and D_0 in this high-dimensional context. Accordingly, we will develop methods for adding stochastic volatility and propose a hierarchical prior for the regression coefficients.

Adding Stochastic Volatility
Stochastic volatility typically is an important feature of successful macroeconomic forecasting models (e.g., Clark, 2011). We incorporate this by replacing σ² in Equation 3 with Σ = diag(σ_1², ..., σ_T²) ⊗ I_K. This implies that the prior on β̃ is

β̃ | Σ ∼ N(b_0, Σ D_0).

Note that the prior in a specific period is given by

β̃_t | σ_t² ∼ N(b_{0t}, σ_t² Ψ),

with b_{0t} being the relevant block associated with the t-th period. Thus, it can be seen that the degree of shrinkage changes with σ_t², implying less shrinkage in more volatile times. From a computational perspective, assuming that σ_t² scales the prior variances enables us to factor Σ out of the posterior covariance matrix and thus obtain computational gains because D_0 does not need to be updated in every iteration of the MCMC algorithm. From an econometric perspective, the feature that shrinkage decreases if error volatilities are large implies that, in situations characterized by substantial uncertainty, our approach naturally allows for large shifts in the TVPs and thus permits swift adjustments to changing economic conditions. Our forecasting results suggest that this behavior improves predictive accuracy in turbulent times such as the global financial crisis.
We assume that h_t = log(σ_t²) follows an AR(1) process:

h_t = µ_h + ρ_h (h_{t−1} − µ_h) + ν_t,  ν_t ∼ N(0, σ_h²).

In our empirical work, we follow Kastner and Frühwirth-Schnatter (2014) and specify a Gaussian prior on the unconditional mean, µ_h ∼ N(0, 10), a Beta prior on the (transformed) persistence parameter, (ρ_h + 1)/2 ∼ B(25, 5), and a non-conjugate Gamma prior on the process innovation variance, σ_h² ∼ G(1/2, 1/2). Bayesian estimation of the volatilities proceeds using MCMC methods based on the algorithm of Kastner and Frühwirth-Schnatter (2014). A small alteration to this algorithm needs to be made due to the dependency of the prior of β̃_t on σ_t (see the Online Appendix for details).
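For intuition, a minimal sketch of the assumed log-volatility process and the implied period-specific prior scaling (parameter values are hypothetical):

```r
# Sketch: AR(1) log-volatility h_t = mu_h + rho_h (h_{t-1} - mu_h) + nu_t, nu_t ~ N(0, sig2_h),
# and the implied period-t prior scaling sigma_t^2 = exp(h_t). Parameter values are hypothetical.
mu_h <- -1; rho_h <- 0.9; sig2_h <- 0.1
h <- numeric(Tobs); h[1] <- mu_h
for (t in 2:Tobs) h[t] <- mu_h + rho_h * (h[t - 1] - mu_h) + sqrt(sig2_h) * rnorm(1)
sigma2_t <- exp(h)        # enters the period-t prior beta_tilde_t ~ N(b_0t, sigma_t^2 * Psi)
```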

Posterior Computation
Conditional on the specific choice of the prior on the regression coefficients (discussed in the next section) we carry out posterior inference using a relatively straightforward MCMC algorithm.
Most steps of this algorithm are standard and we provide exact forms of the conditional posterior distributions, the precise algorithm and additional information on MCMC mixing in the Online Appendix.
Here it suffices to note that we repeat our MCMC algorithm 30,000 times and discard the first 10,000 draws as burn-in.

General Considerations
With hierarchical priors, where b 0 and/or D 0 depend on unknown parameters, MCMC methods based on the full conditional posterior distributions are typically used. In our case, we would need to recompute the enormous matrix Vβ and its Cholesky factor at every MCMC draw.
This contrasts with the non-hierarchical case with fixed b 0 and D 0 where Vβ is calculated once.
Due to this consideration, we wish to avoid using MCMC methods based on the full posterior conditionals.
Many priors, including the three introduced here, have D 0 depending on a small number of prior hyperparameters. These can be simulated using a Metropolis Hastings (MH) algorithm.
With such an algorithm, updating of V_β̃ only takes place for accepted draws (in our forecasting exercise, roughly 30% of draws are accepted). Since priors featuring closed-form full conditional posteriors for the hyperparameters would require recomputing V_β̃ in every iteration of the posterior simulator, this reduces computation time appreciably.

The Prior Covariance Matrix
In this paper, we consider three different hierarchical priors for β̃. Since our empirical application centers on forecasting inflation, the predictors x_t will be structured as follows: x_t = (y_{t−1}, ..., y_{t−p_y}, d_{t−1}', ..., d_{t−p_d}', 1)', with d_t denoting a set of N exogenous regressors and p_y and p_d being the maximum number of lags for the response and the exogenous variables, respectively. In what follows, we will assume that p = p_y = p_d. In principle, using different lag lengths is easily possible.
The first prior is inspired by the Minnesota prior (see Litterman, 1986). It captures the idea that own lags are typically more important than other lags and, thus, require separate shrinkage. It also captures the idea that more distant lags are likely to be less important than more recent ones. Our variant of the Minnesota prior translates these ideas to control the amount of time-variation, implying that coefficients on own lags might feature more time-variation while parameters associated with other lags feature less time-variation. The same notion carries over to coefficients related to more distant lags, which should feature less time-variation a priori.
This prior involves two hyperparameters to be estimated: θ = (ζ_1, ζ_2)'. These hyperparameters are used to parameterize Ψ so that the prior variances mimic the Minnesota prior: ζ_1 governs the variances of the coefficients associated with y_{t−l} (l = 1, ..., p), while ζ_2, scaled by σ̂_y²/σ̂_j², governs the variances of the coefficients associated with d_{j,t−l}, with the variances decreasing in the lag length l. Here, we let [Ψ]_ii denote the (i, i)-th element of Ψ, d_{jt} refers to the j-th element of d_t, and σ̂_y² and σ̂_j² denote the OLS error variances obtained by estimating AR(p) models in y_t and d_{jt}, respectively.
The hyperpriors on ζ_1 and ζ_2 follow Uniform distributions: ζ_1 ∼ U(s_{0,1}, s_{1,1}) and ζ_2 ∼ U(s_{0,2}, s_{1,2}).

The second prior we use is a variant of the g-prior involving a single prior hyperparameter, θ = ξ. This specification amounts to setting Ψ = ξ × Ω, where Ω is a diagonal matrix with the (i, i)-th element defined as [Ω]_ii = σ̂_y²/σ̂_j². For reasons outlined in Doan et al. (1984), we depart from using the diagonal elements of (X'X)^{-1} to scale our prior and rely on the OLS variances of an AR(p) model, as in the case of the Minnesota-type prior.

The third prior is a ridge-type prior which simply sets Ψ = ξ × I_K. This specification is used in the case of a lower triangular Z for reasons outlined in Section 3. While being simple, this prior has been shown to work well in a wide range of applications (Griffin and Brown, 2013).
Similar to the Minnesota prior we again use a Uniform prior on ξ in both cases: ξ ∼ U(s 0 , s 1 ).
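To fix notation, the sketch below builds the diagonal matrix Ψ under the three priors. The Minnesota-type variances follow the usual structure (own lags shrunk by ζ_1/l², other lags by ζ_2 σ̂_y²/(l² σ̂_j²)); the exact expressions and the treatment of the intercept are assumptions made for illustration only.

```r
# Sketch: diagonal prior scaling matrix Psi under the three priors, for regressors ordered as
# (y_{t-1}, ..., y_{t-p}, d_{t-1}', ..., d_{t-p}', 1). The Minnesota-type variances and the
# intercept treatment below are assumptions made for illustration.
build_Psi <- function(prior = c("minnesota", "g", "ridge"),
                      p, N, sig2_y, sig2_d, zeta1 = 0.04, zeta2 = 0.01, xi = 0.01) {
  prior <- match.arg(prior)
  own   <- sapply(1:p, function(l) switch(prior,
             minnesota = zeta1 / l^2, g = xi, ridge = xi))
  other <- unlist(lapply(1:p, function(l) sapply(1:N, function(j) switch(prior,
             minnesota = zeta2 * sig2_y / (l^2 * sig2_d[j]),
             g         = xi * sig2_y / sig2_d[j],
             ridge     = xi))))
  icpt  <- switch(prior, minnesota = zeta1, g = xi, ridge = xi)
  diag(c(own, other, icpt))                     # K x K with K = p * (1 + N) + 1
}
Psi_minn <- build_Psi("minnesota", p = 2, N = 3, sig2_y = 1, sig2_d = c(0.5, 2, 1))
```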
Since we aim to infer ξ, ζ_1 and ζ_2 from the data, we set s_0 = s_{0,1} = s_{0,2} = 10^{−10}, close to zero, and specify the upper bounds {s_1, s_{1,1}, s_{1,2}} as functions of a constant κ, the number of observations T and the number of covariates. Here, κ is less than or equal to unity to avoid excessive overfitting in light of large K and T. Since large values of κ translate into excessive time variation in β̃_t, we need to select κ carefully. The hyperparameters of this prior are inspired by the risk inflation criterion put forward in Foster and George (1994), which would correspond to setting ξ = 1/K². Since this criterion was developed for a standard linear regression model, it would introduce too little shrinkage in our framework (or, if we set ξ = 1/(TK)², too much shrinkage, ruling out any time-variation). Our approach lets the data speak but essentially implies that the bound of the prior is increasing in T and decreasing in the number of covariates. Intuitively speaking, our prior implies that if the length of the time series increases, the prior probability of observing substantial structural breaks also increases slightly.
In the empirical application, we infer κ over a grid of values and select the κ that yields the best forecasting performance in terms of log predictive scores. Further discussion of and empirical evidence relating to κ (and G) is given in Section C of the Online Appendix.
The methods developed in this paper will hold for any choice of prior covariance matrix, D 0 , although assuming it to be diagonal greatly speeds up computation. In this sub-section, we have proposed three forms for it which we shall (with some abuse of terminology) refer to as the Minnesota, g-prior and ridge-prior forms, respectively, in the following material.

The Prior Mean
As for the prior mean, b_0, it can take a range of possible forms. The simplest option is to set it to zero. After all, from Equation 3 it can be seen that β̃_t measures the deviation from the constant coefficient case which, on average, is zero. This is what we do with the Minnesota prior and if we set Z to be lower triangular. However, it is possible that we can gain estimation accuracy through pooling information across coefficients by adding extra layers to the prior hierarchy. In this paper, we do so using a sparse finite location mixture of Gaussians, adapting the methods of Malsiner-Walli et al. (2016). In the discussion below, we refer to these two treatments of the prior mean as non-clustered and clustered, respectively. With the g-prior, we consider both clustered and non-clustered approaches.
We emphasize that both of these specifications for the prior mean are very flexible and let the data decide on the form that the change in parameters takes. This contrasts with standard TVP regression models, where it is common to assume that the states evolve according to random walks. This implies that the prior mean of β t is β t−1 .
With the clustered approach, we assume that each β̃_t has a prior of the following form:

p(β̃_t | w, µ_1, ..., µ_G, σ_t², Ψ) = ∑_{g=1}^{G} w_g f_N(β̃_t | µ_g, σ_t² Ψ),

where f_N denotes the density of a Gaussian distribution and the w_g are component weights with ∑_{g=1}^{G} w_g = 1 and w_g ≥ 0 for all g. µ_g (g = 1, ..., G) denotes the G component-specific means, with G being a potentially large integer that is much smaller than T (i.e., G ≪ T).
An equivalent representation, based on auxiliary variables δ_t, is

β̃_t | (δ_t = g) ∼ N(µ_g, σ_t² Ψ),   (8)

with Pr(δ_t = g) = w_g being the probability that β̃_t is assigned to group g. Equation 8 can be interpreted as a state evolution equation which resembles a hierarchical factor model, since each β̃_t clusters around the different component means µ_g. As opposed to assuming a random walk state evolution, which yields smoothly varying TVPs, this model provides more flexibility by pulling β̃_t towards G ≤ T prior means. Under the prior in Equation 8, our model can be interpreted as a random coefficients model (for a Bayesian treatment, see, e.g., Frühwirth-Schnatter et al., 2004). Two remarks are in order. First, using the Minnesota prior in combination with the clustering specification introduced in this sub-section is less sensible: its form, involving different treatments of coefficients on lagged dependent variables and exogenous variables and smaller prior variances for longer lag lengths, already, in a sense, clusters the coefficients into groups. A similar argument holds for a lower triangular matrix Z, since that would translate into a random walk with a (potentially) time-varying drift term. Second, for a detailed discussion of the relationship between sparse finite mixtures and Dirichlet process mixtures, see Frühwirth-Schnatter and Malsiner-Walli (2019).
Before proceeding to the exact prior setup, it is worth noting that the mixture model is not identified with respect to relabeling the latent indicators. In the forecasting application, we consider functions of the states which are not affected by label switching. Thus, we apply the random permutation sampler of Frühwirth-Schnatter (2001) to randomly relabel the states in order to make sure that our algorithm visits the different modes of the posterior. In what follows, we define m_t = µ_g if δ_t = g. Using this notation, the prior mean is given by b_0 = (m_1', ..., m_T')'. For the weights w = (w_1, ..., w_G)', we use a symmetric Dirichlet prior: w | π ∼ Dir(π, ..., π).
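Schematically, the mixture representation implies standard Gibbs updates for the cluster indicators and weights. The sketch below, with hypothetical current values of β̃_t, µ_g, Ψ and σ_t² and ignoring the additional shrinkage layers discussed next, illustrates these two steps; it is not the exact algorithm of the Online Appendix.

```r
# Sketch: one Gibbs pass for the cluster indicators delta_t and the weights w, given current
# (hypothetical) values of beta_tilde_t, the component means, Psi and sigma_t^2.
library(mvtnorm)                                # for the multivariate normal density
G     <- 4
mu_g  <- matrix(rnorm(G * K), nrow = G)         # component means (one row per cluster)
bt    <- matrix(rnorm(Tobs * K), nrow = Tobs)   # current draws of beta_tilde_t
w     <- rep(1 / G, G); pi0 <- 0.01             # weights and Dirichlet intensity parameter
Psi_c <- diag(0.1, K)

delta <- integer(Tobs)
for (t in 1:Tobs) {
  lp <- log(w) + sapply(1:G, function(g)
          dmvnorm(bt[t, ], mean = mu_g[g, ], sigma = sigma2_t[t] * Psi_c, log = TRUE))
  p_t <- exp(lp - max(lp)); p_t <- p_t / sum(p_t)          # normalize for numerical stability
  delta[t] <- sample(1:G, 1, prob = p_t)
}
Tg <- tabulate(delta, nbins = G)                            # cluster sizes
gd <- rgamma(G, shape = pi0 + Tg, rate = 1)                 # conjugate Dirichlet update:
w  <- gd / sum(gd)                                          # w | delta ~ Dir(pi0 + Tg)
```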
Here, π denotes the intensity parameter that determines how the model behaves in treating superfluous components. If π ≤ K/2, irrelevant components are emptied out, while if π > K/2, the model tends to duplicate component densities to handle overfitting issues. This implies that careful selection of π is crucial since it influences the number of breaks in β̃_t. The literature suggests different strategies based on traditional model selection criteria or reversible jump MCMC algorithms to infer G from the data. Our approach closely follows Malsiner-Walli et al. (2016) and uses a shrinkage prior on π. The prior we adopt is a Gamma distribution that places substantial mass on small values of π, so that superfluous components are emptied out. To assess which elements in µ_g determine the group membership, we use yet another shrinkage prior on the component means: the µ_g are centered on a common mean µ_0 with a diagonal prior variance-covariance matrix whose elements υ_1, ..., υ_K receive Gamma priors, translating into the Normal-Gamma prior of Griffin and Brown (2010). The common variance factor implicitly affects the tightness of the prior and ensures (conditional) conjugacy.
Compared to a standard time-varying parameter model which assumes a random walk state evolution, our prior on β t is invariant with respect to time, up to a scaling factor σ t . If σ t is constant, (β 1 , . . . , β T ) has the same prior distribution as (β ρ(1) , . . . , β ρ(T ) ) for any permutation ρ. In our general case, temporal dependence is not an assumption, but arises through appropriately choosing x t and by allowing for prior dependence on σ t . In the extreme case where x t does not include lagged values of y t (we include several lags of y t in our empirical work) and σ t is constant, the dynamic nature of the model is lost since the model is invariant to reordering the time series with respect to t and no dependency is imposed.

Illustration Using Artificial Data
In this section we illustrate our modeling approach that utilizes the g-prior and clustering by means of synthetic data simulated from a simple data generating process (DGP).
We begin by illustrating the computational advantages arising from using the SVD, relative to a standard Bayesian approach to TVP regression which involves random walk evolution of the coefficients and the use of FFBS, as well as a model estimated using the "all without a loop" precision sampler (AWOL, see Chan and Jeliazkov, 2009; McCausland et al., 2011; Kastner and Frühwirth-Schnatter, 2014). Figure 1(a) shows a comparison of the time necessary to generate a draw from p(β̃ | Data, γ, σ²) using our algorithm based on the SVD, the FFBS algorithm and the AWOL sampler as a function of K ∈ {1, 2, ..., 150} and for T = 200. To illustrate how computation times change with T, Figure 1(b) varies T while holding K fixed. In panel (a), we fit a (non-)linear trend on the empirical estimation times of the different approaches. This shows that while the computational burden is cubic in the number of covariates K for the FFBS approach, runtimes for our SVD-based technique increase (almost) linearly in K.
Notice that the figure clearly shows that traditional algorithms based on FFBS quickly become infeasible in high dimensions. Up to K ≈ 50, our algorithm (for both choices of Z) is slightly slower than FFBS, but the computational advantage increases remarkably with K, with our approach being more than four times as fast for K = 100 and over nine times as fast for K = 150.
When we compare the SVD to the AWOL algorithm we also observe sizeable improvements in estimation times. For K = 150, our proposed approach is almost four times faster. This performance is even more impressive given that our SVD approach is implemented in R, a high level interpreted language, while both FFBS and AWOL are efficiently implemented in Rcpp (Eddelbuettel et al., 2011).
Panel (b) of the figure shows that, for fixed K, computation times increase linearly in T for most approaches. The main exception is the case of a lower triangular Z, for which computation times grow non-linearly in T. This is because this approach relies on several non-sparse matrix-vector products. Since T is typically moderate in macroeconomic data, this does not constitute a main bottleneck of the algorithm for general matrices Z. It is, moreover, noteworthy that the slope of the line referring to FFBS is steeper than the ones associated with the SVD (for block-diagonal Z) and AWOL approaches. This reflects the fact that FFBS needs to perform both a filtering step (that scales linearly in T) and a smoothing step (that is also linear in T).
This brief discussion shows that the SVD algorithm scales well and renders estimation of huge dimensional models feasible.

Notes to Figure 1: The figure shows the actual and theoretical time necessary to obtain a draw of β̃ using our proposed SVD algorithm for Z being block-diagonal and lower triangular, an AWOL sampler (implemented in R through the shrinkTVP package of Knaus et al., 2021) and the FFBS algorithm. The dashed red lines refer to the SVD approach with a lower triangular Z and a ridge-prior, the dashed orange lines refer to the SVD algorithm with block-diagonal Z, the dashed green lines refer to the AWOL sampler, and the dashed blue lines indicate the FFBS. The dots refer to theoretical run times. A non-linear trend is fit on the empirical estimation times.
We now assume that y_t is generated by a simple DGP in which β̃_t depends on a mean process m_t that evolves according to a law of motion involving the indicator function I(•), which equals 1 if its argument is true.
Analyzing this stylized DGP allows us to illustrate how our approach can be used to infer key features of the data generating process, such as the number of regimes. In what follows, we simulate a single path of y_t and use this for estimating our model. We estimate the model using the g-prior with clustering and set G = 12. In this application, we show quantities that depend on the labeling of the latent indicators. This calls for appropriate identifying restrictions and we introduce the restriction that µ_1 < · · · < µ_G. This is not necessary if interest centers purely on predictive distributions and, thus, we do not impose this restriction in the forecasting section of this paper.
Before discussing how well our model recovers the true state vector β̃_t, we show how our modeling approach can be used to infer the number of groups G. Following Malsiner-Walli et al. (2016), the number of groups is estimated during MCMC sampling as the number of non-empty clusters, G_0^{(j)} = #{g : T_g^{(j)} > 0}, with T_g^{(j)} denoting the number of observations in cluster g for the j-th MCMC draw. This yields a posterior distribution for G_0. Its posterior mode can be used as a point estimate of G.
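A minimal sketch of this computation, based on a hypothetical matrix of sampled cluster indicators:

```r
# Sketch: posterior distribution and point estimate of the number of non-empty clusters G_0,
# based on a hypothetical ndraw x Tobs matrix of sampled cluster indicators.
ndraw <- 200
delta_draws <- matrix(sample(1:G, ndraw * Tobs, replace = TRUE), nrow = ndraw)
G0 <- apply(delta_draws, 1, function(d) sum(tabulate(d, nbins = G) > 0))
table(G0) / ndraw                                  # posterior probabilities of each G_0
G0_hat <- as.integer(names(which.max(table(G0))))  # posterior mode as point estimate of G
```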
In Table 1, we report the posterior probability of a given number of regimes by simply computing the fraction of draws with G 0 = g for g = 1, . . . , 12. The table suggests that the probability that G 0 = 4 is around 66 percent. This indicates that our algorithm successfully selects the correct number of groups, since the mode of the posterior distribution equals four.
It is also worth noting that the posterior mean of π is very small at 0.09, suggesting that our mixture model handles irrelevant components by emptying them rather than replicating them (which would be the case if π were large). Notice, however, that G_0 = 5 also receives some posterior support: about 26 percent of the posterior mass is associated with too many regimes. In the present model, this slight overfitting behavior might be caused by additional noise driven by the shocks to the states β̃_t, with our mixture model trying to fit the noise.
Next, we assess whether our model is able to recover β̃_t and m_t. Panel (a) of Figure 2 suggests that our approach recovers the evolution of β̃_t well. Considering panel (b) of Figure 2 reveals a similar picture: our approach yields credible sets that include the actual outcome of m_t for all t. This discussion shows that our model handles cases with infrequent breaks in the regression coefficients rather well. As compared to standard TVP regressions that imply a smooth evolution of the states, using a mixture model to determine the state evolution enables us to capture large and abrupt breaks.

The previous discussion has shown that our model works well if the DGP is characterized by relatively few breaks. In the next step, we test the model under a less favorable DGP: we assume that the law of motion of β̃_t is a random walk with a state innovation variance of one and β̃_0 = 3. The results are shown in Figure 3 and Table 2. Panel (a) shows that even when the DGP is characterized by many small breaks, our model is flexible enough to capture this behavior as well. This is because we essentially pool coefficients but also allow for idiosyncratic (i.e., time-specific) deviations from the common mean. If we consider panel (b), we observe that the mean process m_t captures the bulk of the variation in β̃_t. Table 2 suggests that even if we set G = 30, the sparse finite mixture allocates substantial posterior mass to lower values of G, with values of G between 15 and 21 accounting for over 80 percent of the posterior mass. Hence, even if the true DGP is a random walk and G is much smaller than T, our approach accurately recovers the full history of the latent states.
An Application to US Inflation

Data and selected in-sample features

Modeling and forecasting inflation is of great value for economic agents and policymakers. Inflation forecasting is therefore an ideal empirical application in which we can investigate the performance of our methods. An important criterion is the capacity of our approach to generalize standard TVP models, which are less flexible because they are based on random walk or autoregressive specifications to determine the evolution of the states. A second challenge is the correct detection of well-known structural breaks. In addition, we assess the forecasting performance of our methods relative to alternative approaches.
Following Stock and Watson (1999), we define the target variable as h-step-ahead inflation computed from the price level P_{t+h} (CPIAUCSL) in period t + h. Using this definition, we estimate a generalized Phillips curve involving 49 covariates plus the lagged value of y_t that cover different segments of the economy. Further information on the specific variables included and the way they are transformed is provided in Section B of the Online Appendix. The design matrix x_t includes p = p_y = p_d = 2 lags and an intercept and thus features K = 101 covariates.
Before we use our model to perform forecasting, we provide some information on computation times, illustrate some in-sample features of our model and briefly discuss selected posterior estimates of key parameters. Table 3 shows empirical runtimes (in minutes) for estimating the different models for this large dataset. As highlighted in the beginning of Section 5, our approaches start improving upon FFBS-based algorithms in terms of computation time if K exceeds 50, with the improvements increasing non-linearly in K. Hence, it is unsurprising that, for our present application with K = 101, our algorithm (without clustering) is almost five times faster than using FFBS and twice as fast as the efficient AWOL sampler. If clustering is added, our approach is still more than three times faster than FFBS. The additional computational complexity from using the clustering prior strongly depends on G. If G is close to T (which typically does not occur in practice and we thus do not consider this case), then the computation time increases and the advantage of using the SVD is diminished. This arises since estimating the location parameters of the mixtures becomes the bottleneck in our MCMC algorithm. Finally, using a random walk state evolution equation (i.e., a lower triangular Z) with a ridge-prior yields the strongest gains in terms of computational efficiency, being almost six times faster than FFBS and over twice as fast as the AWOL sampler.
To further illustrate the properties of the estimated parameters in our SVD approach using the g-prior with clustering, we now turn to a small-scale model. In this case, the number of coefficients is relatively small and features such as multipliers with an economic interpretation can be easily plotted. This model is inspired by the New Keynesian Phillips curve (NKPC). The dependent variable is inflation and the right hand side variables include two lags of unemployment and inflation. We set G = 30, thus allowing for a relatively large number of clusters. A negative relationship between K and G_0 is to be expected. That is, as model size increases, more of the variation over time can be captured by the richer information set in x_t, leaving less of a role for time variation in coefficients. Our clustering algorithm automatically adjusts to this effect.

Notes to the multiplier figure: Blue shaded areas are 68% credible intervals and gray shaded areas denote NBER recessions.

Forecasting evidence
The forecasting design adopted is recursive. We consider an initial estimation period from 1965Q1 to 1999Q4. The remaining observations (2000Q1 to 2018Q4) are used as a hold-out period to evaluate our forecasting methods. After obtaining h ∈ {1, 4}-step-ahead predictive distributions for a given period in the hold-out, we include this period in the estimation sample and repeat this procedure until we reach the end of the sample. In order to compute longer-horizon forecasts, we adopt the direct forecasting approach (see, e.g., Stock and Watson, 2002).
To assess forecasting accuracy, we use root mean square forecast errors (RMSEs) for point forecasts and log predictive likelihoods (LPLs, these are averaged over the hold-out period) for density forecasts. We evaluate the statistical significance of the forecasts relative to random walk (RW) forecasts using the Diebold and Mariano (1995) test.
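For concreteness, the following sketch computes both metrics from hypothetical hold-out realizations and predictive draws, using the predictive mean as the point forecast and a Gaussian approximation to each period's predictive density for the LPL (an assumption made here for simplicity):

```r
# Sketch: RMSE and average log predictive likelihood from hold-out realizations and
# predictive draws (both hypothetical).
n_hold <- 76; ndraw <- 500
y_real <- rnorm(n_hold)                                   # realized values (placeholder)
y_pred <- matrix(rnorm(n_hold * ndraw), nrow = n_hold)    # predictive draws, one row per period
rmse <- sqrt(mean((rowMeans(y_pred) - y_real)^2))         # point forecasts = predictive means
lpl  <- mean(dnorm(y_real, mean = rowMeans(y_pred),       # Gaussian approximation to the
                   sd = apply(y_pred, 1, sd), log = TRUE))# predictive density
```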
We compare four variants of our SVD approach (i.e., the Minnesota prior, the g-prior with and without clustering, and the SVD model with a random walk-type state evolution, labeled TVP-RW-SVD in the tables). With regard to the number of explanatory variables, we consider models with two lags of all 50 of them (labeled FULL in the tables), none of them, as well as some specifications which contain a subset of them. To be specific, we present results for all these models using the NKPC specification discussed in the preceding sub-section (labeled NKPC in the tables). We also have versions of the model where the intercept is the only explanatory variable, thus leading to an unobserved components model (labeled UCM in the tables). In addition, we include some simple benchmarks that have been used elsewhere in the literature. These include a constant coefficient AR(2) model, a TVP-AR(2) and an AR(2) augmented with two lags of the first three principal components of d_t (labeled PCA3). This model is closely related to the diffusion index model of Stock and Watson (2002). Additionally, we also compress the data to three dimensions using targeted random compressions (labeled TARP, see Mukhopadhyay and Dunson, 2020). For each of these two dimension reduction techniques we also present forecasts for a TVP-RW-FFBS model. All models considered include stochastic volatility.

Table 4 contains our main set of forecasting results. Note first that, with some exceptions, the FULL models do best, indicating that there is information in our K = 50 variables useful for inflation forecasting. If we focus on results for the FULL models, it can be seen that, for h = 1, all of the approaches forecast approximately as well as each other. But for h = 4 there are substantial improvements provided by our SVD approaches relative to the competitors. At this forecast horizon, it is interesting to note that the very parsimonious UCM version of the TVP-RW-FFBS provides point forecasts that are almost as good as those provided by the FULL SVD approaches. However, the density forecasts provided by the UCM are appreciably worse than those provided by the SVD approaches. The FULL SVD approaches also beat approaches based on dimension reduction (PCA, TARP), even if we allow for time-variation in the coefficients of these models.
Comparing the results of our SVD-based models with a block-diagonal Z to the ones which constrain the state evolution (i.e., TVP-RW-FFBS, TVP-fac-FFBS and TVP-RW-SVD) sheds light on how much the increased flexibility improves forecasting accuracy. In terms of one-step-ahead forecasts, we find that our flexible approaches yield very similar forecasts to those of TVP regressions with random walk state equations. This is consistent with the statement that for short-term forecasting, our model yields forecasts which are competitive with established methods in the literature. When we consider multi-step-ahead forecasts, we find pronounced improvements in terms of point and density forecasts for the FULL and NKPC models. Notice the better performance of TVP-fac-FFBS and TVP-RW-SVD relative to TVP-RW-FFBS. In the latter case, this is driven by the ridge-type prior which strongly shrinks the TVPs towards zero, whereas in the former case, the better performance can be attributed to the parsimonious factor structure on the TVPs.
With two different forecast horizons and two different forecast metrics, we have four possible ways of evaluating any approach. For three of these, the FULL SVD approach using the g-prior with clustering performs best. The only exception to this is for RMSEs for h = 1, although even here FULL SVD with g-prior is the second best performing approach. The improvements relative to our other SVD approaches which do not involve clustering are small, but are consistently present. This indicates the benefits of the clustering prior.
Notes to Table 4: The table shows RMSEs with LPLs in parentheses below. Asterisks indicate statistical significance for each model relative to a random walk at the 1 (***), 5 (**) and 10 (*) percent significance levels.

In general, the time-invariant (TIV) approaches do well (for h = 4 even better than TVP-RW-FFBS) in terms of point forecasts, but the density forecasts produced by our SVD approaches are slightly better. This suggests there is only a small amount of time-variation in this data set, but that our SVD approach (particularly when we add the hierarchical clustering prior) is effectively capturing structural breaks in a manner that the random walk evolution of the TVP-RW-FFBS and TVP-RW-SVD cannot. One pattern worth noting is that the benefits of using the FULL model increase after the beginning of the financial crisis. This is true not only for our SVD models, but also for the TIV model. However, notice that during the crisis, the slope of the line associated with the FULL SVD approach becomes steeper, indicating that the model strongly outperforms the RW for that specific time period. This potentially arises from the fact that during recessions, we typically face abrupt structural breaks in the regression parameters and our approach is capable of detecting them. To examine how our model performs in turbulent times, we focus on forecast accuracy in the Great Recession. It is worthwhile to keep in mind that inflation was fairly stable through 2008Q3. 2008Q4 and 2009Q1 were the periods associated with a substantial fall in inflation.
Subsequently, inflation became more stable again. Accordingly, it is particularly interesting to look at 2008Q4 and 2009Q1 as periods of possible parameter change. We find that the FULL SVD approach performs comparably to a no-change benchmark model. The simple RW model can be expected to handle a one-off structural break well in the sense that it will forecast poorly for the one period where the break occurs and then immediately adjust to the new lower level of the series. Our FULL SVD approach handles the 2008Q4 and 2009Q1 period about as well as the RW. Subsequently, its forecasts improve relative to the RW. This improvement occurs in the middle of the Great Recession for h = 1 and a bit later for h = 4. In contrast, the TVP-RW-FFBS and TVP-RW-SVD models with the large data set experience a big drop in forecasting performance at the beginning of the Great Recession and tend not to outperform the random walk after 2010. However, both do well in late 2009. We conjecture that this pattern of performance reflects two things. First, similarly to our SVD based models which do not constrain the state evolution, both allow for structural breaks, but are slow to adjust to them.
Second, they overfit the data and, thus, provide wide predictive distributions. In the latter half of 2009, after the structural break had occurred, when there was still uncertainty about the new pattern in inflation, having this wider predictive distribution benefited forecast performance.
This discussion provides evidence that our model works well under stressful conditions. The main mechanism driving the strong forecast performance is that the prior variance is allowed to adapt over time and if uncertainty increases (i.e., σ 2 t becomes large), the prior variances increase as well and thus make larger jumps in the parameters more probable.
It can also be seen that our SVD approaches with block diagonal Z tend to perform similarly to one another and never forecast very poorly. This contrasts with the TVP-RW-FFBS and TVP-RW-SVD models which sometimes forecast well, but sometimes yield imprecise forecasts (see, e.g., results for h = 4 using the NKPC data set).
Overall, we find our SVD approaches, and in particular the version that uses the clustering prior, to exhibit the best forecast performance among a set of popular benchmarks. And it is worth stressing that they are computationally efficient and, thus, scalable. The reason this application uses K = 101 explanatory variables, as opposed to a much larger number, is our wish to include the slower TVP-RW-FFBS approach so as to offer a comparison with the most popular TVP regression model. Had we omitted this comparison, we could have chosen K to be much larger.
Finally, a brief word on prior sensitivity is in order. The two key (hyper)parameters of our model are κ and G. In Section C of the Online Appendix we carried out an extensive prior robustness analysis. In this analysis we find that the precise choice of κ plays a limited role for predictive performance unless it is set too large. This statement holds for large models but, to a somewhat lesser extent, also for smaller models. In the smaller models, we find that predictive performance is slightly more sensitive to the choice of κ and the researcher thus has to select this hyperparameter with some care. When it comes to the choice of G, we find that as long as it is not set too small, forecasting accuracy does not change substantially. This finding indicates that our shrinkage prior on π successfully empties out irrelevant clusters if G is large. In an extreme case, that is, if G is set too small a priori, we lose important information on how states evolve over time and this is deleterious for predictive accuracy.

Conclusions
In many empirical applications in macroeconomics, there is strong evidence of parameter change.
But there is often uncertainty about the form the parameter change takes. Conventional approaches to TVP regression models have typically made specific assumptions on how the states evolve over time (e.g., random walk or structural break). In the specification used in this paper, no restriction is placed on the form that the parameter change can take. However, our very flexible specification poses challenges in terms of computation and surmounting overparameterization concerns. We have addressed the computational challenge through using the SVD of the high-dimensional set of regressors. We show how this leads to large simplifications since key matrices become diagonal or have banded forms. The over-parameterization worries are overcome through the use of hierarchical priors and, in particular, through the use of a sparse finite mixture representation for the time-varying coefficients.
In artificial data, we demonstrate the speed and scalability of our methods relative to standard approaches. In an inflation forecasting exercise, we show how our methods can uncover different forms of time-variation in parameters than other approaches. Furthermore, they forecast well. Since our approach is capable of quickly adjusting to changing economic conditions and outliers, it might also be well suited for macroeconomic forecasting in extreme periods such as the Covid-19 pandemic.

Online Appendix
A Full Conditional Posterior Simulation

A.1 Full Conditional Posterior Distributions

In this section, we provide details on the full conditional posterior distributions of the model described in Section 2. We start by outlining the relevant full conditionals for the time-invariant part of the model. The conditional posterior of the time-invariant coefficients γ follows a multivariate Gaussian distribution. We let σ² = (σ_1², ..., σ_T²)' denote a T-dimensional vector of volatilities, τ = (τ_1, ..., τ_K)' stores the K local scaling parameters of the NG prior, while the T × K-dimensional matrix X̃ is obtained by stacking the rows of x_t and normalizing by dividing each row by σ_t.
The full conditional posterior distribution of the global shrinkage parameter takes a standard form (see Griffin and Brown, 2010). To update θ, irrespective of the prior on β̃ adopted, we use a random walk Metropolis-Hastings (RWMH) step. Due to the hierarchical nature of the model, the likelihood p(β̃ | Σ, b_0, D_0) does not depend on the data and is given by the Gaussian prior density

p(β̃ | Σ, b_0, D_0) = f_N(β̃; b_0, Σ D_0).   (A.4)

This conditional distribution is then combined with the appropriate Uniform prior and is easy to evaluate since Σ D_0 is a diagonal matrix. As a proposal distribution for θ, we use a log-Normal distribution: log θ* = log θ^(a) + σ_θ ζ, ζ ∼ N(0, I).
Hereby, we let θ * and θ (a) denote the proposed and previously accepted value of θ, respectively.
Moreover, σ θ is a scaling factor that is specified such that the acceptance rate of the MH algorithm is between 20 and 40%. This is achieved by adjusting σ θ over the first 25% of the burn-in stage.
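A stylized sketch of such an adaptive random-walk MH update for a positive hyperparameter is given below; the target density is a placeholder and the adaptation rule is a simple illustration of the tuning described above, not the exact scheme used in the paper.

```r
# Sketch: random-walk MH update for a positive hyperparameter theta using a log-normal proposal,
# with crude adaptation of the proposal scale over the first 25% of the burn-in phase.
# The log-posterior below is a placeholder, not the model's actual conditional.
log_post <- function(theta) dgamma(theta, shape = 2, rate = 2, log = TRUE)
theta <- 0.5; sig_theta <- 0.1; acc <- 0
nburn <- 2000; nsave <- 2000; draws <- numeric(nsave)
for (irep in 1:(nburn + nsave)) {
  theta_star <- exp(log(theta) + sig_theta * rnorm(1))     # log-normal random-walk proposal
  # acceptance probability includes the Jacobian term theta_star / theta (proposal on log scale)
  log_alpha <- log_post(theta_star) - log_post(theta) + log(theta_star) - log(theta)
  if (log(runif(1)) < log_alpha) { theta <- theta_star; acc <- acc + 1 }
  if (irep <= nburn / 4 && irep %% 50 == 0) {               # tune towards a 20-40% acceptance rate
    rate <- acc / irep
    if (rate < 0.20) sig_theta <- sig_theta * 0.9
    if (rate > 0.40) sig_theta <- sig_theta * 1.1
  }
  if (irep > nburn) draws[irep - nburn] <- theta
}
```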
With the clustering prior, the algorithm becomes slightly more complicated and additional steps for updating the cluster indicators δ_t, the component weights w and the component means µ_g need to be added.
Notice that it is straightforward to sample from Equation A.7 because all involved quantities are easily vectorized.
Similarly, the full conditional posterior distribution of the common mean µ_0 follows a Gaussian distribution. The full conditional posterior distribution of the elements υ_1, ..., υ_K of the common variance-covariance matrix is a generalized inverse Gaussian (GIG) distribution. Here, µ_{gj} (g = 1, ..., G) denotes the j-th element of the group-specific mean µ_g, µ_{0j} is the j-th element of the common mean µ_0 and R_j is the j-th element of R.
Finally, a brief word on how to sample the log-volatilities is in order. Since we assume dependence between β̃_t and h_t, we have to modify the standard algorithm of Kastner and Frühwirth-Schnatter (2014). This is achieved by integrating out β̃_t analytically. This works irrespective of whether we use a block-diagonal Z and our mixture model or if we assume that the states evolve according to a random walk (i.e., a lower triangular Z). In what follows, we illustrate our approach under the assumption that β̃_t arises from a sparse finite mixture model. Plugging Equation 8 into the t-th equation of (3) and rewriting yields

y_t − x_t'(γ + m_t) = ς_t e^{h_t/2} ξ_t,  ξ_t ∼ N(0, 1),  with ς_t = (1 + x_t' Ψ x_t)^{1/2}.

Dividing by ς_t yields

(y_t − x_t'(γ + m_t))/ς_t = e^{h_t/2} ξ_t,

which is the observation equation of a SV model, and standard algorithms such as the one proposed in Kastner and Frühwirth-Schnatter (2014) can be used.

A.2 Posterior Computation
In this sub-section, we briefly summarize the main steps of the algorithm. Our algorithm cycles between the following steps: drawing β̃ using the SVD-based step, drawing γ, updating the prior hyperparameters θ via the RWMH step, updating the clustering quantities (δ_t, w, µ_g and the associated shrinkage parameters) where applicable, and drawing the log-volatilities.

B Data

The transformation codes include (5), denoting growth rates defined as log first differences, log(x_t/x_{t−1}). All variables are standardized by subtracting the mean and dividing by the standard deviation.

C.1 MCMC Mixing and Convergence Properties
In this sub-section we briefly analyze the mixing properties of our MCMC algorithm for the best performing TVP-SVD model in Table 4, which features the full information set, a pooling prior and κ = 0.05. To this end, we report inefficiency factors (IFs) and Raftery and Lewis (1992) diagnostics for the TVPs and the error volatilities. We focus on these two quantities because these are the ones which we use to set up predictive distributions and, moreover, these are the ones we are interested in if we focus on functions of the parameters such as the multipliers reported in Figure 4. The upper part of Table C1 shows the IFs. Since our state space is enormous, we report summary statistics across coefficients and time. In principle, inefficiency factors below 20 are considered acceptable (Primiceri, 2005). For both the TVPs and the error variances we find IFs which are on average considerably below 10. For the volatilities, the maximum of IFs is 12.54.
For the TVPs, we find a maximum inefficiency factor of around 58. Notice, however, that the 95 th percentile is close to 20. Thus, according to the IFs, we find that our MCMC algorithm mixes rapidly and yields draws which display a relatively modest amount of autocorrelation across successive MCMC draws.
The bottom part of Table C1 shows the Raftery and Lewis (1992) diagnostics. This metric measures the number of iterations of the MCMC algorithm necessary to achieve a certain level of precision. These numbers are well below the total number of iterations (with a maximum of 1,380 for β_t and 968 for σ_t²). This suggests that our algorithm performs well empirically according to both diagnostics adopted.

C.2 Prior Robustness

We start our analysis by considering how κ affects forecasting accuracy by repeating our forecasting exercise using the SVD approach combined with our set of priors and three different model sizes, but allowing a range of values for κ. Results are given in Table C2. Note first that for κ = 1, which is the largest value we considered, no results are given for the FULL or NKPC models. For these cases, κ = 1 led to severe over-fitting problems and resulting poor forecast performance (e.g., LPLs of minus infinity). Clearly this value does not induce adequate shrinkage in larger models and care must be taken to avoid such regions of the hyperparameter space. It is worth noting that for the TVP-RW-SVD specification, values of κ exceeding 0.05 are already too large. This is because κ effectively determines the upper bound of the size of a single shock within a given point in time and, hence, larger values imply excessive amounts of time-variation and elevated risks of overfitting.
However, provided κ is kept small, we find a high degree of prior robustness. If we consider point forecast performance, we find that as long as the upper bound is specified between 0.01 and 0.1, point forecasts are only slightly affected by the choice of κ. No clear patterns emerge.
For one-step-ahead forecasts using the FULL model estimated using the g-prior, relative RMSEs seem to be inversely related to κ. But this does not carry over to the four-step-ahead horizon.
For longer-run forecasts, we observe that accuracy is largest if κ is set equal to 0.05. For the NKPC model, a similar U-shaped pattern arises, indicating that the optimal value of κ should be between 0.005 and 0.05. The only model that strongly profits from using a larger scaling parameter is the UCM. Here, we find that the best forecasting performance is obtained if κ = 1, a choice that seriously distorts predictive accuracy for larger models. In the case of the Minnesota prior, the specific choice of κ plays a limited role, with point forecasts of the full model being indistinguishable from each other for the one-quarter-ahead horizon and quite similar for the one-year-ahead horizon.
When the full predictive distribution is considered, we also find only limited differences in predictive accuracy for varying values of κ for models which feature a block-diagonal Z. In this case, and for the large-scale model, differences are typically quite small. This suggests that the precise value of κ, as long as it is not specified too large, plays only a minor role in impacting forecasting accuracy. Notice, however, that this does not carry over to smaller models and models which feature a lower triangular Z. In these cases, we find that using a smaller κ improves LPLs for both priors and forecast horizons.
When we consider the performance of the UCM models based on the full predictive density, we find that forecast accuracy appreciably increases with κ. This contrasts with the findings for the other models. This is due to the fact that with small values of κ, it becomes increasingly difficult to control for unobserved heterogeneity and the resulting model approaches a white noise specification. To sum up this discussion, we find that the precise value of κ plays only a limited role for forecasting accuracy when the large model is adopted, provided we avoid large values of κ which clearly lead to over-fitting problems. In contrast, in smaller models, the precise value of κ has a bigger impact on forecasting accuracy and the researcher needs to carefully select this hyperparameter.

Notes to Table C2: The table shows RMSEs with LPLs in parentheses below. Asterisks indicate statistical significance for each model relative to a random walk at the 1 (***), 5 (**) and 10 (*) percent significance levels.