Inducing Sparsity and Shrinkage in Time-Varying Parameter Models

Time-varying parameter (TVP) models have the potential to be over-parameterized, particularly when the number of variables in the model is large. Global-local priors are increasingly used to induce shrinkage in such models. But the estimates produced by these priors can still have appreciable uncertainty. Sparsification has the potential to remove this uncertainty and improve forecasts. In this paper, we develop computationally simple methods which both shrink and sparsify TVP models. In a simulated data exercise we show the benefits of our shrink-then-sparsify approach in a variety of sparse and dense TVP regressions. In a macroeconomic forecast exercise, we find our approach to substantially improve forecast performance relative to shrinkage alone.


Introduction
Time-varying parameter (TVP) regressions and Vector Autoregressions (VARs) have enjoyed great popularity among econometricians in recent years as a way of modelling the parameter change that occurs in many macroeconomic and financial time series variables. These are state space models which have been found to work well in forecasting (e.g. D' Agostino et al., 2013) and been successfully used for structural economic analysis in a changing environment (e.g. Cogley and Sargent, 2005;Primiceri, 2005). They are flexible and capable of modelling almost any nonlinear relationship between explanatory and dependent variables. However, this flexibility comes at a cost: TVP models can be over-parameterized and suffer from the curse of dimensionality, particularly when the number of potential explanatory variables is large. This can lead to very good in-sample fit, but poor out-ofsample forecast performance.
There is a large and growing literature that proposes various methods for overcoming these overparameterization concerns using Bayesian methods (see, among others, Frühwirth-Schnatter and Wagner, 2010; Belmonte et al., 2014;Kalli and Griffin, 2014;Kowal et al., 2017;Uribe and Lopes, 2017;Rockova and McAlinn, 2017;Koop and Korobilis, 2018;Bitto and Frühwirth-Schnatter, 2019;Eisenstat et al., 2019). These papers propose different approaches to obtain more precise inference. Much of this literature uses hierarchical global-local shrinkage priors. A key property of these priors is that they ensure shrinkage in the sense that they pull coefficients towards zero. However, they do not impose them to be exactly zero and, thus, estimation uncertainty remains. In contrast to shrinkage approaches, selection approaches seek to choose a single sparse specification. That is, they select a particular set of explanatory variables and, by doing so, impose coefficients on non-selected explanatory variables to be zero. 1 Which is better: shrinkage or sparsity? The answer to this question depends on the empirical application. In the case of constant coefficient regressions and VARs, there is debate among Bayesian econometricians as to whether models are sparse (in which case sparsification methods are appropriate) or dense (in which case shrinkage is appropriate). For instance, Giannone et al. (2017) considers a range of data sets in macroeconomics, microeconomics and finance and finds evidence mostly in favor of dense models, a finding reinforced by Cross et al. (2019). But why not do both? This is exactly what recent papers such as Hahn and Carvalho (2015) propose. That is, first shrinkage is done using a Bayesian global-local shrinkage prior and then sparsification is done on the resulting estimates.
Such an approach could add the benefits of sparsity (i.e. the reduction in estimation error that is important for improving forecasts) along with the benefits of shrinkage which are so useful with dense data sets. Until recently, one reason not to do both was computation. Bayesian inference with hierarchical shrinkage priors requires computationally-burdensome MCMC methods. Adding a second sparsification step can greatly increase the burden (i.e. this step often uses cross-validation methods for choosing key tuning parameters which can be computationally costly). However, a recent paper, Ray and Bhattacharya (2018), proposes a very simple algorithm, the signal adaptive variable selector (SAVS), for doing the sparsification step. This involves no tuning parameters and is computationally trivial. Ray and Bhattacharya (2018) provides a theoretical justification for SAVS and shows it to have good empirical performance in simulated and real data contexts.
The papers cited in the preceding paragraph all relate to constant coefficient regression or VAR models rather than the TVP state space models which are the focus of this paper. We develop Bayesian methods for inference and forecasting in TVP regressions and TVP-VARs which both shrink and sparsify. The shrinkage step can be done using any of the hierarchical shrinkage priors that have been used with TVP regressions. In this paper, we use the Dirichlet-Laplace prior (see Bhattacharya et al., 2015), a fully hierarchical variant of the stochastic search variable selection prior (see George and McCulloch, 1993;Ishwaran and Rao, 2005;George et al., 2008), the Horseshoe (see Carvalho et al., 2010), the Bayesian Lasso of Park and Casella (2008) and the Normal-Gamma prior of Griffin and Brown (2010). The sparsity step is done using the SAVS method of Ray and Bhattacharya (2018).
Another extension we make in this paper relative to the shrink-then-sparsify methods of Hahn and Carvalho (2015) and Ray and Bhattacharya (2018) is that we allow for uncertainty in the sparsified estimates. That is, Hahn and Carvalho (2015) and Ray and Bhattacharya (2018) take the posterior mean from the shrinkage step and use only this in the sparsification step. We sparsify every MCMC draw in the shrinkage step, thus allowing for parameter uncertainty. Our methods are illustrated with simulated and real data and we find them to improve estimation accuracy and forecast performance.
The remainder of this paper is organized as follows: Section 2 discusses various global-local shrinkage priors in the context of the regression model with constant coefficients. It describes how the sparsification strategy of Ray and Bhattacharya (2018) works in the regression model. Section 3 ex-ECB Working Paper Series No 2325 / November 2019 4 tends these methods to TVP regressions and TVP-VARs. Section 4 investigates the performance of our methods relative to non-sparsified alternatives using simulated data from a range of sparse and dense TVP regressions. Section 5 carries out a forecasting exercise using TVP-VARs. A comparison of forecasts which are both shrunk and sparsified to those which are only shrunk shows the benefits of doing both. Section 6 concludes the paper and a technical appendix provides further details on the specific prior setup and the posterior simulation algorithms.

Shrinkage and Sparsification in Regression Models
In this section we describe the shrinkage and sparsification methods for regression which we build on in this paper. In the next section, we will show how they can be adapted for dynamic regressions and multiple equation models such as VARs. Consider the regression model: for t = 1, . . . , T where y t is a scalar dependent variable, X t is a K × 1 vector of explanatory variables and β is a K-dimensional vector of regression coefficients. The errors are assumed to be independent and follow a zero mean Gaussian distribution with variance σ 2 ε .
When K is large relative to T , Bayesians increasingly use hierarchical priors so as to induce shrinkage. Global-local shrinkage priors are particularly popular (see, e.g., Polson and Scott, 2010).
These contain shrinkage that is both global (i.e. common to all parameters) and local (i.e. specific to each parameter). We consider priors which can be represented as scale mixtures of Gaussians. In particular, for the j th regression coefficient we assume: Global shrinkage is controlled by λ while φ j handles the shrinkage of coefficient j. f and g are mixing densities and many different choices have been proposed for them. In this paper, we consider the Horseshoe (HS) prior of Carvalho et al. (2010), the Bayesian Lasso (Lasso) of Park and Casella (2008), the Normal-Gamma (NG) prior of Griffin and Brown (2010), the Dirichlet-Laplace (DL) prior of Bhattacharya et al. (2015) and the Normal-mixture of Inverse Gamma (NMIG) prior of Ishwaran  and McCulloch (1993and McCulloch ( , 1997. All of these are global-local shrinkage priors and differ from one another only in the choices of f and g. In addition, and unless otherwise noted, we use a weakly informative inverted Gamma prior on σ 2 ε with hyperparameters d σ = e σ = 0.01. Using any of these global-local shrinkage priors, MCMC methods can be used to carry out posterior inference and calculate the posterior mean,β. This estimate has been shrunk, but not sparsified. Even though many elements ofβ will be near zero, they will not be precisely zero and there will be estimation uncertainty associated with them. Sparsification, as used in Hahn and Carvalho (2015) and Ray and Bhattacharya (2018), proceeds by takingβ and setting small elements in it to zero. We first define the SAVS estimate and then offer some explanation and motivation for it. The SAVS estimate is with X j denoting the j th column of a T ×K matrix X = (X 1 , . . . , X T ) , (x) + = max(x, 0) and sign(x) returns the sign of x. Note that this is a soft-thresholding approach where all values of γ j below a certain value are set to zero and that it only acts on the posterior mean.
The sparsified estimate depends on tuning parameters, κ j , which determine the thresholds for each coefficient. Various approaches to selecting these have been proposed in the literature including computationally-intensive approaches such as cross-validation. However, a recent paper, Ray and Bhattacharya (2018), comes up with a surprisingly simple solution. This is to set: This choice implies a penalty for the j th variable which is ranked in inverse-squared order relative to the magnitude of the j th coefficient. With this choice of thresholds, the SAVS estimate is trivial to calculate.
To provide some motivation for the SAVS estimate note that (3) can be obtained by first solving an optimization problem akin to the Lasso: Equation (5) tries to find a sparse coefficient vector γ that is close toβ while introducing a penalty in case of non-zero elements in γ.
The typical way to solve this optimization problem is using a coordinate descent algorithm (Friedman et al., 2007). But, as shown in Ray and Bhattacharya (2018), if you initialize this algorithm at β and then do one iteration you get precisely the simple algorithm described in (3) and (4). It is also noted in Ray and Bhattacharya (2018) that convergence almost always occurs after one iteration and, hence, stopping after one iteration is a sensible thing to do.
One key shortcoming of computing the SAVS estimate is that uncertainty quantification about γ is not possible and computing non-linear functions of γ calls for Monte Carlo integration techniques. Ray and Bhattacharya (2018) highlight that one potential solution to this issue is to replaceβ with a draw from the full conditional posterior distribution of β. This is an insight we build upon in the context of the TVP models which are the focus of this paper.

Shrinkage and Sparsification in TVP Models
In this section, we develop methods for shrinkage and sparsification in state space models such as the TVP regression and the TVP-VAR. This is achieved using the non-centered parameterization of Frühwirth-Schnatter and Wagner (2010). We emphasize that the algorithms below do the sparsification at each draw from the MCMC algorithm, allowing for treatment of uncertainty in the shrinkage step.
Thus, the algorithms are averaging over different sparsified estimators in a manner similar to Bayesian model averaging.

The TVP Regression Model
The TVP regression model used in this paper takes the form: where all definitions are the same as in (1) except that β t are dynamic (time-varying) regression coefficients which follow a random walk with w t ∼ N(0 K , V ), where V = diag(v 1 , . . . , v K ) denotes ECB Working Paper Series No 2325 / November 2019 the variance-covariance matrix of the state innovations while v j (j = 1, . . . , K) is a process innovation variance associated with the j th coefficient.
The non-centered parameterization of this model is given by: with the j th element ofβ t given byβ jt = . This equation can be written as: and denotes element-wise multiplication.
Conditional on knowing the full history of the states inβ t , (6) resembles a standard regression model with a (partially) latent covariate vector Z t .
Well-developed MCMC methods exist to carry out Bayesian posterior and predictive inference in state space models such as the TVP regression model under various priors. In this paper, we simulate the full history of the normalized dynamic regression coefficients {β t } T t=1 using the forwardfiltering backward-sampling algorithm proposed in Carter and Kohn (1994) and Frühwirth-Schnatter (1994). Conditional onβ t , (6) is a standard regression model, implying that we can simulate α from a Gaussian full conditional posterior distribution and σ 2 ε from an inverted Gamma distribution. The corresponding moments take standard forms and are presented in Appendix B.
We propose to do shrinkage on α using the global-local mixture priors mentioned in the previous section and described in Appendix A. That is, conditional on a draw of the full history of the states, {β t } T t=1 , we have the regression model given in (6), and shrinkage can be done exactly as described in the preceding section. For each of the global-local mixture priors we consider, MCMC methods for drawing α and σ 2 ε , conditional on draws of the states exist. For the Dirichlet-Laplace prior we follow the methods of Bhattacharya et al. (2015). For the NMIG specification, we adopt the algorithm proposed in Ishwaran and Rao (2005) while for the Horseshoe, the MCMC algorithm developed in Makalic and Schmidt (2016) is used. Since the Normal-Gamma prior nests the Bayesian Lasso, we adopt the algorithm put forth in Griffin and Brown (2010)  Every draw of α, denoted as α (n) , from any of the MCMC algorithms is sparsified using SAVS.
Applying the SAVS estimator in (3) to each draw from the posterior of α yields: where Z j denotes the j th column of Z = (Z 1 , . . . , Z T ) , κ j = |α proposed procedure as an approximate MCMC algorithm which draws from the sparsified conditional posterior p(γ|α, Z). 2 Hence, forecasts produced will average over different sparsified models. That is, one MCMC draw will lead to one particular sparsified model which is used for forecasting, another draw may choose another sparsified model to produce forecasts. Hence, what we are proposing is similar in spirit to Bayesian model averaging. Of course, it is possible to use the SAVS algorithm directly on the posterior mean of α as is done by Hahn and Carvalho (2015) and Ray and Bhattacharya (2018). This would be similar in spirit to a Bayesian model selection strategy where a single sparsified model was chosen for forecasting. But this would ignore model uncertainty and, in addition, would be problematic since Z t is partially latent.
Another point worth emphasizing about our algorithm is that it is fast. Relative to the computational time required to do MCMC, adding the SAVS step increases the computational burden by a trivial amount. For any empirical specification where MCMC is possible, our proposed algorithm is also possible. Of course, if K is too large, then MCMC methods may be computationally infeasible.
In such a case, variational Bayesian methods may be a practical alternative (see Koop and Korobilis, 2018). But with variational Bayes methods, the SAVS algorithm would be applied on the approximate posterior mean and model uncertainty ignored. 3

The TVP-VAR
The shrink-then-sparsify algorithm we propose for the TVP regression can be extended to handle the TVP-VAR in a straightforward fashion. The idea is to transform the TVP-VAR so that the error 2 The algorithm is approximate since σ 2 ε does not play a role in the SAVS algorithm. If desired, after each sparsification, one could take a draw of σ 2 ε conditional on the sparsified estimates. 3 It would be possible to surmount this drawback of variational Bayes by first using variational Bayes to obtain an approximation to the posterior and then applying the SAVS algorithm to draws from this approximation. But this would be computationally demanding, thus undermining the main advantage of variational Bayes.

covariance matrix in the measurement equation is diagonal. Then the TVP regression algorithm of the preceding sub-section can be applied one equation at a time. Equation-by-equation estimation of
VARs is done in several recent papers using transformations similar to the one used here (see, e.g., Carriero et al., 2016;Kastner and Huber, 2017;Koop et al., 2019) and the reader is refered to these papers for further details about the computational advantages of this approach. With macroeconomic data it is often important to add stochastic volatility, which leads us to the TVP-VAR-SV specification described in this section.
Let y t be an M × 1 vector of endogenous variables for t = 1, . . . , T . The TVP-VAR-SV can be written as: where X t = (y t−1 , . . . , y t−P , 1) contains the P lags of y t and an intercept, β t is the vector K = M (M P + 1) coefficients at time t which is assumed to evolve according to a multivariate random walk. The errors are independent over time with Let U t denote a lower uni-triangular matrix and H t = diag(e h 1t , . . . , e h M t ). The M (M − 1)/2 free elements in U t follow independent random walks while the h jt 's are log-volatilities that follow AR (1) processes, Here, we let µ j denote the unconditional mean, ρ j the persistence parameter and σ 2 η the error variance of the log-volatility process. The prior specification on the parameters of the log-volatility equation closely follows Kastner and Frühwirth-Schnatter (2014). Specifically, we use a zero mean Gaussian prior with variance 10 2 on µ j , a Beta prior on ρ j +1 2 ∼ B(25, 5) and a Gamma prior on σ 2 η ∼ G(1/2, 1/2).
The Gamma prior translates into a Gaussian prior on ±σ η with zero mean and unit variance. In the MCMC algorithm, the full history of h jt as well as the parameters of equation (9) where η it and η jt are independent for i = j, β it denotes the elements of β t in the i th equation and We then write the TVP-VAR-SV using the non-centered parameterization. For equation i we obtain: Here, we let Since the errors in the different equations are independent of one another, estimation of one equation at a time using the algorithm of the preceding sub-section, including the SAVS step detailed in (7), can be done. Computation is also sped up since parallelization is feasible. Note also that, since the coefficients in U −1 t are appearing as regression coefficients in (10), these can also be shrunk and sparsified. In large TVP-VARs, where there are many such error covariance terms, this is potentially beneficial for forecasting purposes. Notice that we do not only obtain a sparse error covariance matrix but also allow for testing whether the corresponding free elements are time-varying or constant.

ECB Working Paper Series No 2325 / November 2019 11
In this section, we present evidence on the performance of the proposed methodology using artificial data generated from different TVP regression models. Across the different data generating processes (DGPs), the covariates are drawn from a Uniform distribution bounded between −1 and 1. The β t 's are generated using the non-centered parameterization with β 0 ∼ N(0 K , 0.1 2 I K ) and ± √ v j ∼ N(0, 0.1 2 ), j = 1, . . . , K, while differing percentages of the elements in α are randomly set to zero.
The measurement error variance σ 2 ε is set equal to 0.1 2 .
Before presenting results using repeated samples, the main features of sparsification are illustrated in Figure 1. The results in the three panels of the figure are based on the Horseshoe prior and use three different single artificial data sets obtained by simulating T = 400 observations from a large (K = 30) dynamic regression model. Figure 1(a) plots posterior features of β jt against time for a case where it is zero (i.e. the DGP is one where j th regressor is not selected) using a non-sparsified and sparsified estimator. Note that the sparsified estimator is precisely correct, it sets β jt = 0 with probability one. Thus, it exactly coincides with the true value and cannot be seen in Figure 1(a). The non-sparsified posterior distribution, although the posterior mean is very close to the correct value, has a credible interval that is non-negligible. This reflects estimation uncertainty and could spill over into poor forecast performance using the non-sparsified posterior. The performance of the SAVS algorithm when β jt is a non-zero constant (i.e. the DGP is one where β jt = β jt−1 for all t) is shown in Figure 1(b). In this case, the posterior distributions of the sparsified and non-sparsified models almost coincide. Notice, however, that the credible sets are constant over time for the sparsified model, indicating that the corresponding element in √ V is set equal to zero throughout all iterations of the MCMC algorithm. In contrast, Figure 1(c) illustrates a case where β jt is non-zero and time-varying.
Notice that the sparsified and non-sparsified posterior distributions are close to being identical. In this case, it is not desirable to sparsify the corresponding elements in α and the SAVS algorithm is not doing so. Thus, regardless of whether a coefficient is zero, a non-zero constant or time-varying, this figure indicates that our methods estimate it well. They work better than the non-sparsified alternative in cases where there is sparsity and equally well in non-sparse cases.   Table 1 are averages taken over three dimensions: i) the 100 simulated data sets, ii) time and iii) the K elements of β t . Table 1 shows the value of sparsification, particularly with sparse DGPs. With the latter, mean absolute errors (MAEs) are lower than their non-sparsified counterpart for every prior and choice for T and K. But even in moderately dense specifications, sparsification lowers MAEs in most cases. In the dense specification, sparsification does not improve upon the single best performing non-sparsified model specification. However, in that situation, accuracy differences are found to be negligible.
The benefits of shrinking and sparsifying increase with the number of explanatory variables. Note, for instance, that the unsparsified Flat prior model does not perform that poorly when K = 3 and 15, while the unsparsified NMIG model shows the best performance. Notice that in this situation, accuracy differences across the sparsified and non-sparsified NMIG specification are, however, quite small.
From this discussion it is apparent that identifying a default prior choice is difficult. One key take away from this analysis, however, is that if the DGP is sparse, flexible shrinkage specifications such as the HS, the DL and the NMIG prior in combination with the SAVS algorithm provide accurate parameter estimates. Overall, the table tells a story of the importance of both shrinkage and sparsity, especially in large models, with the precise choice of shrinkage prior being of lesser importance.
In the next step, we assess how well the SAVS algorithm identifies true zeros in α. Table 2 shows average hit rates that measure the percentage of correctly estimated zeros using the SAVS algorithm.
From this table, we observe that irrespective of the priors used, our approach works well in identifying the true level of sparsity. For sparse situations, the fraction of correctly identified zeros is often above 95% for most shrinkage priors and model sizes considered. In the case of a Flat prior, we observe values just above 90%, which is remarkable but still well below the percentages observed for the different shrinkage specifications under scrutiny. This slightly weaker performance can be traced back to the fact that without shrinkage, values in α are not pushed to zero and the corresponding penalty κ j is too small. Consistent with the findings in Table 1, we find no discernible differences in performance across the different shrinkage priors, with all of them displaying a strong performance. In fact, in a sparse setting with K = 30, the SAVS algorithm identifies almost all zeros correctly, with hit rates being above 99%.
To sum up, this discussion highlights that sparsification improves estimation accuracy. These improvements tend to increase with model size and the level of sparsity of the DGP. Among the set of competing shrinkage priors, we find no single best performing specification. In terms of correctly predicting zeros in α, we found that SAVS works well across all shrinkage priors considered, often correctly identifying above 99% of the zeros. At this point, and before proceeding to the empirical application, it is worth emphasizing that our analysis only considers whether our shrink-then-sparsify approach improves accuracy of point estimates, ignoring a potential bias-variance tradeoff. One key ECB Working Paper Series No 2325 / November 2019 finding is that applying SAVS never significantly decreases estimation accuracy and correctly predicts a large fraction of true zeros. In light of Figure 1, this indicates that, by zeroing out shrunk coefficients, our approach strongly reduces parameter uncertainty.

Forecasting US Macroeconomic Variables
In this section we present results from a forecasting exercise using US quarterly macroeconomic data taken from the FRED-QD database (see McCracken and Ng, 2016)  We use the TVP-VAR-SV of Sub-section 3.2 combined with the same set of global-local shrinkage priors as in the preceding section. The only specification we do not consider here is the TVP-VAR-SV with a flat prior since this model performs poorly in out-of-sample forecasting and large dimensions.
For each prior, we use non-sparsified and sparsified versions of the model in order to produce the forecasts. We forecast with small (M = 3), medium (M = 8) and large (M = 20) data sets and set the lag length equal to 2. Thus, the dimension of the state space in the TVP-VAR-SVs ranges from being moderate to huge. Our forecast evaluation begins in 1997Q1 and runs to the end of the sample. We use root mean squared forecast errors (RMSEs) to evaluate the quality of the point forecasts and average log predictive likelihoods (LPLs) to evaluate the quality of predictive densities. Both are benchmarked relative to a VAR-SV with DL prior, a specification that works well for US macroeconomic data (see Kastner and Huber, 2017). This is identical to the TVP-VAR-SV with DL prior except that the DL prior now applies directly to the constant VAR coefficients while V β i and v u ij are set equal to zero for all i, j. The VAR is transformed to allow for equation-by-equation estimation as described in Sub-section 3.2.
Before presenting the results of our forecasting exercise, we present Figure 2 which sheds light on which variables our algorithm is choosing to predict the focus variables. This figure is produced using the large data set and the HS prior. Previously, we have discussed how doing sparsification for each MCMC draw shares similarities with Bayesian model averaging. For one draw a certain sparsified model will be used to forecast, at another draw a different model will be used. This feature allows us to calculate posterior inclusion probabilities (PIPs) for each variable. The PIP for a given coefficient is the proportion of MCMC draws for which the coefficient is not set to zero. Figure 2 is a heatmap of these PIPs at the end of the sample. Remember that, in the non-centered parameterization of the TVP-VAR-SV (see equation (11)  there is only one coefficient which is always selected (i.e. has a PIP of one). This is the first lag of the 1-year Treasury Bill rate in the equation for the Fed Funds rate. However, an appreciable number of other predictors have PIPs that are substantially above zero but much less than one. In terms of the error covariance matrix, a similar pattern emerges. There is only one error covariance term which is non-zero in every MCMC draw. 4 This is the covariance between the errors in the equations for two different inflation measures. However, there are several other error covariances with PIPs that are substantially above zero, even if they are below one. We stress that such a finding would not be possible if we were to use the SAVS algorithm directly on the posterior mean as opposed to using it on each MCMC draw. In the former case every PIP would be either zero or one with no values in between.
These patterns are consistent with those found in Giannone et al. (2017) who conclude "there seems to be a lot of uncertainty about whether certain predictors should be included in the model, which results into their selection only in a subset of the posterior draws. These findings reflect a substantial degree of collinearity among many predictors that carry similar information, hence complicating the task of structure discovery. In sum, model uncertainty is pervasive and the best prediction is obtained as a weighted average of several models." These features seem to be exactly what our algorithm is uncovering in an automatic fashion.
Finally, it is worth noting that there is evidence of time variation in several of the coefficients and our algorithm is automatically deciding which ones to allow to be time-varying. That is most of the PIPs which are appreciably above zero in the top half of the figure are also above zero in the bottom half. This pattern indicates a non-zero coefficient zero which is time varying. But our method also allows for a coefficient to be non-zero but constant. There are some cases which provide evidence of this. For instance, in the GDP growth equation the first lag of SP500 stock returns has a PIP which is appreciably above zero in the top half of the figure, but is much closer to zero in the bottom half of the figure. This pattern indicates support for a constant coefficient on this predictor.
The evidence in Figure 2 suggests that shrinking then sparsifying is working in a sensible fashion.
But the key test of our methodology is how well it forecasts. Table 3 presents the results of our forecasting exercise. A comparison of each set of sparsified forecasts to its non-sparsified counterpart shows the benefits of our shrink-then-sparsify strategy, particularly in large models. For M = 8 and M = 20, sparsification leads to substantial improvements in both RMSEs and LPLs in almost every case. These improvements are particularly noticeable for GDP forecasting for the one-step-ahead forecasts. In general, the benefits of sparsification are largest when using the DL or Lasso priors. For M = 3 the benefits of sparsification are less pronounced. In terms of RMSEs, there seems to be no benefits of sparsification, although it does lead to slight improvements in the density forecasts even for this already fairly parsimonious case. This smaller accuracy premium from sparsification can be traced back to the fact that, in small models, the increases in the predictive variance that arise from posterior uncertainty surrounding shrunk estimates are small relative to the variance contribution arising from the reduced-form shocks. In larger models, parameter uncertainty eventually accumulates and this seems to be detrimental for forecasting accuracy.
In relation to the benchmark VAR-SV model, it is interesting to note that it is inferior to the TVP-VAR-SV models for the small and medium data sets. Clearly, addition of time-variation in the VAR coefficients helps improve forecasts in these cases. However, in the large data set, the evidence is mixed. In this case, the RMSEs produced by the TVP-VAR-SV are substantially better than those produced by the VAR-SV. However, the density forecasts are not. This could be due to the fact that there is typically a tradeoff between model dimension and parameter change. In small models, there is often a need for a high degree of parameter change to adequately fit patterns in the data and alleviate potential omitted variable biases. But in larger models, the information provided by the additional variables can fit these patterns, leaving less of a role for parameter change. Thus, in high dimensional cases the VAR-SV might be adequate and the extra flexibility provided by a TVP-VAR-SV may not be required. Of course, if the correct specification has a zero coefficient, the non-sparsified approach would try and estimate the time-varying coefficient to be constant over time. But, as illustrated in Figure 1, estimation uncertainty (although reduced) would still exist which could potentially hurt the forecasting performance of our approach. Sparsification as done in this paper clearly helps, but in the large data set there are still some cases where the VAR-SV is superior. In such cases, a simple extension of our shrink-then-sparsify approach could help. In this paper, we have focused on sparsifying α in equation (  Notes: Numbers in parentheses refer to the average log predictive likelihoods (LPLs) vis-á-vis the BVAR-SV with a DL prior. DL refers to a TVP-VAR-SV with a Dirichlet-Laplace prior, Lasso to the Bayesian Lasso, NG to the Normal-Gamma prior, HS to the Horseshoe, and NMIG to the Normal-mixture of Inverse Gamma prior.
The results in Table 3 highlight that, when the full hold-out period is considered, sparsification improves predictive accuracy relative to a non-sparsified model specification. The magnitude of these improvements, however, depends on model size. In the next step, we ask whether accuracy differences could also be specific to certain periods in time. To this end, Figure 3(a) shows the evolution of the log predictive Bayes factor between the sparsified and non-sparsified large-scale TVP-VAR-SV with the HS prior over the hold-out period. 5 This Bayes factor is obtained by evaluating the onestep-ahead predictive density for the three focus variables jointly after integrating out the remaining variables. To investigate whether the gains in density forecasting performance stem from capturing higher order moments in the predictive distribution or from a more precise point forecast, Figure 3(b) shows cumulative squared one-step-ahead forecast errors averaged across the focus variables over time. Notes: The log-predictive Bayes factor between the sparsified and non-sparsified model is obtained by considering the joint one-step-ahead predictive density for the three focus variables and the squared forecast errors are averages across the one-step-ahead forecast errors for the focus variables. The black line in panel (b) refers to the sparsified squared forecast error while the red line denotes the non-sparsified model. The gray shaded areas refer to NBER reference recessions in the US.

Conclusions
Global-local shrinkage priors have enjoyed great popularity in over-parameterized regressions and VARs involving large numbers of variables. And, increasingly, they have been used with TVP versions of these models which are potentially even more over-parameterized. Use of such priors can potentially reduce estimation error and improve forecasts. However, estimation error is not completely eliminated and it is possible that further improvements in forecasting performance can be achieved by adding an additional sparsification step to shrunk estimates. In this paper, we have developed methods for doing so. In an artificial data exercise, we have shown that our shrink-then-sparsify approach to TVP regression leads to more accurate estimates for a variety of DGPs. Particularly large gains are found in sparse DGPs. In a macroeconomic forecasting exercise, adding sparsification to shrinkage also leads to substantial improvements in forecast performance.

Appendix A Global-local Shrinkage Priors
The first four sub-sections of this appendix provide relevant details on the prior setup, briefly discussing key features of the used priors, hyperparameter choices used, and relevant information necessary to perform posterior inference.
We use a Dirichlet distributed prior with intensity parameter a on ξ j . Finally, ζ is a global shrinkage term that follows a Gamma distribution. Notice that the relationship between this prior hierarchy and the general form provided in equation (2) can be seen by defining φ j = ω j ξ 2 j and λ = ζ 2 . Bhattacharya et al. (2015) show within the stylized normal means problem that the optimal value of a is specified to be (2K) −(1+ϕ) with ϕ being a positive number close to zero. Since this hyperparameter plays a crucial role in determining the shrinkage behavior of the DL prior, we estimate it using a prior which is a uniform distribution that is bounded between (2K) −1 and 1/2.
Posterior simulation can be carried out using a slightly modified variant of the MCMC algorithm proposed in Bhattacharya et al. (2015). The full conditional posterior distribution of ω j follows an inverse Gaussian distribution: The global shrinkage parameter ζ follows a generalized inverted Gaussian (GIG) distribution, Moreover, we draw the second set of local scaling parameters ξ j by introducing auxiliary variables T j that follow a GIG distribution: T j |a, α j ∼ GIG(a − 1, 1, 2|α j |).
We then set ξ j = T j / 2K i=1 T i to obtain a valid draw from the full conditional posterior of ξ j .
To simulate from the conditional posterior of a, we employ a Metropolis Hastings algorithm with a Gaussian proposal distribution truncated between (2K) −1 and 1/2. The variance of the proposal distribution is tuned during the first 20 percent of the burn-in stage of the MCMC sampler such that the acceptance rate is between 20 and 40 percent.

A.2 The Normal-Gamma Prior and the Lasso
As compared to the DL prior, the NG prior proposed in Griffin and Brown (2010), consists of a single group of idiosyncratic scaling factors φ j and a global shrinkage parameter λ = 1/λ. We assume that each α j follows a zero mean Gaussian distribution a priori: α j |φ j ,λ ∼ N(0, φ j ), φ j |λ ∼ G(ϑ, ϑλ/2),λ ∼ G(dλ, eλ).
Here, we let ϑ denote a hyperparameter that controls the tail behavior of the prior, with smaller values of ϑ leading to heavier tails and increasing mass is placed on zero while larger value do the opposite.
dλ and eλ are hyperparameters that control the overall degree of shrinkage, with values close to zero implying heavy shrinkage towards zero.
One key feature of the NG prior is that it nests the Bayesian Lasso of Park and Casella (2008) by setting ϑ = 1. Since ϑ plays a crucial role, we follow Griffin and Brown (2010) and  and introduce an Exponential prior on ϑ: ϑ ∼ Exp(ϑ).
ϑ is set equal to 1, pushing the prior towards the Bayesian Lasso. Moreover, we set dλ = eλ = 10 −4 , implying a disperse prior onλ and thus being consistent with heavy shrinkage (by allowing large values ofλ).
The hierarchical structure of the prior yields closed-form full conditionals for φ j andλ. The local scaling parameters φ j follow a GIG distribution: For the global shrinkage parameter, we obtain a Gamma-distributed full conditional posterior distribution:λ Finally, we obtain draws from the conditional posterior of ϑ by setting up a random walk MH algorithm in terms of log ϑ (see Griffin and Brown, 2010).

A.4 The Normal-mixture of Inverse Gamma Prior
The NMIG prior of Ishwaran and Rao (2005) extends the original SSVS prior proposed in McCulloch (1993, 1997) along several dimensions. To set the stage, we use a mixture of Gaussians prior distribution on α j : where δ j denotes a Bernoulli random variable with prior probability P rob(δ j = 1) = p while c is a constant close to zero and τ 2 j is a coefficient-specific scaling factor. Following Ishwaran and Rao (2005), we specify an inverse Gamma prior on τ 2 j and a Beta distributed prior on p: with d τ , e τ , d p and e p denoting hyperparameters. Notice that this specification implies conditional prior independence between the indicators δ j . However, the common prior inclusion probability p serves as a common factor, implying that marginally, the indicators are dependent. Ishwaran and Rao (2005) notice that after integrating out τ 2 j and p, the two components in the prior follow t-distributions. The hyperparameter d τ controls the degrees of freedom of the marginal prior while the variances are given by ce τ /d τ (for the spike component) and e τ /d τ (for the slab component).
In the empirical applications, we set e p = d p = 1, implying a Uniform prior on p and c = 2.5/10 5 .
Moreover, we set d τ = 5, leading to 10 degrees of freedom and e τ = 4. This is the benchmark prior setup as specified in Malsiner-Walli and Wagner (2011).
For this prior specification, all conditional posterior distributions are available in closed form. The full-conditional posterior of δ j follows a Bernoulli distribution with posterior probability p j given by: p j = P rob(δ j = 1|α j , τ 2 j , p) = .
The scaling factors τ 2 j follow an inverted Gamma distribution Finally, the posterior distribution of p is a Beta distribution: