Value-at-risk in the presence of asset price bubbles

ABSTRACT In this study, we respond to the criticism that the value-at-risk (VaR) measure fails during financial crises and is only applicable during periods without asset price bubbles. We propose a new dating mechanism, based on the work of Phillips, Shi, and Yu (2015), to date-stamp the origination and termination of asset price bubbles. Our method relaxes the minimum bubble duration constraint in the original model, and the empirical application statistically identifies the bubble periods in nine stock markets (Australia, Canada, China, Germany, Spain, Hong Kong, Japan, the United Kingdom, and the United States). We choose two of the most widely adopted VaR models (RiskMetrics and RiskMetrics 2006) to test the performance. Our results show that the RiskMetrics model fails in most periods, whereas the RiskMetrics 2006 model performs efficiently in the periods with asset price bubbles. These results show that the criticism that all VaR models fail during crises is invalid.


Introduction
Financial markets have experienced several crises over the last two decades. The Asian financial crisis in 1997, the dot-com bubble burst in 2001, the subprime mortgage crisis in 2008, and the European sovereign debt crisis in 2009 have driven financial institutions to use better measures to manage downside risk. The losses caused by these financial crises were tremendous; the 2008 subprime mortgage crisis cost the US economy up to USD 14 trillion (Atkinson, Luttrell, & Rosenblum, 2013), which is equivalent to the average annual output of the entire US economy. Furthermore, the global economy has become more integrated and cross-border financial flows have steadily increased. This widens the contagion effect and further complicates the effects of a financial crisis.
Financial crises are predominantly caused by the burst of asset price bubbles. An asset price bubble is formed when the asset's price deviates from its fundamental value. During the bubble booming period, an asset's price grows at an explosive rate. Blanchard and Watson (1982) suggest that asset bubbles last only until the market realizes it and makes corrections. The correction is usually associated with heavy selling pressure that causes a plunge in the stock price. This process is commonly referred to as a bubble burst. However, it is difficult to identify bubbles and date-stamp the bursts.
CONTACT Raymond Kwong raymond.kwong@cpce-polyu.edu.hk S1428, South Tower, PolyU West Kowloon Campus, Yau Ma Tei, Kowloon, Hong Kong
The losses from the financial crises have led practitioners and regulators to actively look for risk models that measure downside risk. Over the past decade, value-at-risk (VaR) has emerged as one of the most popular methods for measuring the downside risk of financial investments. VaR estimates the minimum loss in a portfolio's value that will occur over a given period with a specified probability. After the subprime mortgage crisis in 2008, practitioners and regulators criticised the VaR models for failing to reveal the underlying risk, which led many financial institutions to suffer unexpected losses well above the VaR value and resulted in a credit crunch. However, most of this criticism is not backed by statistical analysis, which can lead to false conclusions about the efficiency of VaR models.
Although a considerable body of research criticises VaR for performing poorly during turbulent financial periods (see Mertzanis, 2013; Lockwood, 2015), most of the literature classifies crisis and non-crisis periods by subjective judgment rather than by a clear statistical criterion. Halbleib and Pohlmeier (2012) proposed a data-driven VaR method that combines quantile forecasts to improve VaR's performance during a crisis. They formed three portfolios to represent small-, middle-, and large-cap stocks in the Dow Jones index and evaluated the performance in the calm period (1 January 2007 to 17 July 2007), crisis period (18 July 2007 to 1 July 2009), and crash period (1 September 2008 to 1 July 2009). The results show that the method performs well in the crisis periods. However, the crisis and non-crisis periods were defined arbitrarily without a statistical basis. Chen, Gerlach, Lin, and Lee (2012) studied the RiskMetrics models (Riskmetrics, 1996) and several generalised autoregressive conditional heteroscedasticity (GARCH) (Bollerslev, 1986) family VaR models under Bayesian forecasting tests in the Japan, Hong Kong, Korea, and US markets. Their study ranks the RiskMetrics model last among all models, and shows that all VaR models underestimate the risk level during a crisis period. However, similar to the work of Halbleib and Pohlmeier (2012), the crisis and non-crisis periods were defined arbitrarily, and the crisis periods were assumed to be the same across stock markets with different characteristics, which is debatable.
Our study contributes to the literature that responds to these limitations by empirically testing the VaR models' performance in periods with and without asset price bubbles, particularly in the periods of financial crisis after the bubbles burst. Against the background of criticism of VaR models, we perform a series of backtests to statistically evaluate their performance. Following Phillips, Shi, and Yu (2015), we use the backward supremum augmented Dickey-Fuller (BSADF) test to date-stamp the origination and termination of the bubbles. However, we modify the date-stamping algorithm to suit backtesting. We define three periods for the tests: the pre-bubble period, bubble period, and post-burst period. The bubble period is bounded by the bubble's origination and termination dates, and the pre-bubble and post-burst periods are defined as the 2 years before the origination date and the 2 years after the termination date, respectively. Our date-stamping method allows researchers to evaluate financial models that may behave differently at different stages of a bubble.
Though VaR is a popular means to quantify downside risk, there is little consensus on the preferred VaR model. Köksal and Orhan (2013) tested a VaR model based on a simple autoregressive conditional heteroscedasticity (ARCH) setting in the developed and emerging markets during financial crisis. Their results show that VaR performs more inefficiently in developed countries than in emerging countries and, overall, fails to reveal the downside risk. However, their results may only be true for the ARCH-based VaR model. Bams, Blanchard, and Lehnert (2017) compared VaR's performance in the S&P 500, Dow Jones Industrial Average, and Nasdaq 100 indices. The results show that the VaR based on a historical volatility measure outperforms other parametric VaR models. Lee, Chiou, and Lin (2006) employed the dynamic conditional correlation (DCC) estimators of Engle (2002) to estimate a portfolio's VaR. The authors found that VaR performance at a portfolio level may be unreliable due to the difficulty in forecasting the dynamic correlation among assets. For better modelling of a portfolio's volatility, Chiriac and Voev (2011) developed a multivariate vector fractionally integrated ARMA (VARFIMA) process that models the long memory characteristics of financial volatility. Furthermore, Engle, Ledoit, and Wolf (2019) proposed a new standard of DCC model that applies nonlinear shrinkage in the estimation of portfolios of large dimensions. Recent studies such as Fiszeder, Fałdziński, and Molnár (2019) and Law, Li, and Yu (2020) incorporate state-of-the-art volatility estimation methods in the calculation of portfolio VaR. The results show that the choice of volatility model significantly impacts the accuracy of the VaR estimates.
We follow Campbell and Shiller (1989) and Phillips et al. (2015) in considering explosive behaviour in the log price-dividend ratio for the detection of asset price bubbles. To understand the VaR performance in different markets, we examine nine stock markets (Australia, Canada, China, Germany, Spain, Hong Kong, Japan, the UK, and the US) and proxy each of the market portfolios by its respective market index. In light of this, our analysis focuses on univariate VaR models and uses the two widely adopted univariate VaR models, RiskMetrics (Riskmetrics, 1996) and RiskMetrics 2006 (Zumbach, 2007), in our empirical tests in response to the criticism of VaR failure. To evaluate the performance of the VaR models in different periods, we conduct the unconditional coverage tests (Kupiec, 1995), the independence tests (Christoffersen & Pelletier, 2004), and the joint coverage tests. The empirical results of both the conditional and unconditional coverage tests suggested that the RiskMetrics 2006 model adequately described the downside risks in all the nine markets during the bubble periods. However, one weakness noted in the RiskMetrics 2006 model was its tendency to occasionally overstate the downside risk after market turbulence. Further, the RiskMetrics 2006 model reported zero VaR violations in post-burst periods in Australia, Canada, Japan, and the UK. The empirical results of the conditional coverage tests, unconditional coverage tests, and Fisher's exact tests suggested that the RiskMetrics 2006 model performs efficiently in most periods, rendering the criticism against VaR models statistically invalid.
The remainder of this paper is organised as follows: Section 2 discusses the rationale of using the log price-dividend ratio as an indicator of asset price bubbles. Section 3 explains the SADF and generalised SADF (GSADF) tests used to identify asset price bubbles and the date-stamping mechanism. Section 4 describes the RiskMetrics and RiskMetrics 2006 VaR models, as well as the backtesting methods. The empirical results of the GSADF tests, identification of the bubble periods, and VaR backtesting results for the sample markets are presented in Section 5. Section 6 concludes the study.

Asset prices and bubbles
Asset price bubbles are usually driven by speculative behaviours that bid up the asset prices beyond their fundamental values. The fundamental value of an asset is the sum of the discounted future cash flows. In the presence of bubbles, asset prices behave explosively. The log price of a security is defined as

$$p_t = p_t^f + b_t \tag{1}$$

where $p_t$ is the log price at time $t$, $p_t^f$ is the fundamental value, and $b_t$ is the bubble component (Campbell & Shiller, 1989).
In equation 1, $p_t^f$ is the fundamental component of the stock price, as it depends on the expected dividends. In contrast, $b_t$ is the speculative component, as it is based on future expectations of the stock price. The stock price will be explosive if the bubble component $b_t$ in equation 1 is non-zero.
The log price-dividend ratio is the summation of a series of log dividend differences and the bubble components (Campbell & Shiller, 1989; Phillips et al., 2015). If the log dividend is stationary after differencing, an explosive behaviour of the log price-dividend ratio would be caused by the presence of a non-zero bubble component $b_t$. Thus, we can detect whether asset price bubbles exist by examining any explosive behaviour in the log price-dividend ratio series and non-explosive behaviour in the first-differenced log dividend series.

SADF and GSADF tests
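Previous studies have suggested using a supremum of a set of recursive right-tailed augmented Dickey-Fuller (ADF) tests to detect the presence of stock bubbles (Dickey & Fuller, 1979). This SADF test applies the right-tailed ADF test with the null hypothesis of a unit root ($\phi = 0$) and the alternative hypothesis of an explosive root ($\phi > 0$).
The regression model used in the SADF test is

$$\Delta y_t = \mu + \phi y_{t-1} + \sum_{i=1}^{k} \psi_i \Delta y_{t-i} + \epsilon_t \tag{2}$$

where $y_t$ is the series under test, $\mu$ is the intercept, $\psi_i$ are the coefficients on the lagged differences, $k$ is the lag order, and $\epsilon_t$ is the random error.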
The SADF test begins by testing the first $r_0$ fraction of the observations. The ADF test is then repeated on forward-expanding samples until the sample fraction has increased from $r_0$ to 1; this sequence is denoted by $ADF_{r_0}^{1}$. Each regression in the forward sequence starts from observation 1 and ends at $\lfloor T r_w \rfloor$, where $\lfloor \cdot \rfloor$ is the integer part of the argument, $T$ is the total number of observations, and $r_w \in [r_0, 1]$ is the fraction of the observations. The SADF statistic is

$$SADF(r_0) = \sup_{r_w \in [r_0, 1]} ADF_{r_w}$$

where $ADF_{r_w}$ is the ADF statistic computed on the sample ending at $\lfloor T r_w \rfloor$. The corresponding asymptotic distribution of the SADF test is discussed by Phillips, Shi, and Yu (2014). Under the null hypothesis that the true process is a random walk without drift, the asymptotic distribution of the SADF test statistic is a functional of a Wiener process $W$ (equation 3). We determine whether the behaviour of the data series is explosive by comparing the SADF statistic of the data series with the asymptotic distribution of the Dickey-Fuller t-statistic in equation 3.
We perform a backward ADF (BADF) test to date-stamp the explosion. The BADF test performs the ADF test repeatedly by fixing the starting point of the sample at the first observation and rolling the ending point from $\lfloor T r_0 \rfloor$ to $T$. For example, if the testing sequence starts from $r_1$ ($r_1 = 0$ in the BADF test) and ends at $r_2$, the corresponding BADF test statistic would be $BADF_{r_1}^{r_2}$. The BADF test statistic is denoted by $BADF_{r_2}$ because the first observation is always the starting point.
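The recursion described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the lag order is fixed at $k = 0$ for brevity (so the regression reduces to a Dickey-Fuller regression with an intercept), the function names `df_tstat` and `sadf` are hypothetical, and no critical values are computed.

```python
import math

def df_tstat(y):
    """t-statistic of phi in dy_t = mu + phi * y_{t-1} + e_t (assumes lag order k = 0)."""
    x = y[:-1]                                        # lagged levels y_{t-1}
    dy = [b - a for a, b in zip(y[:-1], y[1:])]       # first differences
    n = len(dy)
    x_bar, dy_bar = sum(x) / n, sum(dy) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (di - dy_bar) for xi, di in zip(x, dy))
    phi = sxy / sxx                                   # OLS slope
    mu = dy_bar - phi * x_bar                         # OLS intercept
    rss = sum((di - mu - phi * xi) ** 2 for xi, di in zip(x, dy))
    return phi / math.sqrt(rss / (n - 2) / sxx)       # slope divided by its standard error

def sadf(y, r0):
    """Supremum of the statistics over samples that start at observation 1
    and end at floor(T * r_w) for r_w in [r0, 1]."""
    T = len(y)
    first_end = max(math.floor(T * r0), 4)            # keep at least a few observations
    return max(df_tstat(y[:end]) for end in range(first_end, T + 1))
```

A call such as `sadf(series, 0.1)` returns the largest right-tailed statistic over all forward-expanding samples; in practice it would be compared with simulated critical values.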
The explosion originates at $\lfloor T r_e \rfloor$, the first observation at which the $BADF_{r_e}$ statistic exceeds the critical value. Phillips, Wu, and Yu (2011) impose the conditions that the bubble duration must be longer than $\log(T)$ and that the termination date of the explosion, $\lfloor T r_f \rfloor$, must be the first observation after $\lfloor T r_e \rfloor + \log(T)$ at which the $BADF_{r_f}$ statistic is below the critical value:

$$\hat{r}_e = \inf_{r_2 \in [r_0, 1]} \left\{ r_2 : BADF_{r_2} > cv_{r_2}^{\beta_T} \right\}, \qquad \hat{r}_f = \inf_{r_2 \in [\hat{r}_e + \log(T)/T,\, 1]} \left\{ r_2 : BADF_{r_2} < cv_{r_2}^{\beta_T} \right\} \tag{4}$$

The critical value $cv_{r_2}^{\beta_T}$ should vary with the number of observations in the testing window so that it diverges to infinity and eliminates type I errors for large $T$; specifically, they suggested setting $cv_{r_2}^{\beta_T} = \log(\log(T r_2))/100$. The test using the bubble date-stamping method of equation 4 is referred to as the PWY test in this paper.
The disadvantage of the PWY test is that it may fail when multiple bubbles are present in the sample. Phillips et al. (2015) proposed the generalised version of SADF (i.e., the GSADF test), which addresses this issue by allowing the starting point of the testing window to change. The GSADF statistic is defined as

$$GSADF(r_0) = \sup_{r_2 \in [r_0, 1],\; r_1 \in [0, r_2 - r_0]} ADF_{r_1}^{r_2}$$

The asymptotic distribution of the GSADF test is elaborated by Phillips et al. (2014).
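The date-stamping method used in the GSADF test is an extended version of the one used in the BADF statistic. The BSADF test performs an SADF test by rolling the starting point of the test window $r_1 \in [0, r_2 - r_0]$ from observation $\lfloor T(r_2 - r_0) \rfloor$ to the first observation. The BSADF statistic for a testing sequence that starts at $r_1$ and ends at $r_2$ is defined as

$$BSADF_{r_2}(r_0) = \sup_{r_1 \in [0, r_2 - r_0]} ADF_{r_1}^{r_2}$$

Similar to the PWY test, the origination date of a bubble in the GSADF test is $\lfloor T r_e \rfloor$, the first observation at which the $BSADF_{r_e}$ statistic exceeds the critical value. The minimum bubble duration in the BSADF statistic is generalised to $\delta \log(T)$, where $\delta$ is a frequency-dependent parameter. Furthermore, the termination date of the explosion, $\lfloor T r_f \rfloor$, is the first observation after $\lfloor T r_e + \delta \log(T) \rfloor$ at which the $BSADF_{r_f}$ statistic is below the critical value:

$$\hat{r}_e = \inf_{r_2 \in [r_0, 1]} \left\{ r_2 : BSADF_{r_2}(r_0) > cv_{r_2}^{\beta_T} \right\} \tag{5a}$$

$$\hat{r}_f = \inf_{r_2 \in [\hat{r}_e + \delta \log(T)/T,\, 1]} \left\{ r_2 : BSADF_{r_2}(r_0) < cv_{r_2}^{\beta_T} \right\} \tag{5b}$$

where $cv_{r_2}^{\beta_T}$ denotes the corresponding critical value of the BSADF statistic. The bubble date-stamping method of equation 5 is referred to as the PSY test in this paper.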
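The origination and termination rule described above can be illustrated with a short sketch that takes an already computed BSADF sequence and a matching sequence of critical values. The function name `psy_date_stamp` and its arguments are hypothetical; the indices returned are positions in the BSADF sequence rather than calendar dates, and the sketch applies the rule literally (first exceedance, then first sub-critical value after the minimum duration).

```python
import math

def psy_date_stamp(bsadf, cv, total_obs, delta=1.0):
    """Return a list of (origination, termination) index pairs.

    bsadf, cv : equal-length sequences of statistics and critical values
    total_obs : sample size T used in the minimum duration delta * log(T)
    A termination of None means the episode has not ended within the sample.
    """
    min_len = int(delta * math.log(total_obs))    # minimum bubble duration in observations
    episodes = []
    t = 0
    while t < len(bsadf):
        if bsadf[t] > cv[t]:                      # first exceedance: origination
            end = None
            for s in range(t + min_len, len(bsadf)):
                if bsadf[s] < cv[s]:              # first sub-critical value after the minimum duration
                    end = s
                    break
            episodes.append((t, end))
            if end is None:
                break
            t = end + 1
        else:
            t += 1
    return episodes
```

Passing a larger `delta` lengthens the minimum duration, which is the constraint the modified method in this paper sets out to relax.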
The bubble origination date in the GSADF test is date-stamped using equation 5a, which gives the ending point $r_2$ of the testing window in the BSADF statistic. The PSY date-stamping method thus picks the ending point of an explosive series as the bubble origination date. Though this gives confidence in forward tests, it might miss the bubble formation period, which is crucial for a backtest. Alternatively, we determine the bubble origination date by taking the starting point $r_1$, instead of the ending point $r_2$, of the explosive series. We therefore modify the date-stamping method for the bubble origination date as in equation 6a.
The new bubble origination date is $\lfloor T \bar{r}_e \rfloor$. In equation 6a, $\bar{r}_2$ is the end of the testing sequence in the BSADF test for bubble origination. We use this ending point as the starting point to detect the termination date of the explosion, $\lfloor T \bar{r}_f \rfloor$, in equation 6b. The minimum length of the bubble duration is the size of the testing window when the bubble origination date starts at $r_1$. Thus, we relax the minimum bubble duration constraint $\delta \log(T)$ of the PSY test when obtaining the bubble termination date. The differences between the PSY date-stamping method and the modified method are presented in section 5.2. This new date-stamping method, presented in equation 6, is used throughout the study.

VaR models and backtests
We compute daily 1% VaRs by using the RiskMetrics (Riskmetrics, 1996) and RiskMetrics 2006 (Zumbach, 2007) models to compare the performance of VaR in the pre-bubble, bubble, and post-burst periods. The RiskMetrics model assumes that the asset returns $x_t$ are normally distributed with mean $\mu_t$ and variance $\sigma_t^2$. Using the standard normal cumulative distribution function $\Phi(x_t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x_t} e^{-y^2/2}\, dy$ and the cumulative distribution function of the asset return $F(x_t; \mu_t, \sigma_t) = \Phi((x_t - \mu_t)/\sigma_t)$, the $\alpha$ percent one-day VaR can be obtained using equation 7. The estimated values of $\mu_t$ and $\sigma_t$ are computed from an estimation window of size $W_E$, which is set to 250 days in this study. As the Student's t-distribution is another prominent distribution for describing financial asset returns, in addition to the original RiskMetrics model we study a variant that uses the cumulative t-distribution $\Gamma_v(1 - \alpha)$ with $v$ degrees of freedom to compute the $\alpha$ percent one-day VaR; the formula is shown in equation 8.
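As an illustration of the normal-distribution case, the sketch below estimates the mean and standard deviation from the most recent 250 returns and reads the 1% quantile off the fitted normal distribution. It is a sketch under assumptions rather than the paper's equation 7: the function name `normal_var` is hypothetical, the population standard deviation is used, and VaR is reported as a positive loss figure, which is one common sign convention.

```python
from statistics import NormalDist, mean, pstdev

def normal_var(returns, alpha=0.01, window=250):
    """One-day VaR at level alpha from a normal fit to the last `window` returns.

    Returned as a positive number: the loss exceeded with probability alpha
    under the fitted distribution (sign convention assumed here).
    """
    sample = returns[-window:]                   # estimation window of size W_E
    mu = mean(sample)
    sigma = pstdev(sample, mu)                   # population standard deviation (assumed)
    quantile = mu + sigma * NormalDist().inv_cdf(alpha)
    return -quantile
```

With zero-mean returns and a daily standard deviation of 1%, the 1% VaR from this sketch is about 2.33% of the portfolio value.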
RiskMetrics estimates the volatility $\sigma_t$ by using the second central moment of the asset returns $x_t$. In contrast, RiskMetrics 2006 allocates a heavier weight to recent observations, while preserving the long-lasting impact of shocks. Towards this end, RiskMetrics 2006 introduces a hyperbolic decay factor $h_k = \exp(-1/\tau_k)$, based on the geometric time horizon factor $\tau_k$ defined using equation 9.
Here $\rho$ is an operationalised parameter. RiskMetrics 2006 obtains the volatility $\sigma_{t+1}^2$ using equation 10, by summing the $K$ historical volatilities $\sigma_{k,t}^2$ (defined in equation 11a) with logarithmic decay weights $w_k$ derived using equation 11b.
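The construction described above can be sketched as follows. This is a hedged illustration, not a reproduction of equations 9 to 11: the parameter values ($\tau_0 = 1560$, $\tau_1 = 4$, $\tau_{max} = 512$ days, $\rho = \sqrt{2}$) are assumed defaults in the spirit of Zumbach (2007), the function name `rm2006_variance` is hypothetical, each component is assumed to be an exponentially weighted moving average of squared returns, and each component is initialised at the first squared return for simplicity.

```python
import math

def rm2006_variance(returns, tau0=1560.0, tau1=4.0, tau_max=512.0, rho=math.sqrt(2)):
    """One-step-ahead variance as a weighted sum of K EWMA components (assumed parameters)."""
    # Geometric time horizons tau_k = tau1 * rho**(k-1), k = 1..K, up to tau_max
    K = 1 + round(math.log(tau_max / tau1) / math.log(rho))
    taus = [tau1 * rho ** k for k in range(K)]
    decay = [math.exp(-1.0 / tau) for tau in taus]           # h_k = exp(-1 / tau_k)
    # Logarithmic decay weights, normalised to sum to one
    raw = [1.0 - math.log(tau) / math.log(tau0) for tau in taus]
    weights = [w / sum(raw) for w in raw]
    # EWMA recursion per component: s <- h * s + (1 - h) * r^2
    comp = [returns[0] ** 2] * K
    for r in returns[1:]:
        comp = [h * s + (1.0 - h) * r * r for h, s in zip(decay, comp)]
    return sum(w * s for w, s in zip(weights, comp))
```

Short horizons react quickly to new shocks, while long horizons keep a shock in the estimate for many months, which is consistent with the conservative post-burst behaviour reported later in the paper.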
We define $T$ as the total number of observations in the data set, $W_E$ as the size of the estimation window, and $W_T$ as the size of the testing window for VaR violations. A VaR violation ($x_t = 1$) is recorded when the loss on a trading day $t$ exceeds the calculated VaR value. The total number of VaR violations $\nu_1$ in the testing period $W_T$ is calculated using equation 12; $\nu_0$, calculated using equation 13, indicates the number of days without violations.
We employ three categories of backtests to test the accuracy of the VaR models: the unconditional, conditional, and joint coverage tests. Unconditional coverage tests evaluate the VaR models by testing the number of violations at a given confidence level. In the unconditional coverage tests, we employ Kupiec (1995)
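To make the violation counts concrete, the sketch below counts $\nu_1$ and $\nu_0$ and feeds them into a proportion-of-failures likelihood ratio test in the spirit of Kupiec (1995). The function names are hypothetical, VaR is assumed to be stored as a positive loss figure, and the p-value uses the chi-square distribution with one degree of freedom, whose survival function equals `erfc(sqrt(x / 2))`.

```python
import math

def count_violations(returns, var_forecasts):
    """Return (v1, v0): days on which the loss exceeds VaR, and days on which it does not."""
    v1 = sum(1 for r, v in zip(returns, var_forecasts) if -r > v)
    return v1, len(returns) - v1

def kupiec_pof(v1, v0, alpha=0.01):
    """Likelihood ratio statistic and p-value for H0: violation probability equals alpha."""
    n = v1 + v0

    def loglik(p):
        out = 0.0
        if v1:
            out += v1 * math.log(p)
        if v0:
            out += v0 * math.log(1.0 - p)
        return out

    lr = -2.0 * (loglik(alpha) - loglik(v1 / n))
    p_value = math.erfc(math.sqrt(max(lr, 0.0) / 2.0))   # chi-square(1) survival function
    return lr, p_value
```

For example, 5 violations in 500 days match the 1% coverage rate exactly, so the statistic is zero and the null hypothesis is not rejected, whereas 20 violations in 500 days lead to a clear rejection.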

Fisher's exact test
We perform Fisher's exact test (Fisher, 1922) to examine the significance of associations between the VaR failure rates in different crisis periods. We use Fisher's exact test in this study because of the small number of VaR violations (a 1% VaR implies about 5 expected violations in 500 observations). We use a 2 × 2 contingency table to represent the number of VaR violations in different periods. For instance, in Table 1, for $T$ days in period A, the VaR measure performed well for $\nu_0^A$ days but failed for $\nu_1^A$ ($= T - \nu_0^A$) days.
Under the null hypothesis, the VaR failures in different periods are stochastically independent; that is, the VaR failure rates show no significant difference between the crisis and non-crisis periods. The alternative hypothesis is that the difference in VaR failures between periods is systematic rather than coincidental.
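A two-sided Fisher's exact test on such a 2 × 2 table can be computed directly from the hypergeometric distribution, as in the sketch below. The function name `fisher_exact_two_sided` is hypothetical, and the two-sided p-value follows one common convention: it sums the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Example layout: a, b = violations and non-violations in period A;
                    c, d = violations and non-violations in period B.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x violations in period A, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)
```

Two periods with 5 violations in 500 days each give a p-value of 1, while 30 violations against 5 in samples of the same size give a p-value far below 5%.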

Data and empirical results
The nine stock markets employed in our empirical analysis represent both the developed and emerging markets.The selection was based on the market capitalisation and trading time zones, shown in Table 2. Monthly price and dividend data were used in the bubble tests and daily price data were used in the VaR backtests.The data were obtained from Bloomberg.

Pre-bubble, bubble, and post-burst periods
Table 3 shows the GSADF tests of the log price-dividend ratio and log dividend difference series of the nine markets. The finite-sample critical values used in the GSADF tests were obtained from 2,000 simulations. We followed Phillips et al. (2015) in determining the minimum window size of the test based on the rule $r_0 = 0.01 + 1.8/\sqrt{T}$, where $T$ is the corresponding sample size. As the GSADF test statistics of all the log price-dividend ratios exceeded their corresponding 10% right-tail critical values, the summary test showed an explosive behaviour in the log price-dividend ratio and a non-explosive behaviour in the log dividend difference series. This provided evidence for explosive subperiods in the nine markets. We performed both the PSY test and the modified test to date-stamp the origination and termination of the bubbles using equations 5 and 6, respectively. Table 4 shows the bubble date-stamping results from these methods in the nine stock markets. In the PSY date-stamping method, the estimated bubble origination dates were close to the termination dates in most cases and the bubble duration was short. The average duration of the bubble period was 7 months. Although all the tests identified asset price bubbles around the subprime mortgage crisis, they exhibited a time lag. For the stock markets in Australia, Canada, Germany, Spain, Japan, the UK, and the US, the PSY method suggested that the bubbles originated in late 2008 and terminated in mid-2009. The origination dates were close to the actual bubble burst, that is, when Lehman Brothers filed for the largest bankruptcy protection in US history on 15 September 2008. The reason for the short durations of asset price bubbles indicated by the PSY test, contrary to the general agreement that it takes time for a bubble to form, is the date-stamping strategy, which is based on choosing the ending point of the testing window when the whole sample is explosive. We believe that a more appropriate choice for the bubble origination date would be the starting point of the testing window, instead of the ending point.
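The minimum window rule quoted above is easy to evaluate for a given sample size. The helper below is a small illustrative sketch; the function name `min_window` is hypothetical and the sample size in the usage note is not taken from the paper's data.

```python
import math

def min_window(T):
    """Return (r0, number of observations) for the rule r0 = 0.01 + 1.8 / sqrt(T)."""
    r0 = 0.01 + 1.8 / math.sqrt(T)
    return r0, math.floor(T * r0)
```

For a hypothetical sample of 300 monthly observations, the rule gives $r_0 \approx 0.114$, so the smallest testing window contains 34 observations.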
The bubble origination dates identified by the modified method are from late 2002 to mid-2005 and the termination dates are from early 2009. Our results agree with those from the previous studies (Brown, Stein, & Zafar, 2015; Demyanyk & Van Hemert, 2011; Lewis, 2009; Mian & Sufi, 2009; Sanders, 2008).
To illustrate the differences between the two date-stamping methods, we used the Hong Kong stock market as an example. The PSY approach identified a five-month bubble from November 2007 to March 2008; this is illustrated in Figure 1, with the bubble period highlighted in grey. In contrast, the bubble period identified by the modified method was from April 2004 to October 2007, as shown in Figure 1b. Unlike the PSY approach, the modified method did not indicate shorter durations and agreed well with the asset price bubble cycles of the subprime mortgage crisis (see Dell'Ariccia, Igan, & Laeven, 2012; Lewis, 2009; Tridico, 2012).

Backtesting results
We defined the pre-bubble and post-burst periods as the 2 years before and after the bubble, respectively. We used Monte Carlo simulations with 2,000 replications to obtain the critical values, following Phillips et al. (2015). The pre-bubble period covers the $p$ observations immediately before the bubble origination date, and the post-burst period covers the $p$ observations immediately after the bubble termination date. Here $p$ is the time frequency-dependent parameter; $p = 24$ for the monthly data and $p = 500$ for the daily data (we assume 250 trading days per year). Taking Hong Kong as an example, for a bubble period identified from September 2004 to October 2007, the pre-bubble period extended from September 2002 to August 2004 and the post-burst period from November 2007 to October 2009. Further, we define the periods not covered by the pre-bubble, bubble, and post-burst periods as "normal periods." For example, two normal periods, September 1993 to August 2002 and November 2009 to October 2019, were defined from the entire period considered for the Hong Kong market (September 1993 to October 2019). Table 5 shows the bubble periods identified for the nine stock markets. One bubble period was identified each for Australia, Canada, China, Germany, Hong Kong, Japan, and the UK, and two each for Spain and the US. The reason for the additional bubble period in Spain and the US could be the extent of data available: Spain's data start from January 1991 and the US data from February 1978.
Tables 6 and 7 show the results of the RiskMetrics VaR backtests, and Tables 8 and 9 show the results of RiskMetrics 2006 in different periods. In all the hypothesis tests, we tested the null hypotheses at a significance level of 5%. For the periods where no VaR violation was observed, we have not reported the value for the independence tests, as the results of the violation clustering test were ambiguous (see, for example, the independence test and joint test results in the normal period for the UK (period 1) in Table 7). In the coverage tests, a case is deemed a "failure" only if the null hypothesis is rejected in both coverage tests. An analogous criterion was also adopted for the independence tests and joint tests.
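The period definitions for monthly data can be reproduced with simple month arithmetic. The sketch below is illustrative: the function names `shift_month` and `crisis_periods` are hypothetical, dates are represented as `(year, month)` tuples, and `p = 24` corresponds to the monthly setting described above.

```python
def shift_month(year, month, offset):
    """Move a (year, month) pair forward or backward by `offset` months."""
    idx = year * 12 + (month - 1) + offset
    return idx // 12, idx % 12 + 1

def crisis_periods(origination, termination, p=24):
    """Pre-bubble, bubble, and post-burst periods as (first month, last month) pairs."""
    pre = (shift_month(*origination, -p), shift_month(*origination, -1))
    post = (shift_month(*termination, 1), shift_month(*termination, p))
    return {"pre_bubble": pre, "bubble": (origination, termination), "post_burst": post}
```

Calling `crisis_periods((2004, 9), (2007, 10))` reproduces the Hong Kong example: a pre-bubble period from September 2002 to August 2004 and a post-burst period from November 2007 to October 2009.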
Table 6 shows that RiskMetrics performed poorly around turbulent financial times. The RiskMetrics models with the normal distribution and the t-distribution produced analogous results. Minor differences were found in Canada and the UK, but these have no impact on the outcomes of the hypothesis tests. With the large estimation window used in the RiskMetrics model, the fitted t-distribution approaches the normal distribution. Further, RiskMetrics performed effectively only in the coverage tests for the pre-bubble period. We could not reject the null hypothesis that the probability of a violation was the same as the coverage rate for the Australia, Canada, China, Spain, Hong Kong, Japan, and US markets. The exceptions were the Germany (p-value of 0.028 in the TUFF test) and UK markets (p-value of 0.003 in the POF test). Violation clustering was also noted in Canada, Germany, the UK, and the US, which led to the rejection of the null hypotheses in these markets in all the joint tests.
RiskMetrics performed inadequately in the bubble period as well. The null hypotheses of the POF coverage tests were rejected in all the nine markets. Although Christoffersen's independence test showed no significant consecutive violations, the violation clustering was severe, leading to the null hypotheses being rejected by the mixed-Kupiec independence test and all the joint tests. However, RiskMetrics performed noticeably better in the post-burst period, as the null hypotheses were rejected only for China, Hong Kong, and Japan. The coverage tests (with a p-value of 0.001 in the POF test and 0.011 in the TUFF test) showed that the number of violations in China was misspecified. The independence tests showed that the Japanese market exhibited violation clusters (with a p-value of 0.035 in the Christoffersen test and 0.005 in the mixed-Kupiec test). In the post-burst period, RiskMetrics underestimated the downside risk in China, Hong Kong, and Japan. Further, Table 7 shows that RiskMetrics presented inaccurate downside risk measures for both the normal and full periods. Similar to the results for the bubble period, the null hypotheses were rejected in most of the POF coverage tests, independence tests, and joint tests. We found that RiskMetrics failed to reveal the downside risk in most cases.
The results in Table 8 provide convincing evidence that RiskMetrics 2006 works efficiently in a period of market turbulence. In the pre-bubble period, we could not reject the null hypotheses in any of the tests, except for China and the US. In China, the null hypothesis was rejected by both coverage tests (the p-value was 0.002 in the POF test and 0.033 in the TUFF test). In the US, both the independence tests (period 1) rejected the null hypothesis (with p-values of 0.034 and 0.020). Further, RiskMetrics 2006 performed efficiently in the bubble period. It failed only in the coverage tests and joint tests for the US market in period 2, defined in Table 5. However, the long-memory RiskMetrics 2006 model behaved conservatively in the post-burst period, as no violation was found in Australia, Canada, Japan, the UK, and the US (period 2). In the normal period, RiskMetrics 2006 performed efficiently, failing only in period 2 of the independence tests for Australia (with a p-value of 0.008 in the Christoffersen test and 0.004 in the mixed-Kupiec test), Japan (with a p-value of 0.008 in the Christoffersen test and 0.008 in the mixed-Kupiec test), and the US (with a p-value of 0.002 in the Christoffersen test and 0.000 in the mixed-Kupiec test).
RiskMetrics 2006 behaved conservatively, overestimating the downside risk after the bubbles. However, over the total duration considered, the numbers of VaR violations were 34, 57, 49, 23, 45, 43, 43, 25, and 77 in Australia, Canada, China, Germany, Spain, Hong Kong, Japan, the UK, and the US, respectively. These numbers were significantly lower than those produced by the RiskMetrics model, which were 100, 172, 112, 129, 139, 140, 146, 108, and 212, respectively, as shown in Table 7. Although RiskMetrics 2006 may require financial institutions to allocate more capital than necessary to manage risks, it performed efficiently in most periods.

Table note: The decimal numbers in the cells are the p-values for assessing the null hypothesis that the model adequately measures the downside risk. Asterisk (*) indicates a p-value of less than 0.05.

VaR performance in pre-bubble, bubble, and post-burst periods
We conducted Fisher's exact tests for both the RiskMetrics and RiskMetrics 2006 models to gain further insight into their performance in different crisis periods. For each of the nine markets, we performed 10 Fisher's exact tests (one for each pair of the five periods) to compare the VaR performance during the full, pre-bubble, bubble, post-burst, and normal periods. The results are shown in Table 10. Table 10 shows that, unlike RiskMetrics, RiskMetrics 2006 performs consistently across periods in all nine markets. In each panel, the lower triangular matrix contains the p-values of the RiskMetrics tests, while the upper triangular matrix contains the p-values of the RiskMetrics 2006 tests. A Fisher's test p-value greater than 5% indicates no statistically significant difference in the VaR failure rates between the two chosen periods. Panels 10(a), 10(g), and 10(h) show that the performance of RiskMetrics VaR differed in the bubble periods. The null hypotheses were rejected in the tests between the bubble period and the pre-bubble, post-burst, and full periods. Combined with the results in Tables 6 and 7, this shows that the performance of RiskMetrics was inconsistent across periods, with a higher frequency of failure in the bubble periods. Panel 10(i) shows that RiskMetrics performed considerably differently in the US post-burst period than in the other periods. Comparing this with the results in Table 6, we observe that the RiskMetrics model failed in all the periods in the US except for the post-burst period. Similarly, panel 10(c) shows that it only performed well in the pre-bubble period in China. The overall results suggest a significant difference in the RiskMetrics performance across different countries in different crisis periods. In contrast to RiskMetrics, RiskMetrics 2006 did not show extensive performance differences across periods. The VaR performance seen in panels 10(d), 10(e), 10(f), and 10(i) shows no significant difference, whereas occasional inconsistencies are observed in panels 10(b), 10(g), and 10(h). Specifically, the inconsistency in these three panels was between the bubble and the post-burst periods. The Fisher's test results support our finding in section 5.2 that, although RiskMetrics 2006 occasionally overstated the downside risk in post-burst periods, it performed efficiently in most periods.
Note to Table 10: The decimal numbers in the cells are the p-values for assessing the null hypothesis that the VaR failure rates show no significant differences between the periods. The results of RiskMetrics are shown in the lower triangular matrix (non-shaded) and the results of RiskMetrics 2006 are shown in the upper triangular matrix (shaded). Asterisk (*) indicates a p-value of less than 0.05. A Fisher's test p-value larger than 0.05 indicates no statistically significant difference in the VaR failure rates between the two chosen periods.
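The pairwise period comparison described above can be illustrated with a minimal sketch of a two-sided Fisher's exact test on a 2x2 table of VaR failures and non-failures in two periods. The function name, its arguments, and the counts in the usage lines are hypothetical and are not taken from the paper's data. The two-sided rule used here (summing the probabilities of all tables that are no more likely than the observed one) is one common convention and is assumed rather than confirmed by the text.

```python
from math import comb


def fisher_exact_two_sided(fail_a, n_a, fail_b, n_b):
    """Two-sided Fisher's exact p-value for VaR failure counts in two periods.

    fail_a, fail_b: number of VaR failures in periods A and B (hypothetical inputs)
    n_a, n_b:       number of backtested days in periods A and B
    """
    total_fail = fail_a + fail_b
    denom = comb(n_a + n_b, total_fail)

    def prob(k):
        # Hypergeometric probability of k failures in period A,
        # holding the row and column totals of the 2x2 table fixed.
        return comb(n_a, k) * comb(n_b, total_fail - k) / denom

    p_observed = prob(fail_a)
    k_min = max(0, total_fail - n_b)
    k_max = min(n_a, total_fail)
    # Sum the probabilities of all tables no more likely than the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(
        p for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= p_observed * (1 + 1e-7)
    )


# Hypothetical counts: 12 failures in 250 bubble-period days
# versus 3 failures in 250 pre-bubble days.
p_value = fisher_exact_two_sided(12, 250, 3, 250)
reject_equal_failure_rates = p_value < 0.05
```

A p-value below 0.05 from such a test would correspond to an asterisked cell in Table 10, meaning that the failure rates in the two periods differ significantly.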

Summary and conclusions
In this study, we conducted empirical tests to respond to the criticism that VaR models fail during financial crises when asset bubbles burst. Our empirical tests on the nine markets, namely Australia, Canada, China, Germany, Spain, Hong Kong, Japan, the UK, and the US, were conducted using data from the earliest available dates to October 2019. Asset price bubbles were date-stamped using the modified PSY tests on the log price-dividend ratios of these nine markets. However, as the date-stamping method of the original PSY test is forward-looking, it selects the ending point of the testing window as the bubble origination date. To suit our need for a backward-looking date-stamping method for the backtests, we modified the date-stamping method by choosing the starting point of the testing window. Our results show that the original date-stamping method has a time lag and indicates a shorter duration for the bubble period. It date-stamps the bubble start near the end of a true bubble, and the average duration of the bubble periods detected is 8 months. The results also show that our modified method addresses both the time lag and the shorter duration issues present in the original date-stamping method and is thus more suitable for backtesting.
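The difference between the two date-stamping rules can be sketched as follows. This is only an illustration under stated assumptions: the ADF statistics are taken as precomputed inputs, the names `adf_by_window` and `critical_values` are hypothetical, and the "testing window" is assumed to be the window whose ADF statistic attains the supremum at the first end point where that supremum exceeds the critical value. The toy numbers are invented for the example and are not results from the paper.

```python
def date_stamp_origination(adf_by_window, critical_values):
    """Return the bubble origination under both date-stamping rules.

    adf_by_window[r2] maps each window start r1 to the ADF statistic
    computed on the window [r1, r2] (hypothetical, precomputed input).
    critical_values[r2] is the critical value for end point r2.
    """
    for r2 in sorted(adf_by_window):
        stats = adf_by_window[r2]
        # Window start whose statistic attains the supremum for this end point.
        r1_star = max(stats, key=stats.get)
        if stats[r1_star] > critical_values[r2]:
            return {
                "original": r2,       # forward-looking: end of the testing window
                "modified": r1_star,  # backward-looking: start of the testing window
            }
    return None  # no explosive episode detected


# Toy inputs (invented): observation indices stand in for dates.
adf_by_window = {
    5: {0: -1.2, 1: -0.8},
    6: {0: -0.5, 1: 0.1, 2: -0.3},
    7: {0: 0.9, 1: 1.4, 2: 2.3, 3: 1.1},
}
critical_values = {5: 1.5, 6: 1.5, 7: 1.5}

stamps = date_stamp_origination(adf_by_window, critical_values)
```

In this toy case the original rule stamps the origination at observation 7, whereas the modified rule stamps it at observation 2, which illustrates why the modified rule yields an earlier start and a longer bubble period.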
The empirical test results of the nine backtests allowed us to draw two main conclusions. First, the RiskMetrics 2006 model outperforms the RiskMetrics model. Specifically, the former works well in pre-bubble, bubble, and normal periods but behaves conservatively in the post-burst period, which may overestimate the downside risk. Second, the criticism that all VaR models fail in crisis or bubble periods is statistically invalid. The power of VaR is affected by the modelling practices adopted by different financial institutions.
An interesting direction for future research would be to examine parametric and non-parametric VaR approaches. Non-parametric approaches comprise historical and Monte Carlo simulations, whereas parametric approaches include GARCH, GJR-GARCH (Glosten, Jagannathan, & Runkle, 1993), and the multivariate DCC (Engle, 2002) approach.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Notes on contributors
Raymond Kwong is a Senior Lecturer in the College of Professional and Continuing Education, the Hong Kong Polytechnic University. He received his PhD in Economics from Newcastle University, United Kingdom. His research focuses on asset price bubbles, market microstructure, price discovery, and risk management in equity markets.
Helen Wong is a Principal Lecturer in the College of Professional and Continuing Education, the Hong Kong Polytechnic University (HKPU). Her research focuses on customer relationship building, corporate social responsibility, accounting, and finance. Prior to joining HKPU, Helen worked in several renowned organisations in Hong Kong and Canada, such as PricewaterhouseCoopers, Hong Kong Stock Exchange, University of Toronto, and Bank of Montreal.
. (2011) further recommended that the critical value cv β T period : T� r e b c À p; T� r e b c ½ �

Figure 1. Bubble identification results for the Hong Kong stock market. (a) Identification results from the GSADF test with the PSY date-stamping method. (b) Identification results from the GSADF test with the proposed date-stamping method.
The backtest results of the RiskMetrics models based on the t-distribution do not differ from those based on the normal distribution. Results that differ under the t-distribution are shown in square brackets [].

Table 1. Contingency table of Fisher's exact test for VaR measures in different periods.

Table 2. Stock exchanges and the respective indices used. a Data on market capitalisation obtained from World Federation of Exchanges, June 2019; values are in US$ million. b Data obtained from the London Stock Exchange Main Market Factsheet, June 2019; values are in US$ million, converted at 1.26993 USD/£1.

Table 3. GSADF tests for the log price-dividend ratio and log dividend difference in the nine markets.

Table 4. Bubble periods obtained by the PSY test and the modified date-stamping method.

Table 5. Full, pre-bubble, bubble, post-burst, and normal periods for the bubbles examined in the nine stock markets.

Table 6. Backtest results of RiskMetrics VaR in the pre-bubble, bubble, and post-burst periods.

Table 7. Backtest results of RiskMetrics VaR in the normal and full periods. The backtest results of the RiskMetrics models based on the t-distribution do not differ from those based on the normal distribution. Results that differ under the t-distribution are shown in square brackets []. The decimal numbers in the cells are the p-values for assessing the null hypothesis that the model adequately measures the downside risk. Asterisk (*) indicates a p-value of less than 0.05.

Table 8. Backtest results of RiskMetrics 2006 VaR in the pre-bubble, bubble, and post-burst periods. The decimal numbers in the cells are the p-values for assessing the null hypothesis that the model adequately measures the downside risk. Asterisk (*) indicates a p-value of less than 0.05.

Table 9. Backtest results of RiskMetrics 2006 VaR in the full and normal periods.

Table 10. Fisher's exact test for the significance level of the difference of VaR failure rate by time period.