On downscaling probabilities for heavy 24-hour precipitation events at seasonal-to-decadal scales

We examine a method for predicting variations in the probabilities and occurrence of intense local 24-hour precipitation events over seasonal, annual, and 5-yr intervals. The wet-day mean m can be used to describe the exponential distribution for the wet-day amounts, and provides a basis for forecasting the likelihood of seeing an event above a given threshold. A regression-based downscaling model is calibrated on annual m and wet-day frequency fw, respectively, and then tested on seasonal to multi-annual scales. The probabilities for heavy precipitation statistics are taken as the product between a fitted cumulative exponential distribution for the wet-day amounts m and fw. The annual number of heavy precipitation events is estimated from the annual probability, where a 90% confidence interval on the number is computed using the 5 and 95 percentiles of a binomial distribution for the predicted probability. The analysis identifies a strong link between large-scale predictors such as mean sea-level pressure or surface temperature and the wet-day frequency. There was a weaker dependency of the wet-day mean to large-scale predictors, which suggests that local-scale processes have a stronger influence on the 24-hour precipitation amounts. The results also suggest that similar physical processes influence the wet-day mean m on different timescales except for during winter, and that similar physical processes may explain the wet-day frequency on seasonal to annual time scales. The implication is that models calibrated on annual data samples in some cases can give skilful predictions on seasonal time scales.


Introduction
Can we make skilful forecasts of precipitation statistics seasons to decades ahead, given large-scale patterns in surface temperature and mean sea level pressure (SLP)? Local climate information is important for the users in addition to their credibility, salience, and legitimacy. Barsugli et al. (2013) proposed that establishing the credibility of downscaled climate data is more than just evaluation against historical observations. Skilful forecast on lead times from seasons to decades would provide a valuable basis for decision making for many sectors in our society (Palmer et al., 2000). For instance, such forecasts could guide farmers on choices regarding when and what to sow (Doblas-Reyes et al., 2007), hydroelectric power producers on how much water to retain in the reservoirs, and manufacturers on the volume of production. Furthermore, early warnings on extreme precipitation can help mitigating flooding and provide valuable time for making necessary preparations. One question is whether such early warning systems are possible, even if we could predict the general state and flow of the oceans and the air masses.
Here we try to answer this question in more specific terms: whether it is possible to predict local 24-hour precipitation statistics for a season, year, or a decade, given a representative description of the regional large-scale conditions. We also test the hypothesis whether dependencies between large-scale conditions and local rainfall found on annual timescales are valid for seasonal or decadal timescales. *Corresponding author. email: rasmus.benestad@met.no Tellus A 2015. # 2015 R. E. Benestad and A. Mezghani. This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), allowing third parties to copy and redistribute the material in any medium or format and to remix, transform, and build upon the material for any purpose, even commercially, provided the original work is properly cited and states its license.

Background
Several studies in the past have investigated the possibility of skilful seasonal forecasts for large-scale regional conditions based on general circulation models (GCMs) and statistical forecasting models (Barnett and Preisendorfer, 1987;Barnston et al., 1994;Palmer and Anderson, 1994;Dix and Hunt, 1995;Johansson et al., 1998;Stockdale et al., 1998;Derome et al., 2001Derome et al., , 2005Rodwell and Doblas-Reyes, 2006). Some studies have suggested that a blend of results from different types of GCMs may improve the forecast skill (Krishnamurti et al., 1999;Yun et al., 2003); however, the question whether such blending results in a real enhancement of skill has also been questioned (Weigel et al., 2008).
Seasonal forecasting has been far more challenging for the North Atlantic and European region than for the Tropics, where the El Nin˜o Southern Oscillation (ENSO) is a major factor. There have been some studies on teleconnections between ENSO and Europe (Lloyd-Huges and Saunders, 2002), but so far with little success in enhancing seasonal forecasting model skill. A major effort was undertaken to develop a European multimodel ensemble system for seasonal to inter-annual prediction, DEMETER  and ENSEMBLES (Linden and Mitchell, 2009). A large part of the seasonal variability over Europe can be linked to variations in the North Atlantic Oscillation (NAO), the jet stream, and planetary Rossby waves; hence, skilful seasonal forecasts need to forecast their future state (Derome et al., 2005;Johansson, 2007). Other factors that may have an effect on the subsequent season over Europe may include the snow cover or the Arctic sea ice (Ueda et al., 2003;Benestad et al., 2010;Orsolini et al., 2011). However, it is difficult to find a systematic effect for some of these conditions. Furthermore, their effect may vary with season. It is also possible that the stratosphere may play a role for seasonal variations (Baldwin and Dunkerton, 2001;Thompson et al., 2002;Baldwin et al., 2003;Bro¨nnimann et al., 2004;Cagnazzo and Manzini, 2009).
The seasonal forecasting capability has improved over time (Kharin and Zwiers, 2003;Palmer, 2004;George and Sutton, 2006;Stocker et al., 2013), especially in the Tropics. Some of the improvement can be attributed to more advanced models, but part of the progress can also be linked to better model initialisation (Zhang and Frederiksen, 2003;Li et al., 2008) or a better understanding of phenomena such as teleconnections (Hawkins et al., 2002). Recent seasonal forecasting activity has also included prognoses for storm statistics (Compo and Sardeshmukh, 2004;Vitard et al., 2007). Murphy et al. (2010) reviewed the state on predicting decadal climate variability and change. They provided an overview of phenomena with decadal time scales, such as the NAO, the Atlantic Multi-decadal Oscillation (AMO), the Tropical Atlantic surface air temperature Gradient Oscillation (TAG), the North Pacific Oscillation (NPO), the Pacific Decadal Oscillation (PDO), and the Inter-decadal Pacific Oscillation (IPO). The state-of-the-art models are able to simulate key features of the most prominent of these modes, they argued, and at least on a global scale, simulated decadal temperature variability has been shown to be realistic. They also noted that precipitation variance in many latitude bands may be underestimated by climate models.
There have been some studies of the skill of downscaled climate data in the context of seasonal-to-decadal predictions. For instance, Gershunov and Barnett (1998) analysed the ENSO influence on intraseasonal extreme rainfall in the Contiguous United States. They looked at both observations and model results and found an ENSO response in the frequency of occurrence of heavy rainfall in the Southeast, Gulf Coast, central Rockies, and the general area of the MississippiÁOhio River valleys. Feddersen (2003) studied the predictability of seasonal precipitation in the Nordic region, and Feddersen and Andersen (2005) applied a statistical downscaling method to seasonal ensemble forecasts. Landman et al. (2009) examined the performance of dynamical and empirical downscaling for South Africa from seasonal forecasts, and Reason et al. (2006) studied the effects of the Atlantic Ocean state on subsequent seasonalto-decadal Southern African weather statistics.
We wanted to test whether it is possible to predict local precipitation statistics on seasonal-to-decadal time scales, given a good forecast for surface air temperature (SAT) and mean SLP. Here we use the terms 'forecast' and 'prediction' as synonyms. A dependency of the local precipitation statistics to large-scale SAT and SLP conditions implies a kind of downscaling, and it is possible to use empirical-statistical downscaling (ESD) methods to analyse observations for such inter-dependencies. We also assume that some large-scale features such as the North-Atlantic SAT and SLP can be predicted by coupled GCMs on a lead time from seasons to decades, due to the persistence and inertia in the world oceans. This study can be described as a 'perfect predictor experiment' (Maraun et al., 2015).
Seasonal-to-decadal forecasts can be made for a number of different aggregated statistics, such as mean, standard deviation, frequencies, number of events, as well as parameters for the 24-hour distributions. The exponential distribution has been used to describe temporal variations in the precipitation characteristics (Wilks, 1998;Burgueo et al., 2005;Benestad, 2013) although the observations suggest that 24-hour precipitation has a heavy tail bias (Benestad et al., 2012a(Benestad et al., , 2012b. The heavy tail bias with respect to the exponential distribution appears to be site specific and not time dependent. It has been more common to use the gamma rather than the exponential distribution to model 24-hour precipitation (Thom, 1958;Buishand, 1978;Wilks and Wilby, 1999;Wilby and Wigley, 2002). The gamma distribution is defined by two parameters: the shape and the scale parameters and is more appropriate for the analysis of the entire record of both wet and dry days. In this case, only the wet days were used to fit the exponential distribution. The mean associated with the gamma distribution is calculated as the product of the scale and shape parameters, whilst the exponential distribution is fully defined by the mean, which makes it more attractive.
We are not trying to analyse the extremes or estimate the return values in this paper. Extreme precipitation is normally analysed using Extreme Value Theory (EVT) that use the tail of the distribution for parameter fitting (Coles, 1999), however, return values derived from EVT usually assume a stationary probability density function (pdf). On the contrary, we assume that the pdf varies over time. The probability for precipitation amounts exceeding a given threshold is inferred from this assumption and the change in the entire pdf (the probabilities must add up to one).

Overview: methods & data
The simplest form of seasonal forecast for precipitation is a prediction of a total amount for a given season, or an average " x. However, precipitation has a number of different aspects, such as how often it rains, how much it rains on a rainy day, how long it is between each time it rains, and the duration of the rain event. The precipitation statistics differ between days when it rains (wet-day) and when it doesn't (dry day) because different physical processes and conditions are present. A complete picture of precipitation can be derived from parameters m, f w , the number of consecutive dry days (n cdd ), the number of consecutive wet days (n cwd ), type (rain, snow, hail), and the spatial extent; however, in this case we will limit the analysis to m and f w . The question is whether there are systematic dependencies between such phenomena and large-scale conditions such as particular SAT anomalies or SLP patterns. An estimate for the wet-day mean m can be related directly to extremes and higher quantiles if the distribution of the amounts is close to being exponential (Benestad et al., 2012a(Benestad et al., , 2012bBenestad, 2013). In this case, m can be regarded as a proxy for the precipitation amount. Extreme precipitation events can lead to floods caused by long wet spells as well as droughts which are products of extreme long dry intervals.
One practical problem is that the number of valid data points in a rain gauge record depends on the number of wet-days, and a wet-day frequency of Â0.3 implies approximately 30 valid data points for each season. This limitation is also a concern for monthly totals because data samples with a size of 30 (assuming it rains every day of a month) are low in terms of the central limit theorem (Wheelan, 2014) and the sample mean is subject to substantial statistical fluctuations. Furthermore, the stochastic nature of the sampling (a random point in space and a random interval in time) will be more pronounced for the higher quantiles and extremes. One solution could therefore be to calibrate statistical models on a larger data samples, such as using all days in a year for each estimate of m and mean values " x so that a wet-day frequency in the range 0.25Á0.30 yields approximately 100 valid data points (number of rain events). Another option could be to combine a given season from several consecutive years. One science question therefore is whether a statistical model for precipitation statistics is sensitive to the sample size. In other words, is a model calibrated on statistics derived from annual samples (365 days) valid for shorter and longer samples? The answer to this question would depend on whether there are different physical processes present that act on different time scales.

Data
The predictor data was taken from the 20th century reanalysis (Compo et al., 2011), based on monthly mean data and the ensemble mean. The reason for using this reanalysis was its long time span (1871Á2012) needed for the evaluation of the method. We used the ensemble mean of monthly mean SAT (0.995 sigma level) and mean SLP. The spatial domain was set to 508WÁ208E/408NÁ728N; however, five other tests with slightly different domains were also carried out (not shown). Furthermore, a slight variation of the spatial domain was also used for stations other than Bjørnholt, and the results shown here used a domain that extended 608W to 108E and 208S to 208N from the location of the respective station. These tests represented a crude and simple strategy for examining the sensitivity of the results to a subjective configuration choice and provided a means for identifying a skilful set-up. It is important to keep in mind the fact that sampling different set-ups constitutes a selection ('cherry picking'), and field significance should be carried out for hypothesis testing (Wilks, 2006). However, this was only true for Bjørnholt and not the other sites. Furthermore, our strategy did not constitute a proper optimisation of the method.
The two annual mean parameters describing the wet-day mean m and frequency f w were derived from daily rain gauge records taken from the climate archives of MET Norway (http://eklima.met.no) and from ECA&D (Klein Tank et al., 2002). The rain gauge records were selected based on the criteria that they included at least 130 yr, as long records were needed for the split-sample test. Figure 1 shows the location of the rain gauges. The 24-hour precipitation record from Bjørnholt in the forest north of Oslo was used as the primary data record. We let the capital letter X represent any random 24-hour precipitation, whereas the lower case symbol x is used to refer to a specific value. The exact definitions of m and f w are provided in the Appendix.

Experiment design
Empirical orthogonal functions (EOFs) (Lorenz, 1956;North et al., 1982;Wilks, 1995;Benestad et al., 2008) can be used to represent the large-scale SAT and SLP conditions, here referred to as predictors. A third predictor e s was derived from the SAT to represent the saturation vapour pressure according to the ClapeyronÁClausius equation (saturation evaporation pressure e s 010 (11.40Á2353/T) where T is temperature in degree Kelvin). The principal components (Strang, 1988) describe how the presence of the different large-scale EOF patterns vary over time, and were used as the independent variables (covariates) in an ordinary linear multiple regression. The regression involved a step-wise screening, based on the Akaike information criterion (Akaike, 1974). The regression-based ESD assumed that the data followed a normal distribution (see the Appendix).
The dependent variable in the regression represented the local precipitation statistics (m and f w ), and is referred to as predictand. Both predictors and predictands were aggregated on an annual basis for the calibration of the downscaling model. Both m and f w can be regarded as an estimate The location of the rain gauge measurements for more than 130 yr. The size of the symbols is proportional to the number of valid data and their colour is according to the correlation between observed and predicted m (inner part) or f w (outer part) over the independent period. of the mean that, according to the central limit theorem, converges to a normal distribution with sample size (f w is the mean of a series of zeroes and ones). We also checked how well m and f w conformed with the normal distribution though an analysis of the residuals. The precipitation data and the EOFs were also aggregated into seasonal and 5-yr (pentad) statistics respectively, to test whether a model trained on annual samples provided skilful predictions for smaller and larger samples. The purpose was to test the general validity of the model.
Common EOFs (Barnett, 1999;Benestad, 2001) were used for the model calibration to ensure that the seasonal and pentad EOFs corresponded to the same spatial patterns as the annual EOFs. The EOFs were computed independently for the large-scale atmospheric flow (SLP) and temperature anomalies (SAT), and the predictions were based on a step-wise approach where the ESD model first was calibrated on the e s EOFs for the downscaling of the wetday mean m and the SAT EOFs for f w respectively. The e s or SAT-based regression model was then used to predict the local quantities, and the resulting residuals were then used to calibrate another regression model that used SLP EOFs as predictors. The final predictions based on both e s /SAT and SLP were then synthesised by adding the predictions from the two steps. Limiting the EOFs to the first six leading ones imply both a selection of the largest spatial scales and reduces the risk of over-fit (Wilks, 1995), however, 12 EOFs were used for the locations other than Bjørnholt in order to improve the chances of finding a connection between local and large scales with a crude specification of predictor domain. A third step was introduced for the pentads to account for a possible link to the global mean temperature on the longer time scales (Benestad, 2013), where the residual of the downscaled pentad m from the SAT and SLP EOFs was subject to a final regression against the global mean temperature as in Benestad (2013).
The experiments and analyses were designed for the evaluation of forecasts against independent data, and hence the ESD model used the interval 1900Á1970 for calibration and independent data outside this interval for independent (out-of-sample) evaluation. The analysis was implemented in the R system for statistical computing (R Development Core Team, 2008, http://www.R-project.org/) and used the new R-package 'esd' (see the Appendix for more details).

Validation
The skill of the forecasts was assessed through the estimation of the Pearson correlation in the predicted and observed annual values m and f w . One reason for analysing annual mean values for these quantities is the expected characteristics of stochastic data of similar type: a normal distribution for the average of a continuous variable. Likewise, the skill of the predicted values for the annual wet-day 90th percentile q 90 was assessed in terms of its correlation against corresponding quantity from the observations. The skill of predicted probabilities was assessed by comparing the annual observed number of events with precipitation exceeding a high threshold (10 mm/day) against a predicted 90% confidence interval, based on probability estimates and a binomial distribution. One diagnostic involved comparing the residual to Gaussian white noise (distribution shape, trend, and lagged correlation). Figure 2 shows the downscaling results for the annual mean m at Bjørnholt, which was the same data record as Benestad and Melsom (2002) found a teleconnection link in the North Atlantic. The upper part of the figure shows the observations and predictions. For the independent data, the correlation was 0.27 (with 90% confidence interval of 0.01Á0.49), and for the entire period, the correlation was 0.38 (0.22Á0.52). The predicted values in the calibration interval are shown in faint colours, and the residual is shown in the lower part of the figure. The inserted diagrams show the auto-correlation function (ACF) and the normal quantileÁquantile plot (qÁq normal) for the residuals. One indication that the analysis has identified all dependencies is when the residuals have low auto-correlation and are close to being normally distributed (white noise). Hence, these results suggest that the downscaling accounts for most of the connection to larger spatial scales. Table 1 (first column) shows similar scores for some European stations whose locations are shown in Fig. 1, where the colour scale of the inner part of the symbols indicates the correlations for m. For a number of stations, the evaluation indicated no skill, however, there were also some stations with higher correlations, the highest being 0.42 for Strømsfoss sluse. Both the estimated and downscaled m for Bjørnholt suggest a slight increase during the last evaluation period since 1960. Figure 3 shows the results from a similar analysis as in Fig. 1, but for f w rather than m. The correlation between predicted and observed annual wet-day frequency was 0.48 for the out-of-sample data, with a 90% confidence interval of 0.26Á0.66. Hence the results were statistically significant at the 5%-level. The ACF gave no indication of persistence in the residual and the qq-plot showed that the residuals were close to being normally distributed. However, the residual had a number of pronounced jumps ('spikes') between intervals where the values varied to a much smaller degree and with a somewhat more persistent character. Again, the inspection of the residual suggests that the downscaling captured most of the large-scale dependency in f w .

Results
Since 1970, both the estimated and downscaled f w suggest a slight increase, in other words, there has been more rainy days. The first column in Table 2 indicates that the correlations for f w were in general higher than for m, often exceeding 0.50. The skill scores for f w are indicated by the colour of the outer circle of the symbols in Fig. 1. The highest score was 0.63 for Halden and Hohenpeissenberg. Figure 4 shows a set of maps indicating the regional patterns in surface temperature (upper) and mean sea-level pressure (lower) associated with high values for m (left) and f w (right) at Bjørnholt. The teleconnection patterns associated with m exhibited a temperature gradient with cold conditions over the British Isles and the Bay of Biscay and warm conditions over northern Norway. The corresponding pattern for the SLP was characterised by low pressure to the west of the British Isles. However, the patterns associated with high frequency of wet days differed from those related to high amounts: cold conditions south of Greenland and west of Britain and warm over the centralÁ northern European continent.
Figures 2Á4 presented results for annual statistics. One interesting question is whether models calibrated on annual samples of m can be used to predict variations on seasonal time scales. In this case, the definition of 'independent' was the seasons for the years which were not used for the calibration, and category 'all' also included the seasons for those years on which the annual aggregates were calibrated. Here we report the skills for all years, since the seasonal aggregates for the calibration period were not used to train the model. Figure 5 shows the results of such a test, however, the predictions for the seasons had little skill for m during winter (0.15), but higher correlations for spring, summer and autumn (0.25, 0.29, 0.39, respectively; Table 1). Seasonal differences have also been noted in earlier studies (Wilby and Wigley, 2000;Hanssen-Bauer et al., 2003). One difference between the winter and the other seasons is that more of the precipitation is in the form of snow. The predicted values were slightly biased with respect to the observations and did not capture the peaks, but there were only minor differences between the mean levels during the different seasons. Table 1 suggests that findings for Bjørnholt of generally low-to-moderate seasonal correlation skills for m (columns 4Á7) and high for f w (Table 2) are also fairly typical for the other European stations. A similar analysis for f w (Fig. 6) suggested even higher scores. Downscaling calibrated for the annual mean f w at Bjørnholt gave a correlation of 0.67 for winter, 0.64 for spring, 0.55 for the summer, and 0.76 for the autumn season. For the pentads (5-yr aggregated values), the correlation between predictions of m and the wet-day mean derived from the observations was 0.18 for all years ( (0.22,'0.25) and not statistically significant (Fig. 7). It is worth bearing in mind that the number of data points in the correlation analysis was lower than for the annual analysis and makes a difference in the estimation of the confidence intervals. The ACF gave no indication of persistence in the residual, but rather a slight lag-one anti-correlation. The qq-plot for the residuals after the regression analysis against the (regional) air surface temperature and mean SLP showed a divergence from the diagonal in the lower tail. The lack of skill on the 5-yr timescale was fairly typical for those sites with sufficient record lengths ( The column 'q 90 ' provides the correlation scores for the predicted q 90 as in Fig. 8. The correlations for the seasons and pentads were for all years. column). Similar exercise for f w rather than m indicated a reduction in the skill for the wet-day frequency on the 5-yr timescales (Table 2; last column). Figure 8 shows a comparison between the annual wetday 90th percentile q 90 from the observations, the corresponding values estimated from m assuming an exponential distribution, andq 90 predicted from SAT and SLP. The latter prediction took m from Fig. 2 and solved for q 90 ¼ À lnð1 À pÞlðSAT; SLPÞ. The year-to-year correlation between the downscaled and observed values was 0.28, and statistically significant at the 5%-level. In particular, the downscaling did not capture the magnitude of the major peaks in q 90 ; however, the value of q 90 derived directly from the observed m gave a closer reproduction of the multiyear variations (grey; correlation of 0.76). The q 90 was chosen rather than q 95 , as the former was less influenced by statistical fluctuations (f w Â0.3 gives about 120 data points per year) and the annual q 95 was strongly affected by a single year. In Table 1, the corresponding correlations are presented for the other European stations (column 3), and the general picture was a moderate and positive correlation between downscaled and observed q 90 .
One way to use the downscaled results for forecasting is to estimate the number of events exceeding a certain threshold value each year. The method predicts the probability Pr(X !x 0 ) based on the exponential distribution and f w , and the confidence interval for the number of events can be estimated according to the binomial distribution (see the Appendix). For each year, a 90% confidence interval for the predicted number of events was estimated according to the predicted probability, which was compared with the observed number of events (henceforth 'counts'). We would expect some of the counts to fall outside the confidence interval even if the predicted and observed data have an identical distribution, and the sum of the number of years out of a total of N years with the observed counts outside the predicted interval (n 05 ðtÞÁn 95 ðtÞ) is also expected to follow a binomial distribution for perfect predictions. Figure 9 shows a comparison between predicted number of precipitation events exceeding 10 mm/day [a threshold used by road authorities in Norway (Frauenfelder et al., 2013)] and the counts of observed events. The downscaled results reproduced the realistic multiyear variability with a correlation between the predicted annual median count n m (t) and observed count n o (t) of 0.6. However, the probability that the predictions conformed to the same statistical distribution as the 130 yr record was only 1% (i.e. different at the 5% statistical significance level). The predicted counts were outside the confidence interval in 22 of 130 yr, compared to a 90% confidence interval of 8Á19 according to the binomial distribution. In other words, the predicted annual count of annual heavy-precipitation events based on the downscaling exhibited a wider distribution than the observed statistics with a high proportion of values outside the 90% confidence interval (p01%).
On the other hand, the same analysis applied to the annual wet-day means and frequencies taken directly from the observations suggested a narrower distribution and gave only two cases with counts outside the 90% confidence The column 'n(X!10mm)' shows the correlation between the median number of days with rainfall greater than 10 mm/day as in Fig. 9. There was a bias for some stations that gave a shift in the predicted 90% confidence interval even with moderately high correlations. The correlations for the seasons and pentads were for all years.
interval, significantly lower than expected (8Á19). The assessment of skill should span over a range of threshold values, as shown in Fig. 10, displaying the probability of N being anomalously high or low (thin lines with symbols) for a range of different thresholds (x-axis). For thresholds x 0 in the range 20Á75 mm/day, the predicted results suggested a good correspondence with a binomial distribution with p 00.1. The number of observed counts outside the predicted 90% confidence interval (axis on the right), however, suggests that the value estimated from the observed annual mean m and f w was almost always within this range, and hence, the estimated confidence intervals were too wide (the predictions were 'underconfident'), with the exception of low threshold values (x 0 B10mm/day).

Discussion
The intention here was to emphasise the wet-day mean, as it is related to the statistics for variables that approximately follow an exponential distribution. Another point was to link the downscaled wet-day mean and frequency to the higher-order statistics, such as quantiles, number of days with heavy precipitation, and probabilities of exceedance. One explanation why the correlation skill for q 90 was lower than for n(X!10mm/day) may be that the latter was derived from both f w and m, that PrðX > x 0 Þ / f w , and that the downscaling of f w was associated with fairly good skill. The estimate of m also provided a reasonable approximation for the shape of the pdf on which the results were based. A simple test was carried out where either m or f w were constant and set to their mean value respectively (see the supporting material). The results of this test suggested that both m and f w contribute to the estimate of the probability, but that f w has a stronger influence on the results.
Another interesting question is why m seems to be more difficult to predict than f w . It was in general more difficult to establish a link between the large-scale climate conditions and local m, and the skill scores for m were low for a analysis (Wilks, 1995) also suggests a more homogeneous spatial pattern in f w than in m. The scores for f w in Fig. 1 also suggest generally good results in the north and reduced skill in the south and around the Mediterranean. Figure A2 in the supporting material shows a similar analysis as in Fig. 1 but for all years, and the inclusion of the dependent data clearly improves the scores. Both non-stationarity and varying data quality with time (Wijngaard et al., 2003) may explain some of these differences. The physical processes and conditions involved differ for m and f w (Linderson et al., 2004), where m is expected to bear the signature of the cloud microphysics and may be more strongly influenced by summer-time convection and associated temporal and spatial scales. A plausible physical picture is that the connection to large-scales involves mechanisms through which humidity is advected with a geostrophic flow caused by pressure differences. Another factor is the dependency of the rate of evaporation on the surface temperatures, with increased evaporation due to higher temperatures in accordance to the ClapeyronÁClausius equation. In addition, large-scale temperature gradients can affect the baroclinic stability and hence the genesis as well as the course of low-pressure systems, often associated with precipitation.
If higher m in summer were due to more convective events, then the results for Bjørnholt may suggest that it may to some extent be possible to predict the aggregated values for m from large-scale temperatures, vertical stability, or humidity. Another question is whether the low skill for m for Bjørnholt in winter can be attributed to more snow and less liquid rain, for instance under-catch (Wolff et al., 2014). Another explanation is that low values of m during winter means that the calibration with its annually aggregated value emphasises the warmer seasons with typically higher values. The predictor map for m ( Fig. 4; left panels) suggests that the higher values of the wet-day mean at Bjørnholt were associated with a low-pressure system over the British Isles and a southwestÁnortheast temperature gradient.
The corresponding results for the other stations gave a mixed picture, with some locations exhibiting similar scores as for Bjørnholt (green symbols in Fig. 1) Fig. 7. Same as Fig. 2 but for pentads. The residual does not suggest any persistence and is close to being normally distributed, after a third regression against the global mean temperature was included. There is a modest correlation between the predictions and observations, but the results do not indicate that a model calibrated on annual data is valid for multi-annual time scales.
indicated low skill (red symbols in Fig. 1). There may be several reasons for the geographical variations in skill, such as the use of a fixed predictor domain size and the different locations being exposed to different prevailing winds and weather types. Another reason may be that the ECA&D data set is not homogeneous (Wijngaard et al., 2003). Furthermore, the experimental set-up may not have sufficiently captured the geographic variations in the link between small and large scales, and including 12 EOFs is associated with a higher risk for over-fit than including only six. The marginal distribution for 24-hour precipitation tends to have a fat tail compared to that of the exponential distribution (Benestad et al., 2012a(Benestad et al., , 2012b, and hence the exponential distribution will tend to under-estimate the probabilities associated with extreme precipitation. More accurate results are obtained with proper EVT (Coles, 1999). The Appendix provides a crude comparison between return-values derived based on EVT and the framework based on the exponential distribution presented in this paper. If the imprecision associated with the latter is acceptable, then crude estimates for daily precipitation return periods can be derived from downscaled results for m and f w .
It is also questionable whether the 20 C reanalysis used in these 'perfect predictor experiments' provide a sufficient representation of the past SAT and SLP. Changes in the surface network of observations may have an effect on its degree of homogeneity, and the lack of upper-air observations and imperfect estimates of historical sea surface temperatures imply that the reanalysis has some limitations. We have not addressed those here, but such caveats mean that the analysis presented here is not quite equal to 'perfect predictor experiments'. Different experiments can shed light on processes connecting the large-scale conditions and the local rainfall statistics, such as 'pseudo-reality'-based ones. Such a study is beyond the scope of this paper, which addresses the question whether we can make actually useful forecasts of precipitation statistics seasons to decades ahead, given large-scale patterns in surface temperature and mean SLP. A 'pseudo-reality' set-up could also by-pass problems such as imperfections in the reanalysis data.
A concern for seasonal-to-decadal forecasting is whether GCMs are able to skilfully forecast the large-scale patterns in SAT and SLP over the North-Atlantic on a seasonal-todecadal time scales, and the greatest bottle neck in terms of providing such forecasts for the North-Atlantic region seems to be due to the limited understanding of the role of various physical processes, the presence of complex chaos, and whether the GCMs actually capture crucial precursory signals. The coarse spatial resolution in the ocean models used in the GCMs may limit their ability to adequately represent ocean currents and associated heat transport. Furthermore, Benestad et al. (2010) carried out a number of simulations with different Arctic sea-ice conditions and found that the temperature response was pronounced for Europe but not systematic, suggesting a strong presence of non-linearities. European teleconnection mechanisms may involve dynamical processes which are hard to forecast, such as the Arctic vortex, the NAO, jet streams, and storm tracks. The roles of oceanic heat distribution and currents are still not well known. In this study, the same predictor set-up was used for all the European stations, however, the skill will probably be higher for some locations with a different predictor choice. Seasonal forecasts tend to have more skill in the Tropics, e.g. for the ENSO, and the same methods used here may provide useful estimates of probabilities for heavy rainfall seasons ahead. Fig. 9. Predicted and observed number of intense precipitation events, defined by exceeding the threshold value x 0 010mm/day. Symbols represent the counted events based on rain gauge data, grey shading the 90% confidence interval implied by m and f w derived from the observations, and red shading the 90% confidence interval derived through the downscaling exercise. The confidence intervals were derived taking the 5 and 95% percentiles from a binomial distribution where the probability for exceeding the 10 mm/day was taken from eq. (A4). The filled symbols mark the years when the number of events was outside the 90% confidence interval.

Conclusions
We show how ESD can be applied to parameters that traditionally have not been commonly used in downscaling, such as predicting probabilities and number of intense precipitation events. Furthermore, we provide a contribution to a more 'comprehensive evaluation' (Murphy et al., 2010) to measure the compound effect of downscaling biases in terms of the number of events predicted over time. However, we do not attempt to assess the compound skill of GCM and downscaling here.
The results presented here suggest that it is possible to estimate local precipitation statistics with some predictive skill given large scale SAT and SLP conditions. The wetday frequency in particular seems to bear a strong dependency to the large-scale atmospheric conditions, although a weak link was also found for the wet-day mean m which is a proxy for the precipitation amount. A prediction of precipitation statistics on seasonal-to-decadal time scales based on these results will be probabilistic with approx-imate estimates, and hence the presence of uncertainties will linger on. The precipitation amount (m) was sensitive to time scale in certain circumstances and hence sample size. A possible reason for this may be the presence of different physical processes with different intrinsic time scales, most likely in the large-scale predictors and seasonal differences between processes producing rain and snow. Hence, the different time scales in the predictors and the predictands may reflect different physical processes in terms of the rainfall amount on a wet day. The weak skill for winter may also be due to the generally low precipitation amounts during winter and that the annual wet-day mean emphasises the season with higher rainfall. The wet-day frequency was generally not sensitive to seasonal or annual time scales, which may suggest that similar physical mechanisms responsible for the wet-dry occurrence is present on seasonal and annual time scales. The analysis for 5-yr time scales failed to find dependencies between large and small scales for the chosen predictors. The results suggest that models calibrated on annual data samples may in some cases provide skilful forecasts on seasonal time scales only.

Acknowledgements
This work has been supported by the Norwegian Meteorological Institute and SPECS (EU Grant Agreement 3038378). We also like to thank Reidun Gangstø, Inger Hanssen-Bauer and Oskar Andreas Landgren for reading and commenting on the manuscript.

Appendix Definitions
The wet-day mean is the mean value of all days where the amounts exceed the threshold value x 0 01mm/day: where N is the rain gauge sample size (number of days) and the total number of wet days is H is the Heaviside function. The wet-day frequency is the number of days where the amounts exceed the threshold value x 0 01mm/day divided by the total number of days: Hðx > x 0 Þ=N (A2)

Method in detail
Daily precipitation data were sorted into wet (X!1mm/ day) and dry days as in Benestad et al. (2012aBenestad et al. ( , 2012b and Benestad (2013), where the precipitation statistics were characterised by the annual wet-day mean m and the annual The right axis and the thick line indicate the number of observed events that were outside the predicted 90% confidence interval for a range of different threshold values x 0 (x-axis). The left axis and thin lines indicate the probability associated with the observed number of intense wet events outside the predicted 5Á95% annual interval estimated according to a binomial distribution. The number of years with events outside the confidence interval are counted and then the probability is estimated according to the binomial distribution, the sample size (number of years), and a probability of 10% for being outside the predicted 90% confidence interval. The red curves show results from the downscaling exercise whereas the grey curves show the results derived from m and f w estimated directly from the rain gauge data.
wet-day frequency f w . The ordinary all-day mean precipitation is the product of the wet-day frequency and mean " x ¼ f w l.
The wet-day 24-hour precipitation was taken to be approximately exponentially distributed, whereas the variable describing a wet or a dry day was expected to follow a binomial distribution (o # [0,1]). In contrary to the variables, statistical parameters describing the annual mean wet-day mean m and the annual mean wet-day frequency f w follow a normal distribution according to the central limit theorem (Wheelan, 2014), as seen in the qq-plots in Figs. 2 and 3. The design of the experiment where regression analysis was applied to annually aggregated statistics was motivated by the convenience of an expected normal distribution according to the central limit theorem.
The experiments used regression-based ESD, involving step-wise screening applied to the principal components describing the EOFs, and the regression model used in the ESD was where x SAT (t) refers to the principal components associated with the EOFs describing the annual mean SAT, and o is an error term describing the residual. Equation A3 was repeated after m was replaced by the residual o and x SAT (t) with the principal components describing the SLP (x SLP (t)) and the final downscaled results were the sum of the two steps.
On a separate account, we wanted to examine the possibility to predict reliable probabilities for extreme precipitation events, given a skilful prediction of m and based on the prediction of pdfs for 24-hour precipitation. Let the 24-hour precipitation be approximately represented by an exponential distribution, for which the mean value is l ¼ 1=k and variance r 2 ¼ 1=k 2 ¼ l 2 . If the wet-day precipitation follows an exponential distribution, then any quantile q p is directly specified in terms of the wet-day mean m: We also can solve for the value of any return value x s analytically if we know the fraction of wet days f w and the data actually follows an exponential distribution (we know that this is not the case here, as the tail is thicker than for the exponential distribution): 1 s ¼ f w PrðX > x s Þ ¼ f w e Àx s =l ; therefore, We can then estimate the probability of exceedance X > x s for a year and the number of events according to a binomial distribution. We let the probability for an event X > x s be p ¼ f w e Àx s =l , and use the binomial distribution to estimate the number of events k estimated for a given interval with length n: C k n p k ð1 À pÞ ðnÀkÞ . It is possible to compare x s from the equation above with the return values estimated from proper EVT using general extreme value distribution (GEV) (Coles, 1999) (Fig. A1). The return values for Bjørnholt was estimated using the Rpackage 'evd' and fgev which subsequently were compared with x s .

The R-package 'esd'
The wet-day mean m and frequency f w were derived from a set of methods provided by the R-package 'esd' (Benestad et al., 2014) developed under the R core programming tool (R Core Team, 2014): muBÀas:annualðy; FUN ¼ 0 exceedance 0 ; threshold ¼ thresholdÞ fwBÀas:annualðy; FUN ¼ 0 wetfreq 0 ; threshold ¼ thresholdÞ The 'esd' package is made freely available by the Norwegian Meteorological Institute (MET Norway) and is for use by  Fig. A1. A comparison between return values derived from GEV and eq. (A6) for Bjørnholt suggests that the exponential distribution has a tendency to under-estimate the return-levels. Still, the 'error' associated with the estimates based on the simpler expression for x s are less than 10% compared with the GEV-based estimates (not shown). Hence, a rough estimate for return-levels can be obtained by downscaling the wet-day mean m and frequency f w . the climate community. It was primarily built for statistical downscaling of climate variables and parameters from global (e.g. ENSEMBLES, CMIP3/5 model results) and/ or regional climate models (EURO-CORDEX), and has recently been extended to deal with data I/O, statistical analysis and visualisation. It consists of (1) retrieving and manipulating large samples of climate and model data from various sources, and (2) computing Empirical Orthogonal Functions (EOFs), PCA, and Multi-Variate Regressions (MVRs) using very simple R commands and procedures (See R Script in the supporting material). The esd package has been built with the emphasis of traceability, compatibility, and transparency of the data, methods, procedures, and results. As the 'esd' package has been built on R system, it inherits from the large number of R built-in functions and procedures. Finally, it can be easily tailored to various climate analysis with few adaptations such as the one described in this paper. An example of programming commands is described hereafter including a four-steps procedure. The first step consists of retrieving daily precipitation from the MET Norway archive using the function 'station' as the following yBÀstationðsrc ¼ 0 metnod 0 ; param ¼ 0 precip 0 Þ The second step consist of getting global climate model output and reanalysis data using the function 'retrieve' as the following gcmBÀretrieveð 0 any global climate model 0 Þ; reaBÀretrieveð 0 any reanalysis 0 Þ: The third step consists of computing the common EOFs using the function 'EOF' as follows ceofBÀEOFðcombineðrea; gcmÞÞ: Finally, the downscaling of y from the common EOF 'ceof' is achieved using the function 'DS' as the following dsBÀDSðy; ceofÞ: