Calibrating regionally downscaled precipitation over Norway through quantile-based approaches

Dynamical downscaling of earth system models is intended to produce high-resolution climate information at regional to local scales. Current models, while adequate for describing temperature distributions at relatively small scales, struggle when it comes to describing precipitation distributions. In order to better match the distribution of observed precipitation over Norway, we consider approaches to statistical adjustment of the output from a regional climate model forced with ERA-40 reanalysis boundary conditions. As a second step, we try to correct downscalings of historical climate model runs using these transformations built from downscaled ERA-40 data. Unless such calibrations are successful, it is difficult to argue that scenario-based downscaled climate projections are realistic and useful for decision makers. We study both full quantile calibrations and several different methods that correct individual quantiles separately using random field models. Results based on cross-validation show that while a full quantile calibration is not very effective in this case, one can correct individual quantiles satisfactorily if the spatial structure in the data is accounted for. Interestingly, different methods are favoured depending on whether ERA-40 data or historical climate model runs are adjusted.


Introduction
The intensification of climate research over the past decade has produced a steadily increasing number of data sets combining different global circulation or earth system models, CO2 emissions scenarios and downscaling techniques. Turning future projections into robust and reliable information available at a local scale is imperative for the successful modelling of impacts of climate change on nature and society. The comprehensive financial and safeguarding challenges of mitigation and adaptation call for thorough validation, improvement and extension of current downscaling techniques.
The comparison of climate models to weather data raises interesting statistical problems. For a statistician, the most natural definition of climate is that it is the distribution of weather (and other earth system variables) over multi-decadal timescales (Smith et al., 2010; Guttorp, 2014). A climate model (general circulation model or, more generally, earth system model) describes the distribution of observable variables based on physical principles. Because some of the processes (e.g. convection, clouds) occur on scales smaller than the large grid squares needed to approximate a solution to the Navier-Stokes equations, such processes are often calculated using simple approximations (or parameterizations).
A multitude of models have emerged for projection of future climate change at different spatial (and temporal) scales. Essential in the process of going from the coarse resolution of the global models to finer spatial scales are the regional climate models (RCMs). Such models propagate information from a coarse-scale model along the boundary of a higher-resolution area of interest, using a more detailed terrain description, model solutions at finer resolution, and improved physical process parameterizations. The boundary conditions may be computed either from a global weather model forced with updated historical observations to calculate consistently the state of the atmosphere (reanalysis), or from a global climate model. A regional model using reanalysis boundary conditions is sometimes said to be run in "weather forecasting mode", and is the closest one can hope to get to observed weather using a regional climate model. Maraun et al. (2010) described approaches to downscaling precipitation.
One major purpose of regional climate models is to give end users such as stakeholders and decision makers a representation, preferably a reliable projection, at a practically useful spatio-temporal scale, of future weather. In the insurance industry, for instance, the interest lies in high precipitation projections under various possible future scenarios to assess the changing risk of damage to buildings or flooding (Scheel and Hinnerichsen, 2012). Typically, scenario runs are built from the regional model forced by global coupled ocean-atmosphere or earth system model runs. The question then becomes how reliable these regional models are at the scale needed by the actual effect study. For example, to understand patterns of risk for the insurance of buildings, precipitation at the meso- to local-scale level is needed. Orskaug et al. (2011) compared precipitation from the HIRHAM regional model (Bjørge and Haugen, 1998), run over Europe and forced with ERA-40 reanalysis (Uppala et al., 2005) boundary conditions, to a gridded precipitation product for Norway (Jansson et al., 2007) on a 25 × 25 km² scale. They used a variety of statistical measures for comparing the two data sets. The regional model output was found to describe low levels of precipitation fairly well, but failed to reproduce large quantities. Maule et al. (2013) found that most of the regional models in ENSEMBLES give fairly accurate descriptions of drought indices and other functions of low precipitation regimes. These findings, together with the need for representative scenarios called for by most impact studies, serve as motivation for improving the local-scale description of extreme future precipitation.
It has long been understood that regional models tend to be regionally biased in terms of precipitation (e.g. Christensen et al., 2008; Monjo et al., 2014; Mishra et al., 2014). Bias correction is an approach that attempts to adjust (statistically or otherwise) the climate model output (regional or global) to make it closer to observed data for historical runs. The idea is then that applying this bias correction to future simulations should also provide more realistic projections. Kerkhoff et al. (2014) developed a framework for assessing the bias correction, and the assumptions needed to apply such corrections to projections. They focus on temperature data, and can therefore assume normal distributions, which is not appropriate for precipitation.
There are a variety of bias correction methods in the literature (Maraun et al., 2010, contains a review). The simplest is a multiplicative correction to make the empirical means of data and RCM output agree (Lenderink et al., 2007). Some authors (e.g. Schmidli et al., 2007) prefer to make the correction only to the mean of precipitation on rainy days. An intermediate approach adjusts the coefficient of variation of data and model output (Teutschbein and Seibert, 2012). More advanced methods try to match quantiles, either by fitting gamma distributions (with a point mass at zero) to data and models (Piani et al., 2010) or non-parametrically using full quantile mappings (Themeßl et al., 2011). The bias corrections are typically done grid square by grid square, without an explicit spatial model for between-grid-square dependence. Quantile corrections (or smoothed estimates thereof) are typically found superior in comparative assessments (Gudmundsson et al., 2012; Räty et al., 2014; Fang et al., 2015).
In this paper we consider approaches to statistical adjustment of the regional model output, obtaining a calibrated product that is closer in distribution to the observed data than the original output. We first investigate the Doksum shift function (Doksum, 1974), which makes a full quantile calibration, as the basic tool for adjustment. Next, we restrict ourselves to less ambitious models that correct individual quantiles separately. Considering gridded data products covering Norway, we build transformations either separately for each grid cell, or via models that incorporate some kind of spatial structure. The models are fitted to a training set of downscaled ERA-40 data, and then used to correct downscaled ERA-40 data on a test set. We also try to correct downscalings of historical climate model runs using the same transformations built on downscaled ERA-40 data. Unless such calibrations are successful, it is difficult to argue that scenario-based downscaled climate projections are realistic and useful for decision makers.
The paper continues as follows. In Sect. 2 we present the various data sets used in the analysis. Section 3 deals with using the shift function to do full quantile bias correction, while Sect. 4 focuses on bias correction of individual quantiles. Section 5 discusses the potential use of the methodology in assessment and uncertainty quantification of regional climate models.
All code needed to run the analysis on the data is available at Bolin et al. (2016).

Data
The data used in this study constitute 40 years of daily precipitation values for the Norwegian mainland, covering the period 1961 to 2000. The data set is twofold: one part consists of dynamically downscaled model data (ERA-40 reanalysis and climate model), and the other is a gridded product based on in situ observations. A more thorough description of the data is given in Orskaug et al. (2011).

ERA-40 reanalysis
Reanalysis data express the best physically consistent estimate available of the historical state of the atmosphere. They are formed in retrospect by feeding various sources of past meteorological observations into a current meteorological forecast model. ERA-40 reanalysis data (Uppala et al., 2005) are a product of the European Centre for Medium-Range Weather Forecasts in the UK.
Downscaled ERA-40 data are collected from the ENSEMBLES project website (http://ensemblesrt3.dmi.dk) (Christensen et al., 2010). Gridded large-scale ERA-40 data along the boundary of an integration area covering most of Europe are dynamically downscaled to weather variables on a grid with a spatial resolution of 25 × 25 km², which amounts to 777 grid cells covering the Norwegian mainland. The downscaled ERA-40 reanalysis data will be referred to as dERA40 in this paper. The downscaling is done by the Norwegian Meteorological Institute using their HIRHAM Regional Climate Model (Bjørge and Haugen, 1998).

Observation-based data
Precipitation is measured daily at stations irregularly distributed across Norway. Based on all observations of precipitation available at every time step, high-resolution precipitation grids (1 × 1 km²) are estimated using a Delaunay triangulation (Jansson et al., 2007). The interpolated precipitation values are adjusted locally by taking into account the deviations between triangulated station elevations and the ground heights given by a detailed terrain model. Also, prior to the interpolation, observed precipitation is corrected for exposure-dependent undercatch due to wind loss (Førland et al., 1996).
In order to compare the two data sets, the 1 × 1 km² observation grid was aggregated onto the larger 25 × 25 km² grid of dERA40. This was done by collecting all 1 × 1 km² grid cells with centre points within a dERA40 cell, and taking their average as a representation of the measured precipitation inside that grid cell. We use the abbreviation OBS for this data set.
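This aggregation step can be sketched with a small, hypothetical Python example (the paper's actual processing code is the R code at Bolin et al., 2016); here `cell_of` is an assumed precomputed mapping giving, for each fine cell, the index of the dERA40 cell containing its centre point:

```python
import numpy as np

def aggregate(fine_vals, cell_of):
    """Average fine-grid values into coarse cells; cell_of[j] is the index
    of the coarse cell containing the centre point of fine cell j."""
    cell_of = np.asarray(cell_of)
    n_coarse = cell_of.max() + 1
    sums = np.bincount(cell_of, weights=fine_vals, minlength=n_coarse)
    counts = np.bincount(cell_of, minlength=n_coarse)
    return sums / counts
```
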

Climate model data
The global Bergen Climate Model, BCM, data set (Furevik et al., 2003) is downscaled by the Norwegian Meteorological Institute using the same HIRHAM Regional Climate Model as for dERA40, and is also collected from the ENSEMBLES project website. This downscaled climate data set has the same spatial resolution of 25 × 25 km². The downscaled BCM climate model data set will be referred to as dBCM in this paper.

Full quantile calibration of Norwegian precipitation
The results of the evaluation of the regional model (Orskaug et al., 2011) underline the need for enhanced climate projections at a local scale. Discrepancies between the distributions of observed and downscaled precipitation exist for the whole range of data, suggesting that a full quantile calibration function is needed. In Sect. 3.1 we address this issue using a calibration that makes the model data distribution closer to that of the observed data.

Distributional calibration using Doksum's shift
We characterize the transfer function between two distribution functions, in our case those of a model and of observations, using Doksum's shift function (Doksum, 1974). To define this function, consider two distributions, $F$ and $G$, and let

$\Delta(x) = G^{-1}(F(x)) - x.$

If $X$ is a random variable with cumulative distribution function $F$, it is easy to see that $X + \Delta(X) \sim G$. In other words, $\Delta(x)$ measures how much the distribution $F$ needs to be shifted at a value $x$ in order to coincide with the distribution $G$. The shift function can be estimated using empirical distribution functions for $F$ and $G$:

$\hat{\Delta}(x) = \hat{G}^{-1}(\hat{F}(x)) - x,$

where $\hat{F}$ and $\hat{G}$ are the empirical cumulative distribution functions of $F$ and $G$, respectively. The empirical cumulative distribution function is a step function:

$\hat{F}(x) = \frac{1}{n}\sum_{i=1}^{n} I(x_i \le x),$

where $I(A)$ is the indicator of event $A$, and $(x_1, \ldots, x_n)$ are observations of independent and identically distributed real random variables distributed according to $F$. We follow the standard statistical notation where $X_i$ stands for a random variable and $x_i$ for its observed value.
If the shift function is constant, it means that there is only a difference in location between the two distributions (and particularly if that constant equals zero there is no difference between the distributions).If it is linear, a location-scale transformation is implied.
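As a minimal illustration (a Python sketch, not the authors' R implementation), the empirical shift function can be computed by composing the ECDF of the model data with an empirical quantile function of the observations; `grid` is an assumed set of values at which to evaluate the shift:

```python
import numpy as np

def empirical_shift(x, y, grid):
    """Empirical Doksum shift Delta(v) = Ghat^{-1}(Fhat(v)) - v, where
    Fhat, Ghat are the ECDFs of x (model output) and y (observations)."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.asarray(y, dtype=float)
    # Fhat evaluated at each grid value: proportion of x_i <= v
    p = np.searchsorted(x, grid, side="right") / len(x)
    # Left inverse of Ghat (empirical quantile function) at those probabilities
    q = np.quantile(y, np.clip(p, 0.0, 1.0), method="inverted_cdf")
    return q - np.asarray(grid, dtype=float)
```

If the two samples come from the same distribution, the estimated shift is close to zero; a constant positive shift indicates a pure location difference, matching the interpretation above.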
Assume next that a region is divided into $S$ grid cells. For grid cell $i$, $i = 1, \ldots, S$, let $X_i$ denote downscaled model precipitation and $Y_i$ observations. Let $F_i$ be the cumulative distribution function of $X_i$ and $G_i$ that of $Y_i$. Assume further that we have downscaled model output $x_{it}$ and observations $y_{it}$ for days $t = 1, \ldots, T$ (using a common $T$ implies no loss of generality; should the number of data points for $F$ and $G$ rather be $T_F$ and $T_G$, respectively, those are used instead). Our interest lies in distributional coherence between the downscaled model data and the observations, rather than daily correspondence between $x_{it}$ and $y_{it}$. Calibration of a new value $x^{\mathrm{uncal}}_{it}$, drawn from the same distribution $F_i$ but with $t$ not necessarily in $1, \ldots, T$, is done by adding its Doksum shift,

$x^{\mathrm{cal}}_{it} = x^{\mathrm{uncal}}_{it} + \hat{\Delta}_i(x^{\mathrm{uncal}}_{it}) = \hat{G}_i^{-1}(\hat{F}_i(x^{\mathrm{uncal}}_{it})), \qquad (1)$

showing that this calibration is indeed a full quantile calibration.

Transferability of calibration
Assume that we want to apply the calibration in Eq. (1) to data from another data set, i.e. to $z^H_{it}$ not necessarily distributed according to $F$ and where $t$ may or may not overlap with $1, \ldots, T$. A typical example would be extending the calibration established for a reanalysis to a historical climate model run. As before, we use data from $F$ and $G$ to calculate the empirical distributions $\hat{F}_i$ and $\hat{G}_i$, respectively, where $\hat{F}_i$ is estimated from $x_{it}$, $t = 1, \ldots, T$, and $\hat{G}_i$ from $y_{it}$, $t = 1, \ldots, T$. We then correct the new data set by

$z^{H,\mathrm{cal}}_{it} = z^H_{it} + \hat{\Delta}_i(z^H_{it}) = \hat{G}_i^{-1}(\hat{F}_i(z^H_{it})).$

Shift function calibration results
In Orskaug et al. (2011) it was shown using several criteria that the dERA40 and OBS data sets lack agreement. A detailed comparison of specific local features showed that the global disagreement was due to poor agreement for high quantiles. As mentioned in Orskaug et al. (2011), the day-by-day correlation is partly lost when downscaling the ERA-40 data; hence we compare distributions rather than using day-by-day test measures. We test the calibration models described in Sects. 3.1 and 3.2. The dERA40 data are calibrated using Doksum's shift function as described in Sect. 3.1. In particular, for a specific $x^{\mathrm{uncal}}_{it}$ from the test data set, its calibrated value $x^{\mathrm{cal}}_{it}$ is calculated from Eq. (1), where $\hat{F}$ and $\hat{G}$ are both estimated from the training data.
Assuming independence between the test statistics for different grid squares, if all null hypotheses are true, we would expect about 39 spurious significances at the 95 % confidence level in a plot with 777 grid cells. We have carried out the same kind of comparisons as in Orskaug et al. (2011), but here we only report the Fisher test of the 95th percentile. Figure 1 shows substantial amounts of rejections (74 %) for the uncalibrated dERA40, with an improvement (24 % rejections) for the calibrated data, and a deterioration (48 % rejections) for the downscaled Bergen climate model. Things are worse for the Kolmogorov-Smirnov test, in particular for the climate model data (77, 18 and 79 %, respectively). Since we are estimating the calibration from the training data, we do not expect to get only 5 % rejections in the test set. Furthermore, spatial dependence also affects the rejection rates.
The main reason for the difficulty of making a full quantile calibration is that the bulk of the distribution is concentrated around very small precipitation values, and the Kolmogorov-Smirnov statistic tends to focus on these well-estimated parts of the distribution, where very small differences in amounts are the reason for rejection. It would be natural to hope to use a spatial model to borrow strength from nearby grid squares. However, the high variability in the quantile correction for large values (occurring since there are relatively few high observations of precipitation) makes it difficult to fit a spatial functional model. Instead we will focus on directly calibrating quantities of greater interest for adaptation, namely high quantiles, where the full calibration did somewhat better.

Calibrating individual quantiles
We now focus on calibrating a fixed quantile, $q$, over a time period of $r$ days. Because of the cross-validation comparison we will perform in the next section, we denote these time periods as folds. An observation $Y^q_{i,k}$ is the $q$th empirical quantile at location $s_i$ and fold $k$. That is,

$Y^q_{i,k} = G^{-1}_{i,\,r(k-1)+1:rk}(q),$

where $G^{-1}_{i,\,r(k-1)+1:rk}(q)$ is the left inverse of the empirical distribution function made from the observations corresponding to the days of fold $k$, $Y_{r(k-1)+1:rk}$. To link the downscaled precipitation to $Y^q_{i,k}$, we construct $X^q_{i,k}$ in the same way from the downscaled data. The goal is now to predict the spatial field $Y^q_{i,k}$ using a calibrated version of $X^q_{i,k}$, where the calibration is estimated using all data that are not in fold $k$, that is, $Y^q_{i,-k}$ and $X^q_{i,-k}$, where the subscript $-k$ stands for all folds except the $k$th.
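The extraction of per-fold quantiles can be sketched as follows (an illustrative Python version; the paper's code is in R). The `inverted_cdf` method corresponds to the left inverse of the empirical distribution function used in the definition above:

```python
import numpy as np

def fold_quantiles(daily, q, r):
    """Split a (T, S) array of daily values at S locations into consecutive
    folds of r days, and return the q-th empirical quantile for each
    location and fold; output has shape (K, S) with K = T // r."""
    T, S = daily.shape
    K = T // r
    folds = daily[:K * r].reshape(K, r, S)
    # "inverted_cdf" is the left inverse of the empirical distribution function
    return np.quantile(folds, q, axis=1, method="inverted_cdf")
```
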
We will use the notation $Y^q_k$ for the vector $[Y^q_{1,k}, \ldots, Y^q_{S,k}]$, and similarly for $X^q_k$. Further, we will denote a diagonal matrix with diagonal entries $b$ by $\mathrm{diag}(b)$.
As a baseline method, we use the empirical quantile $Y^q_{i,-k}$ at each location as the predictor of $Y^q_{i,k}$. We will later denote this as Model 0, and it should be noted that this prediction does not use the downscaled data. However, it should be a reasonable prediction assuming that the climate is stationary.
As a reference model, we use the smoothing spline method that performed well in Gudmundsson et al. (2012). We will later denote this as Model Ref. The method matches (all) quantiles of the model output to (all) quantiles of the observations using a cubic spline regression, for days with non-zero precipitation.

Model 1: linear regression
As a first method for doing the calibration, we do linear regression with $X^q_{i,k}$ as covariate. Since the model is for precipitation data, which are positive and right-skewed, we formulate the regression on log scale as

$\log(Y^q_{i,k}) = \alpha + \beta_i \log(X^q_{i,k}) + \varepsilon_{i,k},$

where $\varepsilon_{i,k} \sim N(0, \sigma^2)$. Note that we have one parameter $\beta_i$ for each location, and thus a spatially varying calibration of the downscaled data. The parameter estimates $(\hat{\alpha}, \hat{\beta})$ are obtained by ordinary least squares, and we use

$\hat{Y}^q_{i,k} = \exp\!\big(\hat{\alpha} + \hat{\beta}_i \log(X^q_{i,k})\big).$

That is, we use the median (and not the mean) of the implied lognormal distribution as point estimate. Finally, to apply the method to other data sets as discussed in Sect. 3.2, one simply replaces $X^q_k$ with $X^{H,q}_k$.
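A minimal Python sketch of this idea, with the simplifying (hypothetical) assumption of a separate intercept per location rather than the shared intercept of the model above:

```python
import numpy as np

def fit_predict_model1(Yq_train, Xq_train, Xq_new):
    """Per-location regression on log scale (illustrative Model 1 sketch,
    assuming a separate intercept alpha_i per location):
        log Y_{i,k} = alpha_i + beta_i * log X_{i,k} + eps_{i,k}.
    The point prediction exp(alpha_i + beta_i * log x) is the median
    (not the mean) of the implied lognormal distribution.
    Yq_train, Xq_train: (K, S) arrays of per-fold quantiles (all > 0);
    Xq_new: (S,) covariate values for the held-out fold."""
    K, S = Yq_train.shape
    pred = np.empty(S)
    for i in range(S):
        lx = np.log(Xq_train[:, i])
        ly = np.log(Yq_train[:, i])
        beta, alpha = np.polyfit(lx, ly, 1)  # slope first, then intercept
        pred[i] = np.exp(alpha + beta * np.log(Xq_new[i]))
    return pred
```
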

Model 2: incorporating the spatial dependence
There is clearly spatial dependence in the data, which we want to incorporate in the model to improve the predictions. We can do this by assuming that the regression coefficients are spatially dependent, using the stochastic model

$\log(Y^q_k) = \alpha\mathbf{1} + \mathrm{diag}\!\big(\log(X^q_k)\big)\,\beta + \varepsilon_k,$

where again $\varepsilon_k \sim N(0, \sigma^2 I)$, and the vector of coefficients $\beta$ is Gaussian with covariance matrix $\Sigma$, $\Sigma_{ij} = C(\|s_i - s_j\|)$, where $C$ is a Matérn covariance function (Matérn, 1960):

$C(h) = \phi\,\frac{2^{1-\nu}}{\Gamma(\nu)}\,(\kappa h)^{\nu} K_{\nu}(\kappa h).$

Here $\phi$ determines the variance of the process, $\nu$ is a shape parameter of the covariance function, and $\kappa$ determines the correlation range.
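The Matérn covariance can be evaluated as follows (a sketch assuming SciPy is available for the modified Bessel function $K_\nu$; with $\nu = 0.5$ it reduces to the exponential covariance $\phi e^{-\kappa h}$):

```python
import numpy as np
from scipy.special import gamma, kv

def matern_cov(h, phi, nu, kappa):
    """Matern covariance C(h) = phi * 2^(1-nu)/Gamma(nu) * (kappa h)^nu * K_nu(kappa h),
    with C(0) = phi (the variance of the process)."""
    h = np.atleast_1d(np.asarray(h, dtype=float))
    out = np.full(h.shape, float(phi))   # limit at h = 0 is phi
    pos = h > 0
    hk = kappa * h[pos]
    out[pos] = phi * 2.0**(1.0 - nu) / gamma(nu) * hk**nu * kv(nu, hk)
    return out
```
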
A more computationally efficient alternative to the covariance-based model would be to use a Markov random field prior on $\beta$, similar to that of Bolin et al. (2009). However, this is not needed since the data are measured at only 777 spatial locations, and we therefore use the simpler covariance-based approach here.
The model parameters $\theta = \{\alpha, \kappa, \sigma, \phi, \nu\}$ are estimated using maximum likelihood. The log-likelihood is a sum over folds of multivariate Gaussian log-densities with covariance matrices

$\Sigma_j = \mathrm{diag}\!\big(\log(X^q_j)\big)\,\Sigma(\nu, \kappa, \phi)\,\mathrm{diag}\!\big(\log(X^q_j)\big) + \sigma^2 I.$

We find the ML estimates $\hat{\theta} = \{\hat{\alpha}, \hat{\kappa}, \hat{\sigma}, \hat{\phi}, \hat{\nu}\}$ of the parameters using numerical optimization of the log-likelihood function. Specifically, the function optim in R (R Core Team, 2015) is used for the optimization; see the source code at Bolin et al. (2016) for further computational details. The predictor of $Y^q_k$ is then obtained from the conditional distribution of $\log(Y^q_k)$ given the data and $\hat{\theta}$. To apply the model to other data one simply replaces $X^q_k$ with $X^{H,q}_k$.
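Each fold contributes a multivariate Gaussian log-density term of the following form; summing these terms over folds (with the fold-specific mean and covariance implied by the model) and negating gives an objective that can be passed to a numerical optimizer such as `scipy.optimize.minimize`, the Python analogue of the R `optim` call used by the authors:

```python
import numpy as np

def gauss_loglik(z, mu, Sigma):
    """Log-density of a multivariate normal N(mu, Sigma) evaluated at z,
    computed via a Cholesky factorization for numerical stability."""
    L = np.linalg.cholesky(Sigma)
    r = np.linalg.solve(L, z - mu)                 # solves L r = z - mu
    logdet = 2.0 * np.sum(np.log(np.diag(L)))      # log |Sigma|
    return -0.5 * (len(z) * np.log(2.0 * np.pi) + logdet + r @ r)
```
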

Model 1s and Model 2s: pre-smoothing the covariates
A somewhat surprising feature of the data is that the quantiles of the observed data, $Y$, are spatially smoother than the downscaled climate model output (see Fig. 2). Because of this, it is natural to add a step in which the covariate is smoothed spatially before it is used in the regression model. This is done using the following model:

$\log(X^q_k) = \log(\tilde{X}^q_k) + \varepsilon_k.$

Here $\log(\tilde{X}^q_k)$ are independent realizations of a Gaussian Matérn field and $\varepsilon_k \sim N(0, \sigma_x^2 I)$. We estimate the parameters using numerical maximization of the log-likelihood, where the marginal covariance matrix is $\hat{\Sigma} = \Sigma(\nu_x, \kappa_x, \phi_x) + \sigma_x^2 I$. Model 1s and Model 2s are then obtained by using the smoothed field $\log(\tilde{X}^q_k)$ instead of $\log(X^q_k)$ as covariate in Model 1 and Model 2, respectively.
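Under this measurement-error model, a natural smoother is the conditional expectation of the latent field given the noisy covariate. A sketch, with the simplifying (hypothetical) choice of the empirical mean of the data as the field mean:

```python
import numpy as np

def presmooth(z, Sigma, sigma2):
    """Conditional mean of the latent field x given z, under
        z = x + eps,  x ~ N(m 1, Sigma),  eps ~ N(0, sigma2 I):
        E[x | z] = m + Sigma (Sigma + sigma2 I)^{-1} (z - m).
    Here m is taken as the empirical mean of z (an illustrative choice)."""
    z = np.asarray(z, dtype=float)
    m = z.mean()
    A = Sigma + sigma2 * np.eye(len(z))
    return m + Sigma @ np.linalg.solve(A, z - m)
```

When the noise variance is zero the smoother returns the data unchanged; as it grows, values are shrunk towards the mean, which is the qualitative behaviour wanted here.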

Results for individual quantiles
In this section, we evaluate the performance of the methods described in Sect. 4 for calibrating individual quantiles. As the tests in Sect. 3 were based on the 95 % quantile, we focus on predicting $Q^{\mathrm{OBS}}_{0.95}$. The results in Sect. 3 rested on predicting the last 20 % of the data (8 years) based on the first 80 %. Here, we make a more detailed investigation using 8-fold cross-validation to evaluate the performance of the models. The data are divided into eight 5-year periods, and the quantile for each 5-year period is predicted using a model estimated on the rest of the data. The quantiles for the observations, dERA40, and dBCM data for the first 5-year period can be seen in Fig. 2.
The results using the various models can be seen in Table 1, and the results obtained when training the models on the dERA40 data but using dBCM for prediction are shown in Table 2. One can note that the extension of Model 1 to Model 2 by adding spatial dependency does not improve the results for the dERA40-based predictions, whereas it greatly improves the dBCM-based predictions. Furthermore, pre-smoothing improves Model 1 for the dERA40 predictions, whereas it improves Model 2 for the dBCM predictions. The reference model Model Ref performs very similarly to Model 1.
Overall, Model 1s performs best for the dERA40 predictions, whereas Model 2s performs best for the dBCM predictions. For the dBCM predictions, Model 2s has satisfactory performance compared with the target performance for that case, which is given by the Model 0 results. An example prediction using this model can be seen in Fig. 3.
The results for the different seasons are summarized in Table 3. For all seasons, the conclusion is that Model 1s is preferable if we both train and test the model on dERA40 data, whereas Model 2s is favoured if we train the model on dERA40 data and use that transfer function to calibrate dBCM input. The reason for this is likely that the additional smoothing done in Model 2s compensates for the added uncertainty when using dBCM data in a model trained on dERA40 data.

Discussion
The low quality of Norwegian precipitation in the HIRHAM regional model forced by reanalysis (Orskaug et al., 2011) necessitates a full quantile recalibration/bias correction. Our assessment of such a recalibration on test data indicates that it does a credible job of correcting the dERA40 model, even under changing weather conditions. In order to apply the calibration to climate projections, which is the ultimate goal of this research, we first experiment with the same regional model using a global climate model (GCM), run using historical forcings and corrected using the same calibration as for dERA40. Downscaled global models are unable to describe the observations well. When correcting these downscaled global models, we would of course not expect to get a perfect calibration to data, but would hope that the downscaled GCM/earth system model would describe a distribution similar to that of the observations over a reasonably long period. Unfortunately, this is not the case. Instead of adjusting the entire distribution, we are able to achieve better performance by focusing on adjusting an individual quantile. In that case we achieved error rates indicating that the corrected downscaled climate model performed almost as well as the reanalysis-forced downscaling, suggesting that this approach can be a useful tool in downscaling climate projections of precipitation over Norway.
There is a case in between the full quantile adjustment and the individual quantile adjustment, namely simultaneous adjustment of several quantiles. This will be the subject of further research.
The sensitivity of regional dynamic downscalings to the lateral boundary conditions is well known (e.g. Rummukainen, 2010, and references therein), and one possibility would be to downscale other reanalyses and compare the results. Since we did not have access to such RCM runs, we were not able to pursue this. On the other hand, we have been able to look at other RCMs (such as the Swedish RCA3; Samuelsson et al., 2011, with the same reanalysis as boundary condition) and other GCMs (such as the Hadley Centre HadCM3Q0 model). The bias correction based on other regional and global models is very similar to that based on HIRHAM and BCM (results not shown).

Figure 1 .
Figure 1. Fisher tests of the 95 % quantile of the winter season: uncalibrated dERA40 test data (left panel), calibrated dERA40 test data (middle panel), and calibrated dBCM test data (right panel). The plots show significance level α = 5 %.
The models of Sects. 3.1 and 3.2 are tested by considering different seasons separately. The seasons used are winter (December to February), spring (March to May), summer (June to August) and autumn (September to November). In this setup, the dERA40 and OBS data are further divided into a training set and a test set. The training set is used to fit the calibration model. The transfer function thus obtained is applied to dERA40 data for the test period, which is then compared to observations for the test period. The training data are chosen to be the first 80 % of the total data, i.e. the years from 1961 to 1992. The test data are the last 20 % of the data, i.e. the years from 1993 to 2000.

Figure 3 .
Figure 3. Example calibration for the pointwise 95 % quantiles using Model 2s with the dBCM covariate (right).The result is for the final 5-year time period in the cross-validation study.The observed quantiles (left) and uncalibrated dBCM (middle) are shown as references.

Table 1 .
Cross-validation mean square error for the 95 % quantile.The best model for each fold is displayed in bold.

Table 2 .
Cross-validation mean square error for the 95 % quantile using dBCM predictions.The best model (excluding Model 0) for each fold is displayed in bold.For this comparison, Model 0 can be seen as the target value we want to reach with the dBCM-based predictions.

Table 3 .
Average mean square error for models trained on dERA40 data.The values in the table are averages across the eight folds for each season.