Evaluating NARCCAP model performance for frequencies of severe-storm environments

An important topic for climate change investigation is the behavior of severe weather under future scenarios. Given the fine-scale nature of the phenomena, such changes can only be analyzed indirectly, for example, through large-scale indicators of environments conducive to severe weather. Climate models can account for changing physics over time, but if they cannot capture the relevant distributional properties of the current climate, then their use for inferring future regimes is limited. In this study, high-resolution climate models from the North American Regional Climate Change Assessment Program (NARCCAP) are evaluated for the present climate state using cutting-edge spatial verification techniques recently popularized in the meteorology literature. While climate models are not intended to predict variables on a day-by-day basis, like weather models, they should be expected to mimic distributional properties of these processes, which is how they are increasingly used and therefore this study assesses the degree to which the models are actually suitable for this purpose. Of particular value for social applications would be to better simulate extremes, rather than inferring means of variables, which may only change by small increments thereby making it difficult to interpret in terms of the impact on society. In this study, it is found that the relatively high-resolution NARCCAP climate model runs capture areas, spatial patterns, and placement of the most common severe-storm environments reasonably well, but all of them underpredict the spatial extent of these high-frequency zones. Some of the models generally perform better than others, but some models capture spatial patterns of the highest frequency severe-storm environment areas better than they do more moderate frequency regions.


Introduction
Predicting extreme weather events is a difficult challenge even for relatively high-resolution weather models because of the scale difference between the models and the very fine scales attributed to some high-impact weather events, such as tornadoes and hail storms.Coarse-scale climate models, therefore, cannot directly provide information about the distribution, and other characteristics of interest (such as timing and location), of such events.However, it is possible to simulate large-scale environments that are more favorable for severe weather (cf.Brooks et al., 2003;Thompson et al., 2007;Trapp et al., 2007Trapp et al., , 2009;;Diffenbaugh et al., 2013;Gensini et al., 2014;Elsner et al., 2015;Tippett and Co-hen, 2016;Tippett et al., 2014Tippett et al., , 2015)).Brooks et al. (2003) found that concurrently high values of convective available potential energy (CAPE; J kg −1 ) and 0-6 km vertical wind shear (S; ms −1 ) are useful large-scale indicators for environments conducive to severe weather.The conversion of CAPE into the (theoretical) maximum updraft wind speed (W max = √ 2 × CAPE) has a clearer connection to severe weather, and it has the same units as S (ms −1 ; cf.Trapp et al., 2007Trapp et al., , 2009)).For example, using the same data analyzed in Brooks et al. (2003), Gilleland et al. (2013) found that the product of W max and S yields a clearer distinction in probability distributions stratified by increasingly severe storms, and that a product, W max • S, > 225 m 2 s −2 , in particular, was Published by Copernicus Publications.E. Gilleland et al.: Frequency of severe-storm environments found to be associated with a fairly high likelihood for severe storms, with higher values (i.e., products > 500 m 2 s −2 ) indicating higher likelihoods for severe or worse (e.g., significant tornadic) weather.
Several studies have analyzed climate model output for these variables.For example, Trapp et al. (2007) used version 3 of the high-resolution regional climate model produced by the Abdus Salam International Centre for Theoretical Physics (RegCM3) to investigate future changes in CAPE and S under the Special Report on Emissions Scenarios (SRES, Nakicenvoic and Swart, 2000) A2 emission scenario.They looked at various measures, such as the number of days that the product of CAPE and S exceeded a high threshold locally.Van Klooster and Roebber (2009) also investigated changes under the A2 emission scenario, but using the coarse-resolution global Parallel Climate Model, and Gensini et al. (2014) examined both future and present convective environments using a dynamically downscaled global climate model (GCM), namely the Weather Research and Forecasting regional climate model (WRF-G) from the North American Climate Change Assessment Program (NARCCAP; Mearns et al., 2007Mearns et al., ,2014Mearns et al., , 2009)).
A few studies analyzed current reanalysis data using statistical extreme value techniques to project future scenarios.For example, Mannshardt and Gilleland (2013) investigated changes in the very extreme values at each grid point of W max • S separately using the National Center for Atmospheric Research (NCAR)/National Centers for Environmental Prediction (NCEP) reanalysis.Heaton et al. (2011) (henceforth, NCEP reanalysis) applied a rigorous spatial extreme value model using hierarchical Bayesian techniques to the same reanalysis data over North America.Gilleland et al. (2013) took a very different approach whereby they studied patterns of W max •S conditional upon the existence of extreme W max • S activity in the spatial field.
The aim of this study is to evaluate how well regional climate models from NARCCAP are able to capture frequencies of high values of the product of W max and S (henceforth WmSh); following Trapp et al. (2009), conditioning WmSh to be zero unless CAPE ≥ 100 J kg −1 and 5 ≤ S ≤ 50 ms −1 in order to ensure that there are sufficient amounts of both CAPE and S, without having too much S (values of S larger than 50 ms −1 greatly reduce any potential storm activity).In particular, it is desired to investigate how well these relatively high-resolution models capture spatial patterns of common severe-storm environments defined herein to be when the upper quartile (over space) of WmSh exceeds 225 m 2 s −2 .Analogous to Gilleland et al. (2013), attention is restricted to spatial patterns of frequency when conditioning on high field energy, defined to be when the upper quartile over space is large.
2 Reanalysis data and model output 2.1 NARR reanalysis data CAPE and S have been calculated from the NCEP North American Regional Reanalysis (NARR) product (http: //www.emc.ncep.noaa.gov/mmb/rreanl/Mesinger et al., 2006).NARR provides the "best guess" of the state of the atmosphere in the past on a reasonably high-resolution grid (32 km) after assimilating various observational sources (station data, rawinsondes, drop sondes, aircraft, geo-stationary satellites, etc.) into a model.The data product provides values on a much higher resolution grid than the NCEP reanalysis.
Of course, use of an analysis to evaluate a model has certain advantages and disadvantages.The main advantage is the availability of gridded values to directly compare to the model grid over any domain of interest.However, comparing a model-based field to an analysis is not the same as comparing the model directly to observations at points, because of the smoothing associated with most analyses.When the analysis is derived using an initialization field from the model being evaluated, the comparison results also can be highly biased and give very different results then would be arrived at using an analysis derived from a different model (e.g., Park et al., 2008).However, of relevance for this study, the NARC-CAP model runs are not directly based on the NARR.Moreover, because the primary purpose of this study is comparison of the performance of the various models (rather than absolute evaluation of each model), comparison of the models to the NARR analysis is of less concern and allows derivation of the fields of interest.

NARCCAP regional climate output
The primary focus of the present study concerns analyzing high-resolution regional climate model (RCM) output produced as part of NARCCAP (Mearns et al., 2007(Mearns et al., ,2014)), which consists of a matrix of combinations of RCMs driven by a suite of GCMs (Mearns et al., 2009).The resolution of the regional models is about 50 km on a grid covering most of North America.The models were run under the SRES (Nakicenvoic and Swart, 2000) A2 scenario for the future period 2041-2070, but also include runs for the present period 1971-2000.For verification, the models are also driven by the NCEP/Department of Energy Reanalysis II product (Kanamitsu et al., 2002) for the period 1979-2004, hereafter dubbed the NCEP-driven simulations.In order to establish an understanding of model uncertainty, different combinations of RCMs and AOGCMs are used.Bukovsky (2012) analyzed the performance of the full set of six RCMs of the NARCCAP project driven by the NCEP reanalysis for 2 m temperature trends from 1980 to 2003.It was found that the RCMs have some capability to simulate such resolved-scale trends in the spring, and to some extent, the winter.However, results for other seasons may be more dependent on the type and strength of the underlying observed forcing.Precipitation over California was considered by Caldwell (2010) using all NARCCAP model outputs that provided this variable.They found that RCMs forced by the NCEP-reanalysis tend to overpredict precipitation over California, and they concluded that this result was caused by overprediction of extreme events where otherwise the frequency of precipitation events was underpredicted.Interannual variability in NARCCAP RCMs is analyzed by de Elía et al. (2013), where it was found that important departures between RCMs and observations exist, which is also the case for some of the driving models.Nevertheless, they found that the expected climate change signal remained consistent with previous studies.
Table 1 summarizes the models and model combinations used for CAPE, S, and WmSh analyzed in this study.As can be seen, various combinations of RCMs driven by GCMs are employed, and are designed to provide a subset of combinations that constitute a minimal set that can still produce statistically significant results in terms of model uncertainty.
To enable comparison between outputs from different models, the final conditional frequency fields are re-gridded from the native grids of the RCMs to a common half-  Current (1971Current ( -2000) ) and future (2041-2070) model combinations from NARCCAP (Mearns et al., 2007(Mearns et al., ,2014) )  degree grid.Because these fields are relatively smooth, the results will be relatively insensitive to the exact interpolation method used.For this work, the polynomial patch interpolation algorithm implemented by the Earth System Modeling Framework, which takes the local derivatives of the field in a neighborhood around the interpolation point into account, is implemented (ESMF Joint Specification Team, et al., 2014).

Methods
For this paper, the focus is on evaluating patterns of the largescale indicators for severe weather conditional upon having high field energy.Following Gilleland et al. (2013), high field energy is taken to mean that the upper quartile (over space) of the variable of interest is larger than its 90th percentile (over time).For example, from the space-time process for WmSh, a new univariate time series is calculated that represents the upper quartile of WmSh over space; this univariate time series is called q75.Then, for time points when q75 is large (defined to be when it is greater than its 90th percentile over the entire time series) the frequencies of WmSh exceeding 225 m 2 s −2 are found for each grid point, resulting in a single spatial field that summarizes where the most intense severestorm environments are found.This resulting summary field is denoted ω, and represents the frequency at each grid point when severe-storm environments occur most often.Figure 1 shows ω for the NARR and each model configuration from Table 1.Similarly, CAPE alone is also analyzed, and its conditional frequency field (for CAPE ≥ 1000 J kg −1 ) is denoted, κ (Fig. 2).For convenience, Table 2 displays some of the notations used throughout the text.
A CAPE value of 1000 J kg −1 is an arbitrary choice, but represents a value associated with severe weather environments as found in previous studies (e.g., Brooks et al., 2003;Trapp et al., 2007;Gilleland et al., 2013;Heaton et al., 2011).The value of 225 m 2 s −2 is also an arbitrary choice, but is around the value obtained when converting CAPE from 1000 J kg −1 to W max and then multiplying by S = 5 ms −1 , which again results in a strong severe-storm environment.Of course, WmSh could have a value of 225 m 2 s −2 with far lower CAPE (i.e., with higher S), but because CAPE is also conditioned to be at least 100 J kg −1 and S at least 5 m 2 s −2 , the environment is guaranteed to be conducive to severe weather.

Spatial pattern and location displacement measures
Several methods are available to summarize the performance of a model in terms of how well it captures the spatial patterns, locations, and general shape of observed variables for various thresholds of intensity, and we summarize the three used in the present study, namely: (a) Baddeley's , (b) mean error distance, and (c) the forecast quality index (FQI).Most of these summary measures can be calculated from distance maps (cf.Fig. 3, which shows an example).A distance map is a graphic that shows, for each grid point, x, the shortest distance from x to the nearest "event", where an event is defined by a grid cell that exceeds the given threshold.Of particular interest is the image in the third row of the figure, which shows the absolute difference in distance maps for binary fields A and B; several popular measures are derived directly from this difference field.Also of interest is to mask out the distance map A using the binary field B, and vice versa (bottom row of the figure).
The Hausdorff distance is a widely known location measure that simply takes the maximum value of the absolute differences between distance maps for two fields (e.g., the maximum value from the image in the third row of Fig. 3).The metric has often been modified in order to have measures that are not as sensitive to small changes in the two fields resulting from taking the maximum value.The Baddeley image metric (Baddeley, 1992a, b), for example, is a modification of the Hausdorff metric that replaces the maximum with an L p norm, and, in general, further modifies the shortest distances between each grid point and the event usually by setting any distances greater than a certain amount to a constant.That is, where N is the total number of grid points, the summation is over all points, s, in the grid, and f is any continuous func- tion on [0, ∞) such that f (x+y) ≤ f (x)+f (y) and is strictly increasing at zero with f (t) = 0 if and only if t = 0.For example, a common choice is to take f (x) = min{x, constant}.The function's purpose is to eliminate edge effects, but the metric is highly sensitive to the choice of constant.The parameter p is chosen by the user.If p = 1 a straight average of values is achieved.The Hausdorff metric can be regained by letting p tend to infinity.In the limit as p approaches zero, one obtains the minimum of the portion within the absolute values of Eq. ( 1).
In terms of Fig. 3, Baddeley's metric applies a function f to each of the two fields in the second row, then takes the absolute difference between these fields, and finally takes the L p norm over the resulting image.Following Gilleland (2011), here, f (x) = min{x, constant}, but the constant is set to infinity and p to two, so is simply the L 2 norm of the image in the third row of the figure.
The mean error distance (MED) is the mean of the distance map for one field taken over only the events in the other field.In other words, again using Fig. 3 as an example, MED(A, B) averages the image in the second row and first column over the event space defined by the binary field for B (top row second column).Note that the MED is not symmetric because, in general, MED(A, B) = MED(B, A).In fact, the lack of symmetry provides a useful measure for diagnosing how close (or far away), in terms of average distance, forecast events are from those observed when calculating MED(Observation, Forecast), and vice versa for MED(Forecast, Observation).See Gilleland (2016b) for more information about this approach.If the two values are close together, then it suggests better agreement between the two fields in terms of both placement and numbers of events.For example, in the figure , MED(A, B) would be rather small (average of the non-white areas in the bottom left panel) because B does not have any events far away from A, which is generally indicative of good agreement between the two fields.However, MED(B, A) would be comparatively large Product of W max and S conditional upon WmSh (m 2 s −2 ) * CAPE ≥ 100 J kg −1 and 5 ≤ S ≤ 50 ms −1 Time series of the upper quartile of a random q75 variable, x, taken over space High field energy q75 larger than its 90th percentile (taken over time) Frequency of CAPE ≥ 1000 J kg −1 κ conditional upon high field energy Frequency of WmSh ≥ 225 m 2 s −2 ω conditional upon high field energy * Previous studies (namely, Gilleland et al., 2013;Mannshardt and Gilleland, 2013) use the abbreviation WmSh to refer to the straight product of W max • S, whereas here it has the further constraint that CAPE must be at least 100 J kg −1 and 5 ms −1 ≤ S ≤ 50 ms −1 .
(average of the non-white areas in the bottom right panel) because field A has a large event that B lacks.A further modification of the Hausdorff distance, proposed by Venugopal et al. (2005) specifically for verifying highresolution forecasts, normalizes the measure through an average of partial Hausdorff distances (PHD) obtained for surrogate fields, which results in a measure that, like , has desirable mathematical properties.It is one of the rare location metrics that also incorporates intensity information in addition to just spatial pattern information.First, stochastic realizations of the observed process that are forced to have the same Fourier spectra, probability density function, and spatial correlation structure as the observed field, called surrogate fields, are drawn to be used as a normalizing factor.Then, the FQI between two fields A and B, where A is the observed field, and C i is the ith of n surrogate realizations of A, is given by Here, PHD k is the partial Hausdorff distance using the kth largest shortest distance value, µ and σ denote the mean and standard deviation, respectively, over the field.The denominator on the right is derived from another image summary index called the universal image quality index (UIQI; Wang and Bovik, 2002), which is the denominator on the right multiplied by the correlation between the two fields.Venugopal et al. (2005) refer to the denominator as the modified UIQI; the first component of which is a measure of the model field's bias, and the second of the variability.The UIQI ranges between −1 and 1, with a value equal to one indicative of a perfect match between the two fields.A smaller value of UIQI indicates a lot of variability.

Feature-based analysis
Numerous methods are used in meteorology that fall under the category of feature based.The idea is to identify individual features within a field, and analyze those features for various characteristics, for example, those discussed in Sect.3.1 above.In this study, the method proposed by Davis et al. (2006a) is loosely followed.In meteorological applications, fields often have multiple features of interest, which would also be the case in the present context if a much larger domain were employed.However, even with a larger domain, the relative smoothness of these climate fields results in few distinct objects.Subsequently, it is very straightforward to match features between fields (e.g., as was proposed by Davis et al., 2006aDavis et al., , b, 2009, with the Method for Objectbased Diagnostic Evaluation, or MODE), as well as to merge features within a field.The fields of κ and ω demonstrate only one or two features in each model field, all of which are clearly matched with the one NARR feature that shows up.Therefore, the task is fairly simple.
Here, features are defined by simply identifying contiguous grid points that exceed a threshold of 75 % frequency; indicating areas where storm favoring environments are found to occur very often.Various summary properties are evaluated and compared in the present study.Focus is centered on the distances between the centers of mass of the features be- tween each field (centroid distance), the area ratio (defined to be the area of the smaller feature divided by the area of the larger one; though here the model feature areas are always smaller), the common intersection area (given as a percent), and Baddeley's described in section 3.1.Other properties are shown for information, but are not highly useful as comparative measures.

Field deformation
Field deformation methods deform the spatial locations of the forecast field so that the values of the forecast variable better align with those of the observations.Numerous different methods for deforming the field are available, and many have been proposed in the atmospheric science liter-ature for forecast verification (e.g., Keil andCraig, 2007, 2009;Gilleland et al., 2010b, c;Gilleland, 2013;Marzban and Sandgathe, 2010), forecast calibration (e.g., Hoffman et al., 1995;Alexander et al., 1998;Nehrkorn et al., 2003;Levy et al., 2013;Kleiber et al., 2014), and data assimilation (e.g., Hoffman and Grassotti, 1996;Nehrkorn et al., 2015) as well as short-term forecasting (e.g., Aberg et al., 2005).In this study, we follow the image warping approach of Aberg et al. (2005), which utilizes a pair of thin-plate spline transformations (cf.Dryden and Mardia, 1998, chapter 10) to make the deformation mapping, which maps a subset of k control locations from the observed field, or 0-energy field, to k locations in the forecast field, dubbed the 1-energy field.
www.In many image warping applications, the set of 0-and 1energy control locations are easily found by hand.For example, if comparing the images of one person's face to another, it is easy to choose corresponding features, such as the point of the nose or the top of the head.While it is also possible to choose the control locations by hand, an alternative approach is preferred here, whereby the 1-energy control locations are found through a numerical optimization procedure.The objective function to be optimized follows the approach of Aberg et al. (2005), Gilleland et al. (2010b, c), Gilleland (2013), and originally introduced by Glasbey and Mardia (2001), which is effectively the root mean square error (RMSE) between the observed and deformed forecast field plus an additional penalty term for mappings that cover too much distance or are too nonlinear.The latter term helps to prevent obtaining deformations that yield non-physical deformations, such as folding.The penalty term includes a precision matrix, or the inverse of a covariance matrix for the mappings.Aberg et al. (2005) employ a precision matrix that is zero for locations separated by a certain distance, and positive otherwise, which helps to obtain deformations where nearby locations move in similar directions.In this study, the bending energy matrix, described below, is used for the deformation, which penalizes nonlinear deformations.
The pair of thin-plate spline transformations is the bivariate function, (s) = ( 1 (s), 2 (s)) T = a + Gs + W T ψ(s), where the set of all locations, s, in the domain is d × 1 (here, the dimension d = 2), and ψ(s) = (ψ(s − p 0,1 ), . .., ψ(s − p 0,k )) T , where p 0,i , i = 1, 2, . .., k are the 0-energy control locations, and The mapping has dk+d 2 +d parameters: (a) the d ×1 vector, a, (b) d × d matrix G, and (c) the k × d matrix W, which for d = 2 results in 2k + 6 parameters.The natural thin-plate splines used herein are subject to the further constraints that the columns of coefficients in W sum to zero (i.e., 1 T W = 0) and that the sum of the products of these coefficients times the 0-energy control locations is also zero (i.e., p T 0 W = 0).The set of equations can be written succinctly in matrix form as The inverse matrix, L −1 , is of particular importance because when performing the numerical optimization, this matrix needs only to be calculated one time at the beginning, and it defines the resulting warp function, which is a linear function of the 1-energy control locations and the upper left k × k partition of L −1 , denoted by L 11 .That is, W = L 11 p 1 .The matrix L 11 is also known as the bending energy matrix because it determines the amount of the nonlinear deformation; the matrices a and G give the linear, or affine, part of the deformation.
Note that because of the constraints on W, there are also three constraints on L 11 .Namely, 1 T k L 11 = 0 and p T 0 L 11 = 0 (recalling that p 0 is k × 2).The transformations imposed by Eqs. ( 3) and (4) minimize the total bending energy of all other possible interpolating functions from the 0-energy control locations p 0 to the 1-energy control locations p 1 , and the total minimized bending energy (referred to henceforth as simply the bending energy) is easily found from trace(p T 1 L 11 p 1 ).In order to find the optimal mapping of p 0 to p 1 , the 0energy control locations are chosen and fixed, and then the p 1 locations are moved until an objective function is minimized.Denoting the 1-energy field by Ẑ and the 0-energy field by Z, the objective function used here is the same as that of Gilleland et al. (2010c), and is given by (5) where the x and y subscripts denote the two component coordinates of the control locations, β is a penalty term chosen a priori to determine how much or little nonlinear warps should be penalized, and σ ε is a nuisance parameter giving the error variance between the 0-energy and deformed 1-energy fields.
The objective function (Eq.5) results from the penalized likelihood under an assumption of Gaussian errors between the 0-energy and deformed 1-energy fields, and potentially provides a means for obtaining confidence intervals on the deformations, but this potential will be left for future work.Gilleland et al. (2010c) began with two identical and regular sets of control locations, and used a multi-step procedure that begins with four control locations and a highly smoothed set of fields, ratchets the number of control points up iteratively with decreasingly smoothed sets of fields to minimize the objective function Q from Eq. ( 5), which enables a completely automated method for finding the optimal deformation.Because there are only 9 × 2 = 18 warped fields to find here, the domain is relatively small, and only a very small number of control locations are required (four to eight, whereas 200 were used in Gilleland et al., 2010c), a less automated procedure is employed in this work.First, about four control locations are selected by hand in the 0-energy field, and an attempt is made, again by hand, to identify where those locations map to those in the 1-energy field.These 1energy control locations are then used as initial values in the numerical optimization routine used to minimize Q.

Spatial prediction comparison test
The spatial prediction comparison test (SPCT) is a test introduced by Hering and Genton (2011) that is a spatial modification of a similar time series test introduced by Diebold and Mariano (1995), and it provides a statistical hypothesis test for two competing forecast models, m 1 and m 2 , compared against the same observation, a that accounts for spatial correlation.Hering and Genton (2011) found the test to be both powerful and of the right size provided the range of dependence is not too long, even in the face of contemporaneous correlation (i.e., when m 1 is correlated with m 2 ).First, a loss function, g, must be chosen and applied to each model against a, giving g 1 = g(m 1 , a) and g 2 = g(m 2 , a).For example, absolute error (AE) loss yields g(x, y) = |x−y|.Then the loss differential field, d, is calculated by taking the difference at each spatial location between g 1 and g 2 (i.e., g 1 −g 2 ).
After checking for the existence of spatial drift in the spatial loss differential field, and removing any spatial trend before proceeding, the empirical variogram for d is found, say γ , using all lags up to half of the maximum possible lag for the study region.Next, a parametric variogram model is fit to γ ; following Hering and Genton (2011), the exponential variogram is used here.The test statistic is the usual Student's t or normal approximation for the paired sample test of the mean of the loss differential field, but where the standard error is estimated by averaging the values from the spatial correlation function, by way of a linear combination of the parametric variogram fit to γ over all lags of the domain.In other words, the test statistic, S v , to test whether or not the mean loss differential, d, is significantly different from zero is where with θ estimated parameters of the parametric variogram model evaluated at each lag h ij .Because the spatial fields presently studied all have a reasonably large number of grid points, the normal approximation of the test is used throughout.
Here, the test is conducted under AE loss first, and then with AE + deformation loss for comparison.The latter was proposed by Gilleland (2013), and allows for both spatial displacement/pattern error and intensity errors to be simultaneously incorporated into the test, while also accounting for spatial correlation.The loss is achieved by finding the AE between the observed field and the deformed forecast field (cf.Sect.3.3) and adding these errors to the (Euclidean) distance each point "traveled" in order to achieve the re-aligned field.Threshold (percentile) Baddeley's delta metric q q q q q q q q q q q q q q q q q q q q q q q q q q q 0 20 40 60 80 0 5 10 15 Threshold (percentile) Mean error distance (F, O) q q q q q q q q q q q q q q q q q q q q q q q q q q q 0 20 40 60 80 0 5 10 15 Threshold (percentile) Mean error distance (O, F) q q q q q q q q q q q q q q q q q q q q q q q q q q q Figure 4. Baddeley's (p = 2, c = ∞; top left), mean error distance conditioning on observed events (top right) and mean error distance conditioning on "forecast" events (bottom left) for κ.

Results
Figure 4 shows the results for the location measures for ω, the frequency of WmSh greater than 225 m 2 s −2 conditional on high field energy (see Table 2 for notation).It can be argued from visual inspection of the graphic that the models driven by the HadCM3 global model are closest to reproducing the patterns of ω associated with the NARR reanalysis.This result is consistent across the thresholds for mean error distance, but the HRM3-HadCM3 has higher (worse) Baddeley metric values for the highest thresholds.In terms of capturing the spatial structure of the most frequent events for WmSh, the CCSM3-driven runs are the least similar to those found in the NARR, and the WRFG-CGCM3 performs the worst in terms of capturing the spatial patterns of ω according to the Baddeley metric; the results for κ (not shown) are similar.Of course, these results do not account for sampling uncertainty, so no conclusions can be made with statistical significance based on these measures.A feature-based analysis is also conducted (Tables 3  and 4), which provides similar, but more detailed information about how the fields compare to one another.Tables 3 and 4 show summary statistics for identified ω features (Table 3) and feature comparisons for matched (possibly first merged) ω features (Table 4) after having set a threshold of having at least 75 % frequency of occurrence. 1 In each case, it is clear that the HRM3-HadCM3 does the best job of all of the models at achieving a roughly correct spatial pattern for the most frequent ω areas.It has a relatively low centroid distance, angle difference, and Baddeley value, as well as one of the highest area ratios (0.90) and intersection areas (also 0.90; tied for highest with WRFG-NCEP).Moreover, it has the same number of identified features above the 75 % threshold as the NARR.For all of the fields, the largest feature is in the southeast corner of the domain over the ocean, and in most cases hugs the border, suggesting that high CAPE would be modeled beyond the edge of the domain.Results for κ (not shown) are similar.The bearing is calculated from the model feature centroid to the NARR feature centroid with north as the reference, which simply gives a sense of the direction in which the features of one field are situated with respect to the other.For a model whose output variable has small separation distance and good area overlap with the observed feathe package spatstat (Baddeley and Turner, 2005).Orientation angle and aspect ratio are found with help from package smatr (Warton et al., 2012).
ture (e.g., CRCM-CGCM3), the bearing is perhaps not very meaningful.However, for those with larger separation distances and less area overlap (e.g., the models here have fairly good spatial pattern matches, but MM5I-HadCM3 is a candidate for checking the bearing to see if the problem exists for other variables), then the bearing could prove useful to a modeler hoping to diagnose how the model failed.
At lower thresholds than 75 % frequency (not shown), an additional area of high frequency is generally observed in the southwest near or over Baja California.Careful inspection of models using the CCSM3 as the driving model reveals that there is a tendency for more numerous, but smaller, features than produced by the NARR or other driving models (cf.Table 3).In each case, these disjoint features are merged (using centroid distance as the primary criterion) before comparing with the NARR as they are primarily located in the southeast region.
The values in Table 4 can be combined into a single summary very effectively using the fuzzy logic algorithm described in (Davis et al., 2006a, b), which yields a measure called total interest that incorporates user-specified weights in order to obtain a measure based on the attributes of a feature that are most important.It ranges between zero and 1 where a value of 1 indicates a perfect match and the worst value is zero.The technique is performed for these features using the same interest maps and weights as proposed in Davis et al. (2006a).All of the total interest values are very high, ranging from 0.91 to 0.94, indicating good agreement between the models and the NARR.
It is also of interest to determine if one model stands out above others.To do so, we use the SPCT with AE loss, which is a very conservative test because small-scale errors and spatial displacements are not taken into account, and none of the results is statistically significant at any reasonable level suggesting that the null hypothesis of equal performance (as measured by the mean AE loss differential) cannot be rejected.In order to factor in spatial alignment and small-scale errors to the test, the SPCT is also applied with AE + deformation loss following Gilleland (2013).
Indeed, inspection of the graphs of κ (Fig. 2) clearly reveals that some models capture the spatial patterns of the high-event frequency CAPE areas better than others.Field deformation techniques are well-established methods for verifying forecasts spatially where small mis-alignments in space obfuscate model performance.Table 5 displays the results of having found the optimal deformation for each model deformed to better align spatially with the NARR reanalysis.Shown are the original RMSE, denoted RMSE 0 , the RMSE after having applied the optimal deformation, RMSE 1 , the percent reduction in RMSE, and the minimum bending energy required to arrive at the optimal re-alignment.The minimum bending energy is not a summary of the entire deformation, only the non-affine ones.Thus, a small bending energy does not imply that the deformation is necessarily small, but rather that nonlinear distortions are not abundant.However, the bending energy is useful as a comparison because a field, A, with higher bending energy than a field, B, implies that A matches less well than B with the 0-energy field in terms of overall shapes of patterns.A perfect model would have zero RMSE 0 and thus no reduction in error or bending energy.A good model will have a low RMSE 0 paired with low bending energy and often a relatively high reduction in error.A bad model will have relatively high RMSE 0 and either high reduction in error paired with high bending energy, or low reduction in error paired with low bending energy.
Figures 5-7 display examples of the resulting field deformations for κ, for typical deformations for these cases (Fig. 5), the HRM3-HadCM3 (Fig. 6), which requires very little deformation because the original field is already closely aligned with the NARR, and a case where the spatial alignment (and intensities) are fairly poor; resulting in a more tortured deformation (Fig. 7).In most cases, a small amount of affine and nonlinear deformation results in considerable error reduction.The cases that require more nonlinear deformations (MM5I-CCSM3, WRFG-CCSM3, and WRFG-CGCM3; latter two not shown) stand out in both the "distance traveled" and "deformed 1-energy" panels for requiring a relatively large amount of deformation in order to match well with the NARR data product.
Severe thunderstorms require high CAPE, which is basically a measure of the amount of energy available to create very strong updrafts in thunderstorms.High CAPE environments have a warm, moist boundary layer, with colder air aloft, the latter of which increases conditional instability.Proximity to warm, large bodies of water in the domain (i.e., the Gulf of Mexico, Caribbean, and Gulf of California) plays a large role in dictating the spatial distribution of high CAPE in the domain as they are the primary sources of moisture.Moisture transport mechanisms also play a role.High CAPE does not often occur at high elevation or near the west coast because near-surface moisture is too low and/or near-surface temperature is too cold.In the CCSM3-driven simulations, the RCMs inherit an atmosphere that is too dry from the CCSM3 in the warm season (Bukovsky and Karoly, 2010;Bukovsky et al., 2013).This dryness would strongly effect the frequency of high CAPE values in the central part of the country during the dominant season for severe weather in the region.In the MM5I vs. the CRCM-CCSM3-driven simulations, it is likely that moisture transport mechanisms simulated by the regional models are playing a strong role in dictating the distribution of moisture, thus resulting in the spatial distribution of high CAPE frequencies east of the Mississippi River.In the HRM3-HadCM3, it is likely that warmseason low level winds are a bit too southeasterly through the Plains, carrying more moisture into the High Plains and Rocky mountain region than is observed, leading to the high CAPE frequencies seen from central Mexico north through Wyoming and eastern Montana.
The reductions in error range from only about 19 % to almost 62 % (WRFG-CCSM3 and WRFG-CGCM3); the WRFG-NCEP case had the third highest reduction in er-   ror.Indeed, the WRFG model combinations had some of the worst spatial alignment with the NARR, so the improvement induced by re-alignment is the most drastic.It should be noted, however, that the deformations for the two WRFG cases, WRFG-CCSM3 and WRFG-CGCM3, also have the largest amount of nonlinear deformation with minimum bending energies much greater than any other model combinations.Inspection of the graphs of the deformations (not shown) suggests that the linear deformations are also large for these cases.HRM3-HadCM3 has the least amount of bending energy, and only a small amount of affine displacements from the NARR.Nevertheless, with only a small amount of deformation, this model still achieves a reduction in RMSE, which is small to begin with, by almost 30 %. Field deformation results for ω (Table 6) are, not surprisingly, similar to those for κ, with percent reduction in RMSE ranging from about 16 % to about 50 %.Bending energies are similar, where the MM5I-CCSM3 again requires the most bending energy, but this time at a much higher value of almost six.Results for the CRCM-CGCM3 configuration are arguably the worst with a relatively large RMSE 0 of about 0.11, and a very small reduction in error of only about 11 % that is achieved only after requiring a relatively high amount of bending energy (≈ 1.15).Following Gilleland (2013), the SPCT is applied with AE plus deformation loss induced by the above deformations.Some relatively significant results are now found; including one case with better than 1 % significance, one with better than 5 % significance, two cases with better than 10 % significance, and three with about 15 % significance.Table 7 displays the test results for the cases where the p value is less than 0.50.As mentioned above, the HRM3-HadCM3 model appears to be the closest to the NARR in terms of spatial pattern and location, as well as having about the right frequencies in these areas; only a relatively small amount of deformation is needed to optimize the alignment.Subsequently, it is no surprise that this model is shown to be better than all the other models; three of which are significantly better at the 10 % level or better according to the SPCT with AE + deformation loss.Models with the HadCM3 component generally fared very well under this test, and the MM5I combinations also fared well.In general, the worse models failed to capture the spatial extent of areas with frequently Values shown are the mean loss differential statistic and associated p value in parentheses.Bold face emphasizes the "better" model according to the test; where negative (positive) values mean model 1 (model 2) is better.( * * * ) indicates significance at the ≈ 0 % level, ( * * ) at the 5 % level, ( * ) at the 10 % level, ( † ) at the 20 % level.Note, the CRCM-NCEP case is not included because a good-fitting variogram could not be found for any of the loss differential fields associated with this model.They also tend to project considerably less frequency in the southwestern part of the domain.Despite the fact that the ω deformation results are similar to those for κ, the SPCT with AE + deformation loss results are less similar.However, the HRM3 configurations do still tend to outperform other models, in one case with statistical significance at almost the 10 % level (Table 8).

Conclusions
In this study, several advanced weather forecast verification techniques for high-resolution gridded verification sets are applied in a novel way to severe-storm indicators from several of the North American Climate Change Assessment Program (NARCCAP) climate models.In particular, focus is placed on the distributional property of how well the models capture the frequencies of severe-storm environments when the field energy is high, where field energy is defined by the upper quartile over space and this field energy is considered to be high when it is in the upper 90th percentile over time.For ease of discussion, we denote κ to be the frequency of CAPEs exceeding 1000 J kg −1 conditional upon high field energy for CAPE, and similarly, ω to be the frequency of WmSh's exceeding 225 m 2 s −2 conditional upon high field energy for WmSh, where WmSh is equal to √ 2 • CAPE • S, and S denotes 0-6 km vertical wind shear (ms −1 ), provided that CAPE ≥ 100 J kg −1 and 5 ≤ S ≤ 50 (zero otherwise).Previous studies found concurrently high values of CAPE and S to be important indicators of severe-storm activity, and the derived WmSh indicator from these coarse-scale variables discriminates severe-storm activity well as a univariate variable.
In general, the NARCCAP runs under estimate the spatial extent of high frequency κ and ω where the HRM3-HadCM3 model run performs the best; having an area ratio near unity at ≈ 90 % and an intersection area of about 89 % for κ and about 90 % in both categories for ω.The CRCM-NCEP run is the next best in this regard with only slightly lower ratios.For ω the numbers are similar for these models, although the CRCM-NCEP has slightly better overlap (intersection area about 0.91 vs. 0.84) and the CRCM-CGCM3 also has a high area ratio (0.82 vs. 0.60) and intersection area (0.85 vs. 0.66).Otherwise, the area ratios for high frequencies for most models range between about 20 and 60 % (0.91 for HRM3-HadCM3) for κ and between about 35 and 95 % for ω; results are similar for intersection area for both frequencies.
The application of binary image metrics suggests that overall the models do reasonably well at capturing highfrequencies of κ and ω, but for the very high-frequency areas, some models perform less well.In particular, mean error distance and Baddeley's are applied, and for thresholds above 80 %, it is found that the best models at capturing κ are those that drive the regional models HRM3 and WRFG.The worst at capturing the spatial patterns for κ are those with CRCM and MM5I regional models, as well as those driven by CCSM3 and CGCM3.For ω, the runs with NCEP as the driving model perform worse at capturing frequency area patterns for frequencies above about 80%, as well as those utilizing CCSM3 and CGCM3.
The above results consider only spatial areas of highfrequency severe-storm environments.Two methods are utilized in this study to address both the spatial alignment and intensity (i.e., frequencies) simultaneously: the forecast quality index (FQI; not shown) and the spatial prediction comparison test (SPCT) with absolute error (AE) + field deformation loss.For both κ and ω, FQI results suggest that all of the models perform best at capturing severe-storm environment frequencies and spatial patterns of those frequencies for thresholds between about 30 and 75 %.At the lowest thresholds, CRCM-CCSM3 and CRCM-NCEP stand out as being exceptionally good for both κ and ω meaning that they may underpredict the frequency of severe-storm environments, but they otherwise capture these areas but with too few occurrences.The SPCT with AE + field deformation loss is an overall estimation of how well the models perform directly (without relying on setting thresholds).For κ, the HRM3-HadCM3 model is clearly the best model, but models driven by CCSM3 fare well generally, as do those with the CRCM regional model.For ω, the CRCM-CCSM3 is the clear winner over all other models, but the HRM3-HadCM3 also performs well.The MM5I regional model is generally outperformed by other models.
The utility of applying spatial forecast verification techniques for climate model evaluation studies is presented, and the results of this study for severe-storm environments provide important insight into how to interpret future model runs for these NARCCAP models.In particular, caution is required when considering very high frequencies for ω, and focus should be restricted to more moderate thresholds.Moreover, the spatial extent of future storm environments may be an underestimation from nearly all of the model runs, and more weight should be put on the HRM3-HadCM3 run than other models, with considerably less weighting on model combinations involving CGCM3.
Some methods provide analogous information, which provides consistency in ascertaining model performance, but each can provide its own unique perspective depending on the fields in question.For example, image warping is a highly complicated approach, which could be considered unnecessary for simply inferring about how far off each model is from the NARR.On the other hand, it provides the only method known to the current authors that provides a statistical hypothesis test (or confidence intervals) that accounts for both spatial correlation and displacement errors.The binary image metrics such as the Hausdorff, partial Hausdorff, and Baddeley all provide distributional summaries of the absolute difference in distance maps between two binary "event" fields, with providing arguably the most useful information.A summary of these measures can be found in Baddeley (1992a, b) and Schwedler and Baldwin (2011).The FQI incorporates such displacement information, but also intensity information so that it may provide redundant information as these other distance map-based measures, but depending on the intensities, it could also yield different results.The feature-based approaches utilize many of these same types of information, but inform about specific features within a field, which in the present case is less important, but does describe how some models have two smaller features instead of one large feature (i.e., area of higher frequency κ/ω).

Figure 1 .
Figure 1.Images of the ω frequencies (shown as percentages per a reviewer request) for NARR and each NARCCAP model configuration.

Figure 2 .
Figure 2. Images of the κ frequencies for NARR and each NARCCAP model configuration.

Figure 3 .
Figure3.Top row: two binary images with dark blue showing event areas in A and B, respectively.Second row: distance maps for the images in the top row (shortest distances, in numbers of grid squares, from each pixel/grid point to an event in A or B).Third row: absolute values of the differences between the distance maps in the middle row.Bottom left: image from the second row left masked by the image from the top row right, and bottom right is the image from the second row (right) masked by the image in top row (left).

Figure 5 .
Figure 5. Deformation results for κ.Top left is NARR reanalysis (0-energy field), top middle is CRCM-CCSM3 (1-energy field), top right is the error between NARR and CRCM-CCSM3.Bottom left shows the distance that the intensity "traveled" to arrive at each grid point, bottom middle is the deformed CRCM-CCSM3 field, and bottom right is the error field between NARR and the deformed CRCM-CCSM3.

Table 2 .
Some notations and abbreviations used in this manuscript.

Table 3 .
Feature identification and properties for frequency of ω.Features identified using a threshold of 75 % frequency.

Table 4 .
Merged and matched feature comparisons for ω.Features identified using a threshold of 75 % frequency.Minimum boundary separation is zero for all comparisons.Total interest is given in parentheses below model name.

Table 5 .
Results from deforming climate models to better spatially align with NARR reanalysis for κ.RMSE 0 is the original RMSE, RMSE 1 the resulting RMSE between the deformed model and NARR, and the bending energy is a summary measure of the amount of nonlinear deformations applied to deform the field.

Table 7 .
SPCT results when AE + deformation loss is applied to κ. Results shown are only for those cases with p values ≤ 50%.
high values of CAPE and WmSh.They tend to miss, or underpredict, the high-frequencies in the northwest extending to eastern Colorado and Wyoming compared with NARR.