Introduction
Predicting extreme weather events is a difficult challenge even
for relatively high-resolution weather models because of the scale difference
between the models and the very fine scales attributed to some high-impact
weather events, such as tornadoes and hail storms. Coarse-scale climate
models, therefore, cannot directly provide information about the distribution, and other
characteristics of interest (such as timing and location), of such events.
However, it is possible to simulate large-scale environments that are more
favorable for severe weather. Previous work has
found that concurrently high values of convective
available potential energy (CAPE; J kg-1) and 0–6 km vertical wind
shear (S; m s-1) are useful large-scale indicators for environments
conducive to severe weather. The conversion of CAPE into the (theoretical)
maximum updraft wind speed (Wmax = √(2 × CAPE)) has a
clearer connection to severe weather, and it has the same units as
S (m s-1). For example, an analysis of the same data
found that the product of Wmax and S yields a clearer distinction in
probability distributions stratified by increasingly severe storms, and that
a product, Wmax⋅S, >225 m2 s-2, in particular, was found
to be associated with a fairly high likelihood for severe storms, with higher
values (i.e., products >500 m2 s-2) indicating higher likelihoods
for severe or worse (e.g., significant tornadic) weather.
Several studies have analyzed climate model output for these variables. For
example, one study used version 3 of the high-resolution regional
climate model produced by the Abdus Salam International Centre for
Theoretical Physics (RegCM3) to investigate future changes in CAPE and S
under the Special Report on Emissions
Scenarios (SRES) A2 emission scenario. That study
looked at various measures, such as the number of days on which the product of
CAPE and S exceeded a high threshold locally. Van Klooster and
Roebber (2009) also investigated changes under the A2 emission scenario, but
using the coarse-resolution global Parallel Climate Model.
Another study examined both future and present convective
environments using a dynamically downscaled global climate model (GCM),
namely the Weather Research and Forecasting regional climate model (WRFG)
from the North American Regional Climate Change Assessment
Program (NARCCAP).
A few studies analyzed current reanalysis data using statistical extreme
value techniques to project future scenarios. For
example, one investigated changes in the very
extreme values of Wmax⋅S at each grid point separately using the
National Center for Atmospheric Research (NCAR)/National Centers for
Environmental Prediction (NCEP) reanalysis (henceforth, NCEP reanalysis).
Another applied a rigorous spatial extreme value model
using hierarchical Bayesian techniques to the same reanalysis data over North
America. A third took a very different approach, studying
patterns of Wmax⋅S conditional upon the existence of
extreme Wmax⋅S activity in the spatial field.
The aim of this study is to evaluate how well regional climate models from
NARCCAP are able to capture frequencies of high values of the product of
Wmax and S (henceforth WmSh). Following previous work, WmSh is
set to zero unless CAPE ≥ 100 J kg-1 and
5 ≤ S ≤ 50 m s-1, in order to ensure that there are sufficient amounts of
both CAPE and S without having too much S (values of S larger than 50 m s-1 greatly reduce any potential storm activity). In particular, it is
desired to investigate how well these relatively high-resolution models
capture spatial patterns of common severe-storm environments, defined herein
to be when the upper quartile (over space) of WmSh exceeds 225 m2 s-2.
Analogous to that earlier work, attention is restricted to spatial
patterns of frequency when conditioning on high field energy, defined to be
when the upper quartile over space is large.
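As a minimal sketch of the definition above (not the code used in this study), the conditional product can be computed from gridded CAPE and S as follows. The function name wmsh and its keyword arguments are hypothetical, NumPy is assumed, and negative CAPE values are assumed to be clipped to zero before the square root is taken.

```python
import numpy as np

def wmsh(cape, shear, cape_min=100.0, s_min=5.0, s_max=50.0):
    """Return Wmax * S where the CAPE and shear conditions hold, else zero.

    cape  : CAPE in J kg-1 (array-like)
    shear : 0-6 km vertical wind shear S in m s-1 (array-like, same shape)
    """
    cape = np.asarray(cape, dtype=float)
    shear = np.asarray(shear, dtype=float)
    # Theoretical maximum updraft speed (m s-1); clipping negative CAPE
    # to zero is an assumption of this sketch.
    wmax = np.sqrt(2.0 * np.clip(cape, 0.0, None))
    ok = (cape >= cape_min) & (shear >= s_min) & (shear <= s_max)
    return np.where(ok, wmax * shear, 0.0)

# Three illustrative (made-up) grid points: the first passes both
# conditions, the second has too little CAPE, the third too much shear.
example = wmsh([1000.0, 50.0, 2000.0], [5.0, 20.0, 60.0])
```

The first illustrative point (CAPE of 1000 J kg-1 with S = 5 m s-1) lands just below the 225 m2 s-2 threshold, while the other two are set to zero by the conditions.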
Often climate models are evaluated based on subjective human observation,
which is limited because of human
bias. Therefore, one main
objective of this paper is to demonstrate how recently proposed
techniques from spatial weather forecast verification can be employed in the
climate setting to describe how well the models are able to capture the
frequency of severe-storm environments. Several reviews of these methods are available, and
many methods have been proposed since those reviews;
see http://www.ral.ucar.edu/projects/icp/references.html for a current
list of spatial forecast verification references.
Methods
For this paper, the focus is on evaluating patterns of the large-scale
indicators for severe weather conditional upon having high field energy. Following earlier work, high field energy is taken to
mean that the upper quartile (over space) of the variable of interest is
larger than its 90th percentile (over time). For example, from the
space–time process for WmSh, a new univariate time series is calculated that
represents the upper quartile of WmSh over space; this univariate time
series is called q75. Then, for time points when q75 is large (defined to be when it is
greater than its 90th percentile over the entire time series) the
frequencies of WmSh exceeding 225 m2 s-2 are found for each grid
point, resulting in a single spatial field that summarizes where the most
intense severe-storm environments are found. This resulting summary field is
denoted ω, and represents, at each grid point, how often severe-storm environments occur when field energy is high. Figure shows ω for
the North American Regional Reanalysis (NARR) and each model configuration from Table . Similarly,
CAPE alone is also analyzed, and its conditional frequency field (for CAPE ≥ 1000 J kg-1)
is denoted κ (Fig. ).
For convenience, Table displays some of the notations used
throughout the text.
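The conditioning steps described above can be sketched as follows. This is only an illustration under stated assumptions: the function name conditional_frequency is hypothetical, the input is assumed to be a complete (no missing values) array with dimensions (time, y, x), and NumPy's default percentile interpolation is assumed to be acceptable.

```python
import numpy as np

def conditional_frequency(field, threshold, space_q=75.0, time_q=90.0):
    """Frequency of field >= threshold at each grid point, conditional
    on high field energy.

    field : array with shape (time, ny, nx), e.g. WmSh or CAPE
    Returns (freq, q75, high), where q75 is the time series of the upper
    quartile over space and high flags the high-field-energy times.
    """
    field = np.asarray(field, dtype=float)
    nt = field.shape[0]
    # Upper quartile over space: one value per time point.
    q75 = np.percentile(field.reshape(nt, -1), space_q, axis=1)
    # High field energy: q75 above its own 90th percentile over time.
    high = q75 > np.percentile(q75, time_q)
    # Fraction of high-energy times on which the threshold is met.
    freq = (field[high] >= threshold).mean(axis=0)
    return freq, q75, high

# Synthetic data only, to show the call; these are not NARCCAP values.
rng = np.random.default_rng(0)
synthetic = rng.gamma(2.0, 100.0, size=(200, 4, 5))
omega, q75, high = conditional_frequency(synthetic, 225.0)
```

Calling the same function with a CAPE array and a threshold of 1000 J kg-1 would give the analogue of κ.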
Images of the κ frequencies for NARR and each NARCCAP model
configuration.
A CAPE value of 1000 J kg-1 is an arbitrary choice, but represents
a value associated with severe weather environments in previous
studies.
The value of 225 m2 s-2 is also an arbitrary choice, but is close to the
value obtained when converting a CAPE of 1000 J kg-1 to Wmax (√2000 ≈ 44.7 m s-1)
and then multiplying by S = 5 m s-1 (≈ 224 m2 s-2), which again results in a strong
severe-storm environment. Of course, WmSh could have a value of 225 m2 s-2 with far lower CAPE (i.e., with higher S), but because CAPE is
also conditioned to be at least 100 J kg-1 and S at least 5 m s-1, the environment is guaranteed to be conducive to severe
weather.
Some notations and abbreviations used in this manuscript.

Convective available potential energy | CAPE (J kg-1)
Maximum updraft velocity, √(2⋅CAPE) | Wmax (m s-1)
0–6 km vertical wind shear | S (m s-1)
Product of Wmax and S conditional upon CAPE ≥ 100 J kg-1 and 5 ≤ S ≤ 50 m s-1 | WmSh (m2 s-2)∗
Time series of the upper quartile of a random variable, x, taken over space | q75
High field energy | q75 larger than its 90th percentile (taken over time)
Frequency of CAPE ≥ 1000 J kg-1 conditional upon high field energy | κ
Frequency of WmSh ≥ 225 m2 s-2 conditional upon high field energy | ω

∗ Previous studies use the abbreviation WmSh to refer to the straight product Wmax⋅S, whereas
here it has the further constraint that CAPE must be at least 100 J kg-1
and 5 m s-1 ≤ S ≤ 50 m s-1.
Spatial pattern and location displacement measures
Several methods are available to summarize the performance of a model in
terms of how well it captures the spatial patterns, locations, and general
shape of observed variables for various thresholds of intensity, and we
summarize the three used in the present study, namely: (a) Baddeley's
Δ, (b) mean error distance, and (c) the forecast quality index (FQI).
Most of these summary measures can be calculated from distance maps (cf.
Fig. , which shows an example). A distance map is a graphic
that shows, for each grid point, x, the shortest distance from x to the
nearest “event”, where an event is defined by a grid cell that exceeds the
given threshold. Of particular interest is the image in the third row of the
figure, which shows the absolute difference in distance maps for binary
fields A and B; several popular measures are derived directly from this
difference field. Also of interest is to mask out the distance map A using
the binary field B, and vice versa (bottom row of the figure).
Top row: two binary images with dark blue showing event areas in A
and B, respectively. Second row: distance maps for the images in the top
row (shortest distances, in numbers of grid squares, from each
pixel/grid point to an event in A or B). Third row: absolute values of
the differences between the distance maps in the second row. Bottom left:
image from the second row (left) masked by the image from the top row (right);
bottom right: image from the second row (right) masked by the image
in the top row (left).
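A distance map of the kind described above can be sketched with SciPy's Euclidean distance transform. The helper name distance_map and the two tiny binary fields are hypothetical and serve only to illustrate the construction; distances are in grid squares.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_map(binary):
    """Shortest Euclidean distance from every grid point to an event."""
    binary = np.asarray(binary, dtype=bool)
    if not binary.any():
        raise ValueError("field has no events; the distance map is undefined")
    # distance_transform_edt measures the distance to the nearest zero,
    # so the event mask is inverted first.
    return distance_transform_edt(~binary)

# Two made-up 5 x 5 binary fields with a single event each.
A = np.zeros((5, 5), dtype=bool)
A[1, 1] = True
B = np.zeros((5, 5), dtype=bool)
B[3, 3] = True

dA, dB = distance_map(A), distance_map(B)
abs_diff = np.abs(dA - dB)   # analogue of the third row of the figure
dA_on_B = dA[B]              # distance map of A masked by the events of B
dB_on_A = dB[A]              # distance map of B masked by the events of A
```

The maximum of abs_diff is the Hausdorff distance between the two event sets, and the masked maps are the ingredients of the mean error distance.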
The Hausdorff distance is a widely known location measure that simply takes
the maximum value of the absolute differences between distance maps for two
fields (e.g., the maximum value from the image in the third row of
Fig. ). The metric has often been modified in order to have
measures that are not as sensitive to small changes in the two fields
resulting from taking the maximum value. The Baddeley Δ image
metric, for example, is a modification of
the Hausdorff metric that replaces the maximum with an Lp norm and, in
general, further modifies the shortest distances between each grid point and
the event, usually by setting any distances greater than a certain amount to a
constant. That is,
\[ \Delta_f^p(A,B) = \left[ \frac{1}{N} \sum_{i=1}^{N} \left| f(d(s_i,A)) - f(d(s_i,B)) \right|^p \right]^{1/p}, \]
where N is the total number of grid points, d(s, A) is the shortest distance from grid point s to the nearest event in A, the summation is over
all points, s, in the grid, and f is any continuous function on
[0,∞) such that f(x+y)≤f(x)+f(y) and is strictly increasing at
zero with f(t)=0 if and only if t=0. For example, a common choice is to
take f(x)=min{x,constant}. The function's purpose is to eliminate
edge effects, but the metric is highly sensitive to the choice of constant.
The parameter p is chosen by the user. If p=1 a straight average of
values is achieved. The Hausdorff metric can be regained by letting p tend
to infinity. In the limit as p approaches zero, one obtains the minimum of
the portion within the absolute values of Eq. ().
In terms of Fig. , Baddeley's Δ metric applies a
function f to each of the two fields in the second row, then takes the
absolute difference between these fields, and finally takes the Lp norm
over the resulting image. Following previous work, here,
f(x)=min{x,constant}, but the constant is set to infinity and p
to two, so Δ is simply the L2 norm of the image in the third row of
the figure.
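A minimal sketch of Baddeley's Δ with f(x) = min{x, constant} is given below. The function name baddeley_delta is hypothetical; the defaults (p = 2, constant = ∞) mirror the choices stated above, and passing p = ∞ returns the Hausdorff distance.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def baddeley_delta(a, b, p=2.0, const=np.inf):
    """Baddeley's Delta between two binary event fields a and b."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    if not a.any() or not b.any():
        raise ValueError("both fields need at least one event")
    # f(d(s, A)) and f(d(s, B)) with f(x) = min(x, const).
    fa = np.minimum(distance_transform_edt(~a), const)
    fb = np.minimum(distance_transform_edt(~b), const)
    diff = np.abs(fa - fb)
    if np.isinf(p):
        return float(diff.max())          # Hausdorff limit
    return float(np.mean(diff ** p) ** (1.0 / p))

# Made-up 5 x 5 fields with one event each.
A = np.zeros((5, 5), dtype=bool)
A[1, 1] = True
B = np.zeros((5, 5), dtype=bool)
B[3, 3] = True

delta = baddeley_delta(A, B)                      # p = 2, no cutoff
delta_cut = baddeley_delta(A, B, const=1.0)       # distances capped at 1
hausdorff = baddeley_delta(A, B, p=np.inf)
```

Capping the distances at one grid square shrinks the value considerably in this toy case, which illustrates the sensitivity to the choice of constant noted above.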
The mean error distance (MED) is the mean of the distance map for one field
taken over only the events in the other field. In other words, again using
Fig. as an example, MED(A, B) averages the image in the second
row and first column over the event space defined by the binary field for B
(top row second column). Note that the MED is not symmetric because, in
general, MED(A, B) ≠ MED(B, A). In fact, the lack of symmetry provides
a useful measure for diagnosing how close (or far away), in terms of average
distance, forecast events are from those observed when calculating
MED(Observation, Forecast), and vice versa for MED(Forecast, Observation).
See Gilleland (2016b) for more information about this
approach. If the two values are close together, then it suggests
better agreement between the two fields in terms of both placement and
numbers of events. For example, in the figure, MED(A, B) would be rather
small (average of the non-white areas in the bottom left panel) because B
does not have any events far away from A, which is generally indicative of
good agreement between the two fields. However, MED(B, A) would be
comparatively large (average of the non-white areas in the bottom right
panel) because field A has a large event that B lacks.
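The asymmetry of the MED can be seen in a small sketch. The function name mean_error_distance and the toy fields are hypothetical; field A is given an extra event that B lacks, loosely mimicking the situation described for the figure.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def mean_error_distance(a, b):
    """MED(A, B): mean of the distance map of A over the events in B."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    if not a.any() or not b.any():
        raise ValueError("both fields need at least one event")
    return float(distance_transform_edt(~a)[b].mean())

# Made-up 10 x 10 fields: A has two events, B has one near the first.
A = np.zeros((10, 10), dtype=bool)
A[1, 1] = True
A[1, 8] = True
B = np.zeros((10, 10), dtype=bool)
B[1, 2] = True

med_ab = mean_error_distance(A, B)   # small: B's event is close to A
med_ba = mean_error_distance(B, A)   # larger: A has an event far from B
```

Here MED(A, B) is one grid square, whereas MED(B, A) is 3.5 grid squares because the second event in A is six grid squares from the only event in B.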
A further modification of the Hausdorff distance, proposed
specifically for verifying high-resolution
forecasts, normalizes the measure through an average of partial Hausdorff distances (PHD) obtained for surrogate fields, which results in a measure
that, like Δ, has desirable mathematical properties. It is one of the
rare location metrics that also incorporates intensity information in
addition to just spatial pattern information. First, stochastic realizations
of the observed process that are forced to have the same Fourier spectra,
probability density function, and spatial correlation structure as the
observed field, called surrogate fields, are drawn to be used as a
normalizing factor. Then, the FQI between two fields
A and B, where A is the observed field, and Ci is the ith of n
surrogate realizations of A, is given by
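\[ \mathrm{FQI}(A,B) = \frac{ \mathrm{PHD}_k(A,B) \Big/ \dfrac{1}{n} \sum_{i=1}^{n} \mathrm{PHD}(A,C_i) }{ \dfrac{2\mu_A\mu_B}{\mu_A^2+\mu_B^2} \cdot \dfrac{2\sigma_A\sigma_B}{\sigma_A^2+\sigma_B^2} }. \]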
Here, PHDk is the partial Hausdorff distance using the kth largest
shortest distance value, μ and σ denote the mean and standard
deviation, respectively, over the field. The denominator on the right is
derived from another image summary index called the universal image quality
index (UIQI), which is the denominator on the right
multiplied by the correlation between the two
fields. The denominator is referred to as the modified
UIQI, the first component of which is a measure of the model field's bias,
and the second of the variability. The UIQI ranges between -1 and 1, with a
value equal to one indicative of a perfect match between the two fields. A
smaller value of UIQI indicates a lot of variability.
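The intensity part of the FQI can be sketched as below. The function names are hypothetical, the partial Hausdorff distances are assumed to have been computed elsewhere (the surrogate-generation step is not shown), and the population standard deviation is assumed, although the variability ratio is unchanged if the same estimator is used for both fields.

```python
import numpy as np

def modified_uiqi(a, b):
    """Return (modified UIQI, bias component, variability component)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mu_a, mu_b = a.mean(), b.mean()
    sd_a, sd_b = a.std(), b.std()
    # Both ratios are undefined if the two means (or the two standard
    # deviations) are simultaneously zero; that case is not handled here.
    bias = 2.0 * mu_a * mu_b / (mu_a ** 2 + mu_b ** 2)
    variability = 2.0 * sd_a * sd_b / (sd_a ** 2 + sd_b ** 2)
    return bias * variability, bias, variability

def fqi(phd_ab, phd_surrogates, a, b):
    """FQI from a precomputed PHD_k(A, B) and the PHDs between A and
    its surrogate fields C_1, ..., C_n (assumed supplied by the caller)."""
    normalized = phd_ab / np.mean(phd_surrogates)
    return float(normalized / modified_uiqi(a, b)[0])

# Made-up fields: b is simply twice a, so both components equal 0.8.
a = np.arange(1.0, 11.0).reshape(2, 5)
b = 2.0 * a
m_uiqi, bias, variability = modified_uiqi(a, b)
```

Both components equal one only when the two fields share the same mean and the same standard deviation, so a modified UIQI below one inflates the FQI.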
Feature-based analysis
Numerous methods are used in meteorology that fall under the category of
feature based. The idea is to identify individual features within a field,
and analyze those features for various characteristics, for example, those
discussed in Sect. above. In this study, a previously proposed
feature-based method is loosely followed. In meteorological
applications, fields often have multiple features of interest, which would
also be the case in the present context if a much larger domain were
employed. However, even with a larger domain, the relative smoothness of
these climate fields results in few distinct objects. Consequently, it is
very straightforward to match features between fields (e.g., as is done
with the Method for Object-based Diagnostic Evaluation, or
MODE), as well as to merge
features within a field. The fields of κ and ω demonstrate only
one or two features in each model field, all of which are clearly matched
with the one NARR feature that shows up. Therefore, the task is fairly
simple.
Here, features are defined by simply identifying contiguous grid points that
exceed a threshold of 75 % frequency, indicating areas where storm-favoring
environments are found to occur very often. Various summary properties are
evaluated and compared in the present study. Focus is centered on the
distances between the centers of mass of the features between each field
(centroid distance), the area ratio (defined to be the area of the smaller
feature divided by the area of the larger one; though here the model feature
areas are always smaller), the common intersection area (given as a percent),
and Baddeley's Δ described in section . Other
properties are shown for information, but are not highly useful as
comparative measures.
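A rough sketch of the feature identification and a few of the summary properties follows. It is an illustration only: the function name compare_features is hypothetical, all features within a field are treated as already merged, centroids are taken from the binary masks rather than intensity-weighted, and the intersection is returned as a raw count of grid squares because the normalization used for the percentage is not specified here.

```python
import numpy as np
from scipy import ndimage

def compare_features(obs_freq, mod_freq, threshold=0.75):
    """Compare merged threshold-exceedance features of two frequency fields."""
    obs = np.asarray(obs_freq) >= threshold
    mod = np.asarray(mod_freq) >= threshold
    if not obs.any() or not mod.any():
        raise ValueError("each field needs a grid point above the threshold")
    # Number of contiguous features (SciPy's default 4-connectivity).
    n_obs = ndimage.label(obs)[1]
    n_mod = ndimage.label(mod)[1]
    # Centroids of the merged masks in (row, column) grid units.
    c_obs = np.array(ndimage.center_of_mass(obs.astype(float)))
    c_mod = np.array(ndimage.center_of_mass(mod.astype(float)))
    small, large = sorted([int(obs.sum()), int(mod.sum())])
    return {
        "n_features": (n_obs, n_mod),
        "centroid_distance": float(np.linalg.norm(c_obs - c_mod)),
        "area_ratio": small / large,
        "intersection_cells": int((obs & mod).sum()),
    }

# Made-up frequency fields with one rectangular feature each.
obs_freq = np.zeros((10, 10))
obs_freq[2:6, 2:6] = 0.9
mod_freq = np.zeros((10, 10))
mod_freq[3:6, 4:8] = 0.8
summary = compare_features(obs_freq, mod_freq)
```

In this toy case the model feature covers 12 grid squares against 16 for the observed one, so the area ratio is 0.75, and the two features share six grid squares.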
Field deformation
Field deformation methods deform the spatial locations of the forecast field
so that the values of the forecast variable better align with those of the
observations. Numerous methods for deforming the field are
available, and many have been proposed in the atmospheric science literature
for forecast verification, forecast calibration, data assimilation,
and short-term forecasting. In this
study, we follow an image warping approach that
utilizes a pair of thin-plate spline transformations
to make the deformation mapping, which maps a subset of
k control locations from the observed field, or 0-energy field, to k
locations in the forecast field, dubbed the 1-energy field.
In many image warping applications, the set of 0- and 1-energy control
locations are easily found by hand. For example, if comparing the images of
one person's face to another, it is easy to choose corresponding features,
such as the point of the nose or the top of the head. While it is also
possible to choose the control locations by hand, an alternative approach is
preferred here, whereby the 1-energy control locations are found through a
numerical optimization procedure. The objective function to be optimized
follows an approach used in several earlier studies, and is effectively the
root mean square error (RMSE) between the observed and deformed forecast
field plus an additional penalty term for mappings that cover too much
distance or are too nonlinear. The latter term helps to prevent
non-physical deformations, such as folding. The
penalty term includes a precision matrix, or the inverse of a covariance
matrix for the mappings. One earlier approach employs a precision matrix that
is zero for locations separated by a certain distance, and positive
otherwise, which helps to obtain deformations where nearby locations move in
similar directions. In this study, the bending energy matrix, described
below, is used for the deformation, which penalizes nonlinear deformations.
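The pair of thin-plate spline transformations is the bivariate function,
\[ \Phi(s) = (\Phi_1(s), \Phi_2(s))^T = a + G s + W^T \psi(s), \]
where each location, s, in
the domain is d×1 (here, the dimension d=2), and
\[ \psi(s) = \left( \psi(s - p_{0,1}), \ldots, \psi(s - p_{0,k}) \right)^T, \]
where p0,i, i=1,2,…,k, are the 0-energy control locations, and
\[ \psi(h) = \begin{cases} \|h\|^d \log(\|h\|), & \text{if } \|h\| > 0 \text{ and } d \bmod 2 = 0, \\ \|h\|^d, & \text{if } \|h\| > 0 \text{ and } d \bmod 2 \neq 0, \\ 0, & \text{otherwise.} \end{cases} \]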
The mapping has dk + d^2 + d parameters: (a) the d×1 vector, a,
(b) the d×d matrix G, and (c) the k×d matrix
W, which for d=2 results in 2k+6 parameters. The
natural thin-plate splines used herein are subject to the further
constraints that the columns of coefficients in W sum to zero
(i.e., 1^T W = 0) and that the sum of the products of these
coefficients times the 0-energy control locations is also zero (i.e.,
p0^T W = 0). The set of equations can be written succinctly in
matrix form as
\[ L A = \begin{pmatrix} \Psi & 1_k & p_0 \\ 1_k^T & 0 & 0 \\ p_0^T & 0 & 0 \end{pmatrix} \begin{pmatrix} W \\ a^T \\ G^T \end{pmatrix} = \begin{pmatrix} p_1 \\ 0 \\ 0 \end{pmatrix}. \]
The inverse matrix, L^{-1}, is of particular importance because
when performing the numerical optimization, this matrix needs only to be
calculated one time at the beginning, and it defines the resulting
warp function, which is a linear function of the 1-energy control
locations and the upper left k×k partition of L^{-1},
denoted by L^{11}. That is, W = L^{11} p1.
The matrix L^{11} is also known as the bending energy matrix because it determines the amount of the nonlinear deformation;
a and G give the linear, or affine, part of the
deformation.
Note that because of the constraints on W, there are also three
constraints on L^{11}, namely 1_k^T L^{11} = 0
and p0^T L^{11} = 0 (recalling that p0 is
k×2). The transformations imposed by Eqs. ()
and () minimize the total bending energy over all
possible interpolating functions from the 0-energy control locations
p0 to the 1-energy control locations p1, and the total
minimized bending energy (referred to henceforth as simply the bending
energy) is easily found from
\[ \mathrm{trace}\left( p_1^T L^{11} p_1 \right). \]
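The construction of the bending energy matrix can be sketched for d = 2 as follows. The function names and the five control locations are hypothetical, and Ψ is assumed to be the k×k matrix with elements ψ(p0,i − p0,j), which is the usual thin-plate spline construction.

```python
import numpy as np

def tps_kernel(h):
    """psi(h) = ||h||^2 log(||h||) for d = 2, with psi(0) = 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(h > 0, h ** 2 * np.log(h), 0.0)

def bending_energy_matrix(p0):
    """Upper-left k x k block of the inverse of L for k x 2 locations p0."""
    p0 = np.asarray(p0, dtype=float)
    k = p0.shape[0]
    h = np.linalg.norm(p0[:, None, :] - p0[None, :, :], axis=-1)
    L = np.zeros((k + 3, k + 3))
    L[:k, :k] = tps_kernel(h)   # Psi
    L[:k, k] = 1.0              # 1_k
    L[k, :k] = 1.0              # 1_k^T
    L[:k, k + 1:] = p0          # p_0
    L[k + 1:, :k] = p0.T        # p_0^T
    return np.linalg.inv(L)[:k, :k]

def bending_energy(p1, L11):
    """trace(p1^T L11 p1) for k x 2 1-energy control locations p1."""
    p1 = np.asarray(p1, dtype=float)
    return float(np.trace(p1.T @ L11 @ p1))

# Made-up 0-energy control locations: unit-square corners plus the center.
p0 = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]], dtype=float)
L11 = bending_energy_matrix(p0)
affine_p1 = 2.0 * p0 + 1.0      # scaling and shift only
bent_p1 = p0.copy()
bent_p1[4, 0] += 0.1            # move only the center point
```

A purely affine set of 1-energy locations gives a bending energy of (numerically) zero because of the constraints on L^{11}, whereas displacing a single control location gives a positive value.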
In order to find the optimal mapping of p0 to p1,
the 0-energy control locations are chosen and fixed, and then the p1
locations are moved until an objective function is minimized. Denoting the
1-energy field by Z^ and the 0-energy field by Z, the objective
function used here is the same as that used in earlier work, and is
given by
\[ Q(p_1) = \frac{1}{2\sigma_\varepsilon^2} \sum_{i=1}^{N} \left[ \hat{Z}(\Phi(s_i)) - Z(s_i) \right]^2 + \beta \left[ (p_1 - p_0)_x^T L^{11} (p_1 - p_0)_x + (p_1 - p_0)_y^T L^{11} (p_1 - p_0)_y \right], \]
where the x and y subscripts denote the two component coordinates of the
control locations, β is a penalty parameter chosen a priori to
determine how much or little nonlinear warps should be penalized, and
σε² is a nuisance parameter giving the error variance
between the 0-energy and deformed 1-energy fields. The objective function (Eq. ) results from the penalized likelihood under an assumption of
Gaussian errors between the 0-energy and deformed 1-energy fields, and
potentially provides a means for obtaining confidence intervals on the
deformations, but this potential will be left for future work.
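The evaluation of Q for one candidate set of 1-energy control locations can be sketched as below. The function name warp_objective is hypothetical, and the deformed field Ẑ(Φ(s)) is assumed to have been computed already (for instance by interpolating the 1-energy field at the warped locations with a routine such as scipy.ndimage.map_coordinates); the identity matrix used for L^{11} in the example is a placeholder for illustration, not a valid bending energy matrix.

```python
import numpy as np

def warp_objective(z_obs, z_deformed, p0, p1, L11, beta, sigma2=1.0):
    """Penalized sum of squares Q(p1) for a candidate warp.

    z_obs      : 0-energy field Z
    z_deformed : 1-energy field evaluated at the warped locations
    p0, p1     : k x 2 arrays of 0- and 1-energy control locations
    L11        : k x k bending energy matrix
    beta       : penalty parameter chosen a priori
    sigma2     : error variance (nuisance parameter)
    """
    z_obs = np.asarray(z_obs, dtype=float)
    z_deformed = np.asarray(z_deformed, dtype=float)
    misfit = np.sum((z_deformed - z_obs) ** 2) / (2.0 * sigma2)
    disp = np.asarray(p1, dtype=float) - np.asarray(p0, dtype=float)
    # Separate quadratic forms for the x and y displacement components.
    penalty = disp[:, 0] @ L11 @ disp[:, 0] + disp[:, 1] @ L11 @ disp[:, 1]
    return float(misfit + beta * penalty)

# Made-up inputs purely to show the call.
z = np.zeros((2, 2))
z_hat = np.ones((2, 2))
p0 = np.zeros((3, 2))
p1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
q = warp_objective(z, z_hat, p0, p1, np.eye(3), beta=0.5)
```

In practice this function would be passed to a numerical optimizer that moves p1 while p0 stays fixed.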
An earlier application began with two identical and regular sets of
control locations, and used a multi-step procedure that begins with four
control locations and a highly smoothed set of fields, then ratchets the number of
control points up iteratively with decreasingly smoothed sets of fields to
minimize the objective function Q from Eq. (), which enables a
completely automated method for finding the optimal deformation. Because
there are only 9×2=18 warped fields to find here, the domain is
relatively small, and only a very small number of control locations are
required (four to eight, whereas 200 were used in earlier work),
a less automated procedure is employed in this work. First, about four
control locations are selected by hand in the 0-energy field, and an attempt
is made, again by hand, to identify where those locations map to in the
1-energy field. These 1-energy control locations are then used as initial
values in the numerical optimization routine used to minimize Q.
Spatial prediction comparison test
The spatial prediction comparison test (SPCT) is
a spatial modification of a similar time
series test, and it provides a
statistical hypothesis test for two competing forecast models, m1 and
m2, compared against the same observation, a, that accounts for spatial
correlation. The test has been found to be both powerful and
of the right size provided the range of dependence is not too long, even in
the face of contemporaneous correlation (i.e., when m1 is correlated with
m2). First, a loss function, g, must be chosen and applied to each model
against a, giving g1=g(m1,a) and g2=g(m2,a). For example, absolute
error (AE) loss yields g(x,y)=|x-y|. Then the loss differential field, d,
is calculated by taking the difference at each spatial location between g1
and g2 (i.e., g1-g2).
After checking for the existence of spatial drift in the spatial loss
differential field, and removing any spatial trend before proceeding, the
empirical variogram for d is found, say γ^, using all lags up
to half of the maximum possible lag for the study region. Next, a parametric
variogram model is fit to γ^; following earlier work,
the exponential variogram is used here. The test statistic is the usual
Student's t or normal approximation for the paired sample test of the mean
of the loss differential field, but where the standard error is estimated by
averaging the values from the spatial correlation function, by way of a
linear combination of the parametric variogram fit to γ^ over all
lags of the domain. In other words, the test statistic, Sv, to test
whether or not the mean loss differential, d‾, is significantly
different from zero is
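\[ S_v = \frac{\bar{d}}{\widehat{\mathrm{se}}(\bar{d})}, \]
where
\[ \widehat{\mathrm{se}}(\bar{d}) = \sqrt{ \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ \hat{\gamma}(\infty; \hat{\theta}) - \hat{\gamma}(h_{ij}; \hat{\theta}) \right] }, \]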
with θ^ estimated parameters of the parametric variogram model
evaluated at each lag hij. Because the spatial fields presently studied
all have a reasonably large number of grid points, the normal approximation
of the test is used throughout.
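The test statistic can be sketched as follows under several assumptions that go beyond the description above: the exponential variogram parameters (nugget, partial sill, and range) are taken as already estimated, γ denotes the semivariogram, no spatial trend needs to be removed, the fields lie on a regular grid with unit spacing, and the standard error is taken to be the square root of the average covariance implied by the fitted variogram. The function names are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import norm

def exponential_variogram(h, nugget, partial_sill, range_par):
    """Exponential semivariogram with gamma(0) = 0."""
    h = np.asarray(h, dtype=float)
    gamma = nugget + partial_sill * (1.0 - np.exp(-h / range_par))
    return np.where(h > 0, gamma, 0.0)

def spct(m1, m2, obs, nugget, partial_sill, range_par):
    """SPCT statistic and two-sided p value (normal approximation)
    under absolute error loss, for fields on a unit-spaced grid."""
    m1, m2, obs = (np.asarray(x, dtype=float) for x in (m1, m2, obs))
    d = np.abs(m1 - obs) - np.abs(m2 - obs)     # loss differential field
    rows, cols = np.indices(d.shape)
    coords = np.column_stack([rows.ravel(), cols.ravel()]).astype(float)
    h = cdist(coords, coords)                    # all pairwise lags h_ij
    sill = nugget + partial_sill                 # gamma at infinite lag
    cov = sill - exponential_variogram(h, nugget, partial_sill, range_par)
    se = np.sqrt(cov.mean())                     # sqrt of average covariance
    stat = d.mean() / se
    return float(stat), float(2.0 * norm.sf(abs(stat)))

# Made-up 5 x 5 fields: model 2 matches the observation, model 1 is off by 1.
obs = np.zeros((5, 5))
stat, p_value = spct(obs + 1.0, obs, obs, nugget=0.0, partial_sill=4.0,
                     range_par=1e-6)
```

With a negligible range, the standard error reduces to the familiar independent-sample value (the square root of the sill divided by N); longer ranges increase it, which is how the spatial correlation makes the test more conservative.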
Here, the test is conducted under AE loss first, and then with AE + deformation loss for comparison. The latter
allows for both spatial displacement/pattern
error and intensity errors to be simultaneously incorporated into the test,
while also accounting for spatial correlation. The loss is achieved by
finding the AE between the observed field and the deformed forecast field
(cf. Sect. ) and adding these errors to the (Euclidean)
distance each point “traveled” in order to achieve the re-aligned field.
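The AE + deformation loss at each grid point can be sketched as below. The function name is hypothetical, the warped coordinates are assumed to be supplied by the deformation step in grid units, and the two terms are simply added without rescaling, as described above.

```python
import numpy as np

def ae_plus_deformation_loss(obs, deformed, warped_rows, warped_cols):
    """Absolute error after re-alignment plus the distance each point moved.

    obs         : observed (0-energy) field
    deformed    : forecast field after applying the deformation
    warped_rows : row coordinate each grid point was mapped from/to
    warped_cols : column coordinate each grid point was mapped from/to
    """
    obs = np.asarray(obs, dtype=float)
    deformed = np.asarray(deformed, dtype=float)
    rows, cols = np.indices(obs.shape)
    # Euclidean distance "traveled" by each grid point, in grid squares.
    traveled = np.hypot(warped_rows - rows, warped_cols - cols)
    return np.abs(obs - deformed) + traveled

# Made-up example: every point is shifted by (3, 4) grid squares and the
# deformed field still differs from the observation by one unit.
obs = np.zeros((3, 3))
rows, cols = np.indices(obs.shape)
loss = ae_plus_deformation_loss(obs, obs + 1.0, rows + 3.0, cols + 4.0)
```

The resulting loss field would then replace the plain AE loss when forming the loss differential for the SPCT.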
Results
Figure shows the results for the location measures for
ω, the frequency of WmSh greater than 225 m2 s-2
conditional on high field energy (see Table for notation).
It can be argued from visual inspection of the graphic that the models driven
by the HadCM3 global model are closest to reproducing the patterns of
ω associated with the NARR reanalysis. This result is consistent
across the thresholds for mean error distance, but the HRM3–HadCM3 has higher
(worse) Baddeley Δ metric values for the highest thresholds. In terms
of capturing the spatial structure of the most frequent events for WmSh, the
CCSM3-driven runs are the least similar to those found in the NARR, and the
WRFG–CGCM3 performs the worst in terms of capturing the spatial patterns of
ω according to the Baddeley Δ metric; the results for κ
(not shown) are similar. Of course, these results do not account for sampling
uncertainty, so no conclusions can be made with statistical significance
based on these measures.
Baddeley's Δ (p=2, c=∞; top left), mean error
distance conditioning on observed events (top right) and mean error distance
conditioning on “forecast” events (bottom left) for κ. Shapes
indicate the regional model: CRCM (circles), HRM3 (diamonds), MM5I (squares),
and WRFG (triangles). Colors indicate the driving models: CCSM3 (black),
CGCM3 (gray), HadCM3 (blue), NCEP (orange).
Feature identification and properties for frequency
of ω. Features identified using a threshold of 75 % frequency.

Model        Feature  Centroid         Feature area    Orientation  Aspect  Intensity         Intensity
             number                    (grid squares)  angle (∘)    ratio   (lower quartile)  (90th percentile)
NARR         1        (-106.51,37.97)  1825            111.62       0.55    0.82              0.92
CRCM–CCSM3   1        (-121.11,37.84)  97              70.88        0.61    0.78              0.88
             2        (-97.87,38.07)   1317            112.97       0.57    0.78              0.88
CRCM–CGCM3   1        (-105.99,38.10)  1502            111.67       0.66    0.79              0.91
HRM3–HadCM3  1        (-104.08,38.01)  2030            112.98       0.54    0.81              0.90
MM5I–CCSM3   1        (-120.65,37.56)  104             64.51        0.89    0.77              0.86
             2        (-99.30,37.98)   718             114.73       0.57    0.77              0.82
MM5I–HadCM3  1        (-117.76,37.98)  377             110.29       0.58    0.79              0.93
             2        (-116.14,37.81)  296             117.18       0.77    0.79              0.88
WRFG–CCSM3   1        (-108.76,37.96)  613             56.60        0.62    0.77              0.91
             2        (-107.30,37.80)  346             43.07        0.65    0.78              0.88
WRFG–CGCM3   1        (-107.28,38.00)  849             7.39         0.60    0.76              0.82
CRCM–NCEP    1        (-105.04,37.91)  1736            113.21       0.63    0.79              0.90
WRFG–NCEP    1        (-109.13,37.99)  1515            67.09        0.71    0.83              0.99
Merged and matched feature comparisons for ω. Features
identified using a threshold of 75 % frequency. Minimum boundary separation
is zero for all comparisons. Total interest is given in parentheses below
model name.

Model        Features compared  Centroid distance  Angle           Area   Intersection  Bearing         Baddeley Δ
             (NARR vs. model)   (grid squares)     difference (∘)  ratio  area          (∘ from north)  (p=2, c=∞)
CRCM–CCSM3   1 vs. (1 and 2)    7.05               46.99           0.77   0.69          88.74           2.98
CRCM–CGCM3   1 vs. 1            0.53               0.05            0.82   0.85          107.77          2.39
HRM3–HadCM3  1 vs. 1            2.42               1.36            0.90   0.90          90.42           3.43
MM5I–CCSM3   1 vs. (1 and 2)    4.50               4.37            0.45   0.59          87.99           7.82
MM5I–HadCM3  1 vs. (1 and 2)    10.54              33.58           0.37   0.54          -86.34          13.128
WRFG–CCSM3   1 vs. (1 and 2)    1.73               44.21           0.53   0.63          -86.73          10.74
WRFG–CGCM3   1 vs. 1            0.77               44.23           0.47   0.62          -93.00          12.14
CRCM–NCEP    1 vs. 1            1.47               1.60            0.95   0.91          86.80           1.52
WRFG–NCEP    1 vs. 1            2.62               44.52           0.83   0.90          -89.77          7.88
A feature-based analysis is also conducted (Tables
and ), which provides similar, but more detailed,
information about how the fields compare to one another. Tables
and show summary statistics for identified ω features
(Table ) and feature comparisons for matched (possibly first merged)
ω features (Table ) after applying a threshold of
at least 75 % frequency of occurrence. In each case, it is clear that the
HRM3–HadCM3 does the best job of all of the models at achieving a roughly correct spatial
pattern for the most frequent ω areas. It has a relatively low centroid distance,
angle difference, and Baddeley Δ value, as well as one of the highest area ratios (0.90)
and intersection areas (also 0.90, matching WRFG–NCEP and just below CRCM–NCEP at 0.91). Moreover, it has the
same number of identified features above the 75 % threshold as the NARR. For all of
the fields, the largest feature is in the southeast corner of the domain over the ocean,
and in most cases hugs the border, suggesting that high CAPE would be modeled beyond the
edge of the domain.
Results for κ (not shown) are similar. The bearing is calculated from
the model feature centroid to the NARR feature centroid with north as the
reference, which simply gives a sense of the direction in which the features
of one field are situated with respect to the other. For a model whose output
variable has small separation distance and good area overlap with the
observed feature (e.g., CRCM–CGCM3), the bearing is perhaps not very
meaningful. However, for those with larger separation distances and less area
overlap, the bearing could prove useful to a modeler
hoping to diagnose how the model failed. The models here have fairly good spatial pattern matches, but
MM5I–HadCM3 is a candidate for checking the bearing to see if the problem
exists for other variables.
At lower thresholds than 75 % frequency (not shown), an additional area of
high frequency is generally observed in the southwest near or over Baja California. Careful inspection of models using the CCSM3 as the driving model
reveals that there is a tendency for more numerous, but smaller, features
than produced by the NARR or other driving models (cf.
Table ). In each case, these disjoint features are
merged (using centroid distance as the primary criterion) before comparing
with the NARR as they are primarily located in the southeast region.
The values in Table can be combined into a single
summary using a fuzzy logic algorithm, which yields a measure called total
interest that incorporates user-specified weights in order to obtain a
measure based on the attributes of a feature that are most important. It
ranges between zero and 1, where a value of 1 indicates a perfect match
and the worst value is zero. The technique is performed for these features
using previously proposed interest maps and weights. All of the total interest values are very high,
ranging from 0.91 to 0.94, indicating good agreement between the models and
the NARR.
Results from deforming climate models to better spatially
align with NARR reanalysis for κ. RMSE0 is the original
RMSE, RMSE1 the resulting RMSE between the deformed model and NARR,
and the bending energy is a summary measure of the amount of nonlinear
deformations applied to deform the field.

Model        RMSE0   RMSE1   % RMSE     Bending
                             reduction  energy
CRCM–CCSM3   0.214   0.1393  35         0.9555
CRCM–CGCM3   0.1467  0.1028  30         1.0739
HRM3–HadCM3  0.1569  0.11    30         0.2531
MM5I–CCSM3   0.2665  0.1605  40         2.0042
MM5I–HadCM3  0.1477  0.084   43         0.6933
WRFG–CCSM3   0.2493  0.0961  61         3.2692
WRFG–CGCM3   0.2406  0.0918  62         3.3178
CRCM–NCEP    0.214   0.1727  19         0.2545
WRFG–NCEP    0.1711  0.0923  46         0.4304
It is also of interest to determine if one model stands out above others. To
do so, we use the SPCT with AE loss, which is a very conservative test
because small-scale errors and spatial displacements are not taken into
account. None of the results is statistically significant at any
reasonable level, suggesting that the null hypothesis of equal performance (as
measured by the mean AE loss differential) cannot be rejected. In order to
factor spatial alignment and small-scale errors into the test, the SPCT is
also applied with AE + deformation loss.
Indeed, inspection of the graphs of κ (Fig. ) clearly
reveals that some models capture the spatial patterns of the high-event
frequency CAPE areas better than others. Field deformation techniques are
well-established methods for verifying forecasts spatially where small
mis-alignments in space obfuscate model performance.
Table displays the results of having found the
optimal deformation for each model deformed to better align spatially with
the NARR reanalysis. Shown are the original RMSE, denoted RMSE0, the RMSE
after having applied the optimal deformation, RMSE1, the percent reduction
in RMSE, and the minimum bending energy required to arrive at the optimal
re-alignment. The minimum bending energy does not summarize the entire
deformation, only its non-affine part. Thus, a small bending energy does not
imply that the deformation is necessarily small, but rather that nonlinear
distortions are not abundant. However, the bending energy is useful as a
comparison because a field, A, with higher bending energy than a field, B,
implies that A matches less well than B with the 0-energy field in terms of
overall shapes of patterns. A perfect model would have zero RMSE0 and thus
no reduction in error or bending energy. A good model will have a low
RMSE0 paired with low bending energy and often a relatively high reduction
in error. A bad model will have relatively high RMSE0 and either high
reduction in error paired with high bending energy, or low reduction in error
paired with low bending energy.
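The percent reduction column follows directly from the two RMSE columns. A
minimal sketch, using three (RMSE0, RMSE1) pairs copied from the κ table;
the function name is illustrative only.

```python
def percent_rmse_reduction(rmse0, rmse1):
    """Percent reduction in RMSE achieved by applying the deformation."""
    return 100.0 * (rmse0 - rmse1) / rmse0


# (RMSE0, RMSE1) pairs taken from the deformation results for kappa.
kappa_rows = {
    "CRCM-CCSM3": (0.214, 0.1393),
    "HRM3-HadCM3": (0.1569, 0.11),
    "WRFG-CGCM3": (0.2406, 0.0918),
}

reductions = {
    name: round(percent_rmse_reduction(r0, r1))
    for name, (r0, r1) in kappa_rows.items()
}
print(reductions)  # {'CRCM-CCSM3': 35, 'HRM3-HadCM3': 30, 'WRFG-CGCM3': 62}
```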
Figures – display examples of the resulting
field deformations for κ: a typical deformation for these cases
(Fig. ), the HRM3–HadCM3 (Fig. ), which
requires very little deformation because the original field is already
closely aligned with the NARR, and a case where the spatial alignment (and
intensities) are fairly poor, resulting in a more tortured deformation
(Fig. ). In most cases, a small amount of affine and
nonlinear deformation results in considerable error reduction. The cases
that require more nonlinear deformations (MM5I–CCSM3, WRFG–CCSM3, and
WRFG–CGCM3; latter two not shown) stand out in both the “distance traveled”
and “deformed 1-energy” panels for requiring a relatively large amount of
deformation in order to match well with the NARR data product.
Deformation results for κ. Top left is NARR reanalysis (0-energy field),
top middle is CRCM–CCSM3 (1-energy field), top right is the error between
NARR and CRCM–CCSM3. Bottom left shows the distance that the intensity
“traveled” to arrive at each grid point, bottom middle is the deformed
CRCM–CCSM3 field, and bottom right is the error field between NARR and the
deformed CRCM–CCSM3.
Severe thunderstorms require high CAPE, which is basically a measure of the
amount of energy available to create very strong updrafts in thunderstorms.
High CAPE environments have a warm, moist boundary layer, with colder air
aloft, the latter of which increases conditional instability. Proximity to
warm, large bodies of water in the domain (i.e., the Gulf of Mexico,
Caribbean, and Gulf of California) plays a large role in dictating the
spatial distribution of high CAPE in the domain as they are the primary
sources of moisture. Moisture transport mechanisms also play a role. High
CAPE does not often occur at high elevation or near the west coast because
near-surface moisture is too low and/or near-surface temperature is too cold.
In the CCSM3-driven simulations, the RCMs inherit an atmosphere that is too
dry from the CCSM3 in the
warm season . This dryness would
strongly affect the frequency of high CAPE values in the central part of the
country during the dominant season for severe weather in the region. In the
MM5I vs. the CRCM–CCSM3-driven simulations, it is likely that moisture
transport mechanisms simulated by the regional models are playing a strong
role in dictating the distribution of moisture, thus resulting in the spatial
distribution of high CAPE frequencies east of the Mississippi River. In the
HRM3–HadCM3, it is likely that warm-season low-level winds are a bit too
southeasterly through the Plains, carrying more moisture into the High Plains
and Rocky Mountain region than is observed, leading to the high CAPE
frequencies seen from central Mexico north through Wyoming and eastern
Montana.
Same as Fig. , but for HRM3–HadCM3.
Same as Fig. , but for MM5I–CCSM3.
The reductions in error range from only about 19 % (CRCM–NCEP) to about
61–62 % (WRFG–CCSM3 and WRFG–CGCM3); the WRFG–NCEP case had the third
highest reduction in error. Indeed, the WRFG model combinations had some of
the worst spatial alignment with the NARR, so the improvement induced by
re-alignment is the most drastic. It should be noted, however, that the
deformations for the two WRFG cases, WRFG–CCSM3 and WRFG–CGCM3, also have
the largest amount of nonlinear deformation, with minimum bending energies
much greater than those of any other model combination. Inspection of the
graphs of the deformations (not shown) suggests that the linear deformations
are also large for these cases. HRM3–HadCM3 has the least amount of bending
energy, and only a small amount of affine displacement from the NARR.
Nevertheless, with only a small amount of deformation, this model still
reduces its RMSE, which is small to begin with, by almost 30 %.
Field deformation results for ω (Table ) are,
not surprisingly, similar to those for κ, with percent reduction in
RMSE ranging from about 11 % to about 51 %. Bending energies are broadly
similar, although this time the MM5I–CCSM3 requires the most bending energy,
at a much higher value of almost six. Results for the CRCM–CGCM3 configuration
are arguably the worst, with a relatively large RMSE0 of about 0.11 and a
very small reduction in error of only about 11 % that is achieved only after
requiring a relatively high amount of bending energy (≈1.15).
Same as Table , but for ω.
Model          RMSE0    RMSE1    % RMSE reduction   Bending energy
CRCM–CCSM3     0.1313   0.1006   23                 0.0228
CRCM–CGCM3     0.1081   0.0966   11                 1.1484
HRM3–HadCM3    0.0977   0.0673   31                 0.265
MM5I–CCSM3     0.1802   0.1101   39                 5.988
MM5I–HadCM3    0.1308   0.091    30                 1.132
WRFG–CCSM3     0.1849   0.0914   51                 0.9347
WRFG–CGCM3     0.1948   0.1087   44                 0.8507
CRCM–NCEP      0.0939   0.0781   17                 0.2466
WRFG–NCEP      0.1242   0.0704   43                 0.5423
SPCT results when AE + deformation loss is applied to κ.
Results shown are only for those cases with p values ≤ 50%. Values
shown are the mean loss differential statistic and associated p value in
parentheses. Bold face emphasizes the “better” model according to the test;
where negative (positive) values mean model 1 (model 2) is better. (∗∗∗)
indicates significance at the ≈0 % level, (∗∗) at the 5 %
level, (∗) at the 10 % level, (†) at the 20 % level. Note, the
CRCM–NCEP case is not included because a good-fitting variogram could not be
found for any of the loss differential fields associated with this
model.
Model 1        Model 2        SPCT statistic   p value
CRCM–CCSM3     CRCM–CGCM3     -1.24            0.21
CRCM–CCSM3     HRM3–HadCM3    1.15             0.25
CRCM–CCSM3     WRFG–CGCM3     -0.94            0.35
CRCM–CGCM3     MM5I–CCSM3     1.02             0.31
CRCM–CGCM3     HRM3–HadCM3    1.66             0.10 ∗
CRCM–CGCM3     MM5I–HadCM3    1.24             0.22
CRCM–CGCM3     WRFG–NCEP      0.85             0.40
HRM3–HadCM3    MM5I–CCSM3     -1.71            0.09 ∗
HRM3–HadCM3    MM5I–HadCM3    -0.75            0.46
HRM3–HadCM3    WRFG–CCSM3     -1.45            0.15 †
HRM3–HadCM3    WRFG–CGCM3     -3.06            0.002 ∗∗∗
HRM3–HadCM3    WRFG–NCEP      -0.89            0.37
MM5I–CCSM3     WRFG–CCSM3     -0.88            0.38
MM5I–CCSM3     WRFG–CGCM3     -2.12            0.03 ∗∗
MM5I–HadCM3    WRFG–CCSM3     -0.96            0.33
MM5I–HadCM3    WRFG–CGCM3     -1.45            0.15 †
WRFG–CCSM3     WRFG–CGCM3     -0.91            0.36
WRFG–CCSM3     WRFG–NCEP      0.79             0.43
WRFG–CGCM3     WRFG–NCEP      1.42             0.16 †
Following , the SPCT is applied with AE plus deformation
loss induced by the above deformations. Some relatively significant results
are now found, including one case with better than 1 % significance, one with
better than 5 % significance, two cases with better than 10 % significance,
and three with about 15 % significance. Table displays
the test results for the cases where the p value is less than 0.50. As
mentioned above, the HRM3–HadCM3 model appears to be the closest to the NARR
in terms of spatial pattern and location, as well as having about the right
frequencies in these areas; only a relatively small amount of deformation is
needed to optimize the alignment. Consequently, it is no surprise that this
model is shown to be better than all the other models, with the difference
being significant at the 10 % level or better in three of the comparisons
according to the SPCT with
AE + deformation loss. Models with the HadCM3 component generally fared very
well under this test, and the MM5I combinations also fared well. In general,
the worse models failed to capture the spatial extent of areas with
frequently high values of CAPE and WmSh. They tend to miss, or underpredict,
the high frequencies in the northwest extending to eastern Colorado and
Wyoming compared with NARR. They also tend to project considerably less
frequency in the southwestern part of the domain.
Despite the fact that the ω deformation results are similar to those
for κ, the SPCT with AE + deformation loss results are less similar.
However, the HRM3 configurations do still tend to outperform other models, in
one case with statistical significance at almost the 10 % level
(Table ).
Same as Fig. , but for ω.
Model 1        Model 2        SPCT statistic   p value
CRCM–CCSM3     CRCM–CGCM3     -1.09            0.27
CRCM–CCSM3     MM5I–CCSM3     -1.31            0.19 ∗
CRCM–CCSM3     MM5I–HadCM3    -0.84            0.40
CRCM–CCSM3     WRFG–CCSM3     -1.03            0.30
CRCM–CCSM3     WRFG–CGCM3     -0.82            0.41
CRCM–CGCM3     CRCM–NCEP      0.76             0.45
HRM3–HadCM3    MM5I–CCSM3     -1.64            0.10 ∗
HRM3–HadCM3    MM5I–HadCM3    -0.82            0.41
HRM3–HadCM3    WRFG–CCSM3     -1.26            0.21
HRM3–HadCM3    WRFG–CGCM3     -0.8             0.43
MM5I–CCSM3     WRFG–CCSM3     0.9              0.37
MM5I–CCSM3     WRFG–CGCM3     0.85             0.40
MM5I–CCSM3     CRCM–NCEP      1.63             0.10 ∗
MM5I–CCSM3     WRFG–NCEP      1.38             0.17 ∗
MM5I–HadCM3    CRCM–NCEP      0.82             0.41
WRFG–CCSM3     CRCM–NCEP      1.76             0.08 ∗
WRFG–CCSM3     WRFG–NCEP      1.36             0.17 ∗
WRFG–CGCM3     CRCM–NCEP      1.33             0.18 ∗
WRFG–CGCM3     WRFG–NCEP      1.36             0.17 ∗
CRCM–NCEP      WRFG–NCEP      -0.76            0.45
Conclusions
In this study, several advanced weather forecast verification techniques for
high-resolution gridded verification sets are applied in a novel way to
severe-storm indicators from several of the North American Regional Climate
Change Assessment Program (NARCCAP) climate models. In particular, focus is
placed on how well the models capture the frequencies of
severe-storm environments when the field energy is high, where field energy
is defined by the upper quartile over space and is
considered to be high when it exceeds its 90th percentile over time. For
ease of discussion, we denote by κ the frequency of CAPE
exceeding 1000 J kg$^{-1}$ conditional upon high field energy for CAPE
and, similarly, by ω the frequency of WmSh exceeding
225 m$^2$ s$^{-2}$ conditional upon high field energy for WmSh, where
$\mathrm{WmSh} = \sqrt{2\cdot\mathrm{CAPE}}\cdot S$ and S denotes 0–6 km
vertical wind shear (m s$^{-1}$), provided that CAPE ≥ 100 J kg$^{-1}$ and
5 ≤ S ≤ 50 m s$^{-1}$ (WmSh is zero otherwise). Previous studies found
concurrently high
values of CAPE and S to be important indicators of severe-storm activity, and
the derived WmSh indicator from these coarse-scale variables discriminates
severe-storm activity well as a univariate variable.
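The definitions above can be sketched in a few lines. This is a minimal
illustration under stated assumptions, not the code used in the study: the
function names are hypothetical, the quantile is a simple nearest-rank
estimate (other definitions would give slightly different cutoffs), "high"
field energy is taken to mean at or above the 90th percentile, and the toy
fields are invented for the example.

```python
import math


def wmsh(cape, shear):
    """WmSh = sqrt(2 * CAPE) * S if CAPE >= 100 and 5 <= S <= 50, else 0."""
    if cape >= 100.0 and 5.0 <= shear <= 50.0:
        return math.sqrt(2.0 * cape) * shear
    return 0.0


def quantile(values, q):
    """Nearest-rank empirical quantile (an assumed, simple definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def conditional_exceedance_frequency(fields, threshold):
    """Per-grid-point frequency of exceeding `threshold`, computed only over
    the time steps with high field energy.

    fields: list over time of flat lists over space.
    Field energy at each time is the upper quartile over space; it is
    treated as high when it is at or above its 90th percentile over time.
    """
    energy = [quantile(field, 0.75) for field in fields]
    cutoff = quantile(energy, 0.90)
    high = [field for field, e in zip(fields, energy) if e >= cutoff]
    n_points = len(fields[0])
    return [
        sum(field[i] > threshold for field in high) / len(high)
        for i in range(n_points)
    ]


print(wmsh(1250.0, 10.0))  # 500.0 (sqrt(2500) * 10)
print(wmsh(50.0, 10.0))    # 0.0 (CAPE below 100 J/kg)

# Toy CAPE series: 10 time steps, 4 grid points (hypothetical values).
cape_fields = [[115 * t, 200 * t, 300 * t, 400 * t] for t in range(10)]
kappa = conditional_exceedance_frequency(cape_fields, 1000)
print(kappa)  # [0.5, 1.0, 1.0, 1.0]
```

Applying `conditional_exceedance_frequency` to a series of WmSh fields with
a threshold of 225 would give the analogous sketch for ω.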
In general, the NARCCAP runs underestimate the spatial extent of
high-frequency κ and ω. The HRM3–HadCM3 model run performs the
best, having an area ratio near unity at ≈90 % and an intersection
area of about 89 % for κ, and about 90 % in both categories for
ω. The CRCM–NCEP run is the next best in this regard with only
slightly lower ratios. For ω the numbers are similar for these models,
although the CRCM–NCEP has slightly better overlap (intersection area about
0.91 vs. 0.84) and the CRCM–CGCM3 also has a high area ratio (0.82 vs. 0.60)
and intersection area (0.85 vs. 0.66). Otherwise, the area ratios for high
frequencies for most models range between about 20 and 60 % (0.91 for
HRM3–HadCM3) for κ and between about 35 and 95 % for ω;
results are similar for intersection area for both frequencies.
The application of binary image metrics suggests that overall the models do
reasonably well at capturing high frequencies of κ and ω, but
for the very high-frequency areas, some models perform less well. In
particular, mean error distance and Baddeley's Δ are applied, and for
thresholds above 80 %, it is found that the best runs at capturing κ
are those using the HRM3 and WRFG regional models. The worst at
capturing the spatial patterns for κ are those with the CRCM and MM5I
regional models, as well as those driven by CCSM3 and CGCM3. For ω,
the runs with NCEP as the driving model perform worse at capturing frequency
area patterns for frequencies above about 80 %, as do those utilizing
CCSM3 and CGCM3.
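For readers unfamiliar with these two metrics, the following is a small
brute-force sketch of how they can be computed from binary event fields
(for example, a frequency field thresholded at 0.8). It assumes the common
definitions: mean error distance as the average distance from each event
cell of one field to the nearest event cell of the other, and Baddeley's Δ
as the L_p average, over the whole domain, of the difference between the
two (optionally capped) distance maps. Function names, the default p = 2,
and the toy masks are illustrative assumptions; each mask must contain at
least one event cell.

```python
import math


def distance_map(mask):
    """Euclidean distance from every grid cell to the nearest event cell."""
    events = [
        (i, j) for i, row in enumerate(mask) for j, v in enumerate(row) if v
    ]
    return [
        [min(math.hypot(i - a, j - b) for a, b in events)
         for j in range(len(row))]
        for i, row in enumerate(mask)
    ]


def mean_error_distance(mask_a, mask_b):
    """Average distance from each event cell of B to the nearest event in A.

    Note that this measure is not symmetric in A and B.
    """
    d_a = distance_map(mask_a)
    dists = [
        d_a[i][j]
        for i, row in enumerate(mask_b)
        for j, v in enumerate(row)
        if v
    ]
    return sum(dists) / len(dists)


def baddeley_delta(mask_a, mask_b, p=2.0, cutoff=math.inf):
    """L_p average over the domain of |min(d_A, c) - min(d_B, c)|."""
    d_a, d_b = distance_map(mask_a), distance_map(mask_b)
    terms = [
        abs(min(x, cutoff) - min(y, cutoff)) ** p
        for row_a, row_b in zip(d_a, d_b)
        for x, y in zip(row_a, row_b)
    ]
    return (sum(terms) / len(terms)) ** (1.0 / p)


# Toy 1 x 4 event fields: the event in B is displaced by one grid cell.
obs = [[1, 0, 0, 0]]
mod = [[0, 1, 0, 0]]
print(mean_error_distance(obs, mod))  # 1.0
print(baddeley_delta(obs, mod))       # 1.0
print(baddeley_delta(obs, obs))       # 0.0
```

The brute-force distance map is quadratic in the number of grid cells, so a
dedicated distance transform would be preferable for fields of realistic
size.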
The above results consider only spatial areas of high-frequency severe-storm
environments. Two methods are utilized in this study to address both the
spatial alignment and intensity (i.e., frequencies) simultaneously: the
forecast quality index (FQI; not shown) and the spatial prediction comparison
test (SPCT) with absolute error (AE) + field deformation loss. For both
κ and ω, FQI results suggest that all of the models perform
best at capturing severe-storm environment frequencies and spatial patterns
of those frequencies for thresholds between about 30 and 75 %. At the
lowest thresholds, CRCM–CCSM3 and CRCM–NCEP stand out as being exceptionally
good for both κ and ω, meaning that they capture the areas of
severe-storm environments well, although they may underpredict how
frequently such environments occur there. The SPCT with AE + field
deformation loss
is an overall estimation of how well the models perform directly (without
relying on setting thresholds). For κ, the HRM3–HadCM3 model is
clearly the best model, but models driven by CCSM3 fare well generally, as do
those with the CRCM regional model. For ω, the CRCM–CCSM3 is the clear
winner over all other models, but the HRM3–HadCM3 also performs well. The
MM5I regional model is generally outperformed by other models.
The utility of applying spatial forecast verification techniques for climate
model evaluation studies is presented, and the results of this study for
severe-storm environments provide important insight into how to interpret
future model runs for these NARCCAP models. In particular, caution is
required when considering very high frequencies for ω, and focus
should be restricted to more moderate thresholds. Moreover, the spatial
extent of future storm environments may be underestimated by nearly all
of the model runs, and more weight should be put on the HRM3–HadCM3 run than
on other models, with considerably less weight on model combinations
involving CGCM3.
Some methods provide analogous information, which lends consistency in
ascertaining model performance, but each can provide its own unique
perspective depending on the fields in question. For example, image warping
is a highly complicated approach, which could be considered unnecessary for
simply inferring how far off each model is from the NARR. On the other
hand, it is the only method known to the current authors that provides
a statistical hypothesis test (or confidence intervals) that accounts for
both spatial correlation and displacement errors. The binary image metrics
such as the Hausdorff, partial Hausdorff, and Baddeley Δ all provide
distributional summaries of the absolute difference in distance maps between
two binary “event” fields, with Δ providing arguably the most useful
information. A summary of these measures can be found
in and . The
FQI incorporates such displacement information, but also intensity
information, so it may provide information that is redundant with these
other distance-map-based measures, but depending on the intensities, it
could also yield different results. The feature-based approaches utilize
many of these same types of information, but inform about specific features
within a field, which in the present case is less important, but does
describe how some models have two smaller features instead of one large
feature (i.e., area of higher frequency κ/ω).