Generalised block bootstrap and its use in meteorology

László Varga (vargal4@chello.hu) and András Zempléni
Department of Probability Theory and Statistics, Eötvös Loránd University, Budapest, Hungary

Advances in Statistical Climatology, Meteorology and Oceanography (ASCMO), 3, 55-66, 2017. doi:10.5194/ascmo-3-55-2017. Copernicus Publications, Göttingen, Germany.
Received: 6 June 2016; revised: 12 April 2017; accepted: 9 May 2017; published: 14 June 2017.
This work is licensed under the Creative Commons Attribution 3.0 Unported License (https://creativecommons.org/licenses/by/3.0/).
In an earlier paper, we emphasised the importance of investigating the effective sample size in the case of autocorrelated data. Those simulations were based on the block bootstrap methodology. However, the discreteness of the usual block size did not allow for exact calculations. In this paper we propose a new generalisation of the block bootstrap methodology, which allows for any real number greater than 1 as the expected block size. We relate it to the existing optimisation procedures and apply it to a temperature data set. Our other focus is on statistical tests, where the actual sample size quite often plays an important role, even for relatively large samples. This is especially the case for copulas, which are used for investigating the dependencies among data sets. As the time dependence cannot be neglected in quite a few real applications, we investigated the effect of this phenomenon on the test statistic used. The critical value can be computed by the proposed new block bootstrap simulation, where the block size is determined by fitting a VAR model to the observations. The results are illustrated for models of the temperature data used.
Introduction
In recent decades the bootstrap methodology has become more and more widespread in different areas of statistical applications; reviews cover possible areas from spatial models to financial data and data mining, where the bootstrap may be used. In this paper we focus on the effect of serial dependence, which naturally arises in many time series data. The bootstrap samples must match the dependence within the data, so the block bootstrap is the suggested method for bootstrapping time series. This approach has been investigated in some detail in the literature, including suggestions for selecting the optimal block size. In an earlier paper, we investigated the possibilities of using block bootstrap methods for checking the validity of copula models. In this paper we present an improvement of the classical block bootstrap methodology, which is especially relevant in our applications.
In Sect. 2, we first briefly review the importance of stationarity of time series. In the bivariate case, the vector autoregression (VAR) process is one of the most important models; it first became popular in the area of econometrics and has recently been applied in meteorology as well. The applicability of VAR models to monthly temperature, precipitation and wind speed data has been investigated, and in quite a few cases VAR(1) was found to be among the best models. We briefly present the main properties of VAR models, which are used in the sequel, and introduce our notation.
In Sect. 3, we introduce the concept of copulas, the most convenient objects for analysing the dependence structures among variables. Their history goes back to the 1950s, but their applications are much more recent. However, they have spread very quickly to the most important areas, including meteorology, where overviews of the possible applications of copula models have been provided, more specifically for the joint analysis of temperature and precipitation data. Most of these works use different parametric copula models, but we are more interested in testing for possible changes in the dependency structure, so we also introduce the most recent approaches to testing the homogeneity of such models, which are based on the empirical copula process.
Section 4 is devoted to the bootstrap resampling method, including the block bootstrap approach, which is suitable for the case of serially dependent observations. Here we introduce a generalisation that overcomes the problem that the block size was originally supposed to be a natural number. In our approach the block size is a random variable whose expectation can be any real number greater than 1, and the method contains the original block bootstrap as a special case. Owing to the small variance of the block sizes, it avoids the problem of the extensive random error that arises for the stationary bootstrap.
Section 5 shows the results of our simulations regarding the properties of the proposed homogeneity test. It turned out that the test is consistent and has reasonable power for relatively small sample sizes. We have also investigated the effect of the block size on the properties of the test.
In Sect. 6 we apply our approach to the gridded temperature database of E-OBS, a product of the EU-FP6 project ENSEMBLES. Here we use the daily mean temperature data from the 0.5° grid. Our focus is on checking for possible changes in the dependence pattern between the grid point close to Budapest and some other grid points within the Carpathian Basin. We show that in some cases there is a significant deviation from homogeneity between the first and the second part of the data. The application of bootstrap methods in the context of statistical inference for copulas is a recent but quickly expanding area. A quick method for bootstrapping the p values in goodness-of-fit tests has been proposed; most relevant for us is the result that proves the consistency of the block bootstrap method for the empirical copula under general conditions.
The conclusion summarises our findings and gives some interesting open questions.
Vector autoregression (VAR) processes
We call the d-dimensional random variable series {Xt}t∈Z=(X0,X±1,X±2,…) a time series if its elements are d-dimensional random vectors, which are usually not independent of each other. Here we consider t as time. Let us assume that the random variables have finite second moments. The time series {Xt}t∈Z is weakly stationary (hereafter simply called stationary) if neither the mean function E(Xt) nor the covariance matrix Cov(Xt+s,Xt) depends on t, for any s∈Z. Stationarity is an important property: it means that translation invariance holds for the statistical properties of the series (mean, variance, autocovariance structure, etc., depending on the specific notion of stationarity considered).
One of the most frequently applied time series models is the autoregressive (AR) process and its multidimensional counterpart, the vector autoregressive (VAR) model. In the following, we define the VAR(p) process and give its main properties in two dimensions, as this is what our applications require.
The time series {Xt}t∈Z={(X1,t,X2,t)T}t∈Z is called a zero-mean two-dimensional VAR(p) process if

$$X_t = A_1 X_{t-1} + A_2 X_{t-2} + \ldots + A_p X_{t-p} + \varepsilon_t,$$

where A1,…,Ap are 2×2 parameter matrices and the independent innovation process {εt}t∈Z is a two-dimensional white noise with E(εt)=0=(0,0)T and Cov(εt)=C, a symmetric positive definite covariance matrix. The VAR(p) process is stationary if the roots of the characteristic polynomial $P(x) = \det(I_2 - A_1 x - \ldots - A_p x^p)$ lie outside the unit circle.
Any VAR(p) process can be rewritten as a VAR(1) process in the following way: Zt=AZt-1+et, where

$$Z_t = \begin{pmatrix} X_t \\ X_{t-1} \\ \vdots \\ X_{t-p+1} \end{pmatrix}, \qquad e_t = \begin{pmatrix} \varepsilon_t \\ 0 \\ \vdots \\ 0 \end{pmatrix} \qquad \text{and} \qquad A = \begin{pmatrix} A_1 & A_2 & \cdots & A_{p-1} & A_p \\ I_2 & 0 & \cdots & 0 & 0 \\ 0 & I_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & I_2 & 0 \end{pmatrix}.$$

This representation is convenient for calculating the autocovariances. An equivalent condition for stationarity is that all eigenvalues of the coefficient matrix A are smaller than 1 in modulus. In this case the time series has a causal representation in the form of an infinite moving average, $Z_t = \sum_{i=0}^{\infty} A^i e_{t-i}$.
For the remainder of this section we assume that Xt is stationary.
Let us denote by $\Gamma_X(h) = E(X_{1+h} X_1^T)$ the autocovariance function of the process Xt. $\Gamma_X(h)$ is a 2×2 matrix-valued function, whose elements we denote by $\gamma_{i,j}(h)$. We denote by $\Gamma_Z(h) = E(Z_{1+h} Z_1^T)$ the 2p×2p matrix-valued autocovariance function of the process Zt. The covariance matrix of Zt is $\Gamma_Z(0)$, which can be determined by solving the matrix equation $\Gamma_Z(0) - A\,\Gamma_Z(0)\,A^T = \mathrm{Cov}(e_t)$. It is easy to see that for integers h≥1 the autocovariances can be calculated as $\Gamma_Z(h) = A^h \Gamma_Z(0)$. The powers of the matrix A can easily be computed using the spectral decomposition. Lastly, we need the autocovariance matrix of the original process; by construction, it is the upper left 2×2 submatrix of $\Gamma_Z(h)$.
In the applications we will use the covariance matrix of the sample mean. The following asymptotic result will be crucial in our investigations: if $\sum_{h=-\infty}^{\infty} |\gamma_{i,i}(h)| < \infty$ for i=1,2, then

$$n \cdot \mathrm{tr}\,\mathrm{Cov}\big(\overline{X}_n\big) \longrightarrow \sum_{i=1}^{2} \sum_{h=-\infty}^{\infty} \gamma_{i,i}(h) \qquad \text{as } n \to \infty,$$

where tr(⋅) denotes the trace of a matrix.
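To make these formulas concrete, the following minimal Python sketch (assuming numpy and scipy; illustrative code, not from the paper) computes $\Gamma(0)$ for a bivariate VAR(1) by solving the Lyapunov-type matrix equation above and then accumulates the long-run sum of autocovariances whose trace appears in the limit. The parameter matrices are the Budapest and Apatovac estimates quoted later in the paper.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Example VAR(1): X_t = A X_{t-1} + eps_t, Cov(eps_t) = C.
# For p = 1 the companion matrix is simply A itself.
A = np.array([[0.097, 0.216],
              [-0.103, 0.403]])
C = np.array([[0.449, 0.406],
              [0.406, 0.436]])

# Gamma(0) solves Gamma(0) - A Gamma(0) A^T = C (discrete Lyapunov equation)
gamma0 = solve_discrete_lyapunov(A, C)

def gamma(h):
    """Autocovariance for lag h >= 1: Gamma(h) = A^h Gamma(0)."""
    return np.linalg.matrix_power(A, h) @ gamma0

# Long-run sum: Gamma(0) + sum_{h>=1} (Gamma(h) + Gamma(h)^T);
# the terms decay geometrically, so a truncated sum suffices.
S = gamma0.copy()
for h in range(1, 200):
    g = gamma(h)
    S += g + g.T

# n * tr Cov(mean) converges to tr(S) = sum_i sum_h gamma_ii(h)
print(np.trace(S))
```

For a stationary VAR(1) the truncated sum can be checked against the closed form $(I-A)^{-1}\Gamma(0) + \Gamma(0)(I-A^T)^{-1} - \Gamma(0)$.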
It is important to check whether the chosen time series model is adequate. If the model fits well, the fitted residuals should behave like a realisation of a white noise process. The hard part is to check whether the residuals are independent, i.e. that there is no serial dependence among them. There are several methods for verifying this property; the most standard is the Ljung–Box test, which tests whether a specified group of autocorrelations (usually the first 10–20 lags) differs from zero. Another frequently applied serial correlation test is the Breusch–Godfrey test. A more recent multidimensional approach, based on the empirical copula process, has also been published; the main ideas and the concept of that test stem from earlier copula-based independence tests.
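As an illustration, the Ljung–Box statistic can be computed directly from the sample autocorrelations. The sketch below (numpy/scipy, written for this note rather than taken from the paper) omits the degrees-of-freedom correction that is usually applied when the series consists of residuals from a fitted model.

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(x, lags=10):
    """Ljung-Box Q statistic and p value for the first `lags` autocorrelations."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    # sample autocorrelations rho_k, k = 1..lags
    rho = np.array([np.sum(xc[k:] * xc[:-k]) / denom for k in range(1, lags + 1)])
    q = n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, lags + 1)))
    pval = chi2.sf(q, df=lags)   # no fitted-model df correction in this sketch
    return q, pval

rng = np.random.default_rng(0)
q, p = ljung_box(rng.standard_normal(500))
print(q, p)  # for white noise the p value is typically large
```

Applied to a strongly autocorrelated series (e.g. a random walk), the same function returns a very small p value.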
For further details about time series analysis, see, for example, the standard textbooks on the subject.
Copulas and their goodness-of-fit
Bivariate data and the corresponding pseudo-observations.
Let X=(X1,…,Xd)T be a random vector with joint distribution function FX(x)=FX1,…,Xd(x1,…,xd) and marginal distribution functions F1(x1)=FX1(x1),…,Fd(xd)=FXd(xd). Sklar's theorem states that there exists a copula C, a distribution over the d-dimensional unit cube with uniform margins, such that

$$F_{X_1,\ldots,X_d}(x_1,\ldots,x_d) = C\big(F_1(x_1),\ldots,F_d(x_d)\big).$$

Moreover, the copula C is unique if the marginal distribution functions are continuous. This construction allows the dependence structure to be investigated without specifying the marginal distributions. In the recent literature, various families of copulas have been introduced; for an overview and examples see, e.g., an introductory textbook on copulas.
In this paper we focus on testing the homogeneity of copulas, motivated by the question of whether climate change also has an effect on the dependence between pairs of temperature observations. If this change is indeed observable, then it may have a substantial effect on the spatial structure of temperature anomalies, worthy of further meteorological investigation. Thus we do not need to go into parametric inference, as we are only interested in the homogeneity analysis.
Let us suppose we have two independent samples of Rd-valued random vectors. The first sample is X1,…,Xn and the second one is Y1,…,Ym. Formally, we intend to test the hypothesis that the dependence structures of the two samples arise from the same copula C0. The most obvious way of testing the homogeneity of two copulas would be to consider multidimensional χ2 approaches, but in this case we would need to discretise the data, losing valuable information. To avoid this, we can follow an approach based on the empirical copula, defined for the first sample as

$$C_n(\mathbf{u}) = \frac{1}{n} \sum_{i=1}^{n} I(\mathbf{U}_i \le \mathbf{u}),$$

where u∈Rd and Ui denotes the d-dimensional vector of the rank-based pseudo-observations: Ui=Ui,n=(Ui1,n,…,Uid,n), where n refers to the size of the sample and $U_{ij,n} = \frac{n}{n+1}\hat{F}_j(X_{ij})$, with $\hat{F}_j$ the empirical marginal distribution function. For an illustration see the figure, which depicts the original data points (standardised temperature data) and the Ui pseudo-observations for the grid point pair close to Budapest and Sopron. Similarly, based on the pseudo-observations Ui and Vi of the first and the second samples, respectively, we can define the empirical copulas C1,n(u) and C2,m(u) (where n and m denote the sample sizes).
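The rank-based pseudo-observations are straightforward to compute; the following is a minimal numpy sketch (the helper name is ours, chosen for illustration).

```python
import numpy as np

def pseudo_observations(x):
    """Rank-based pseudo-observations U_ij = rank_ij / (n + 1), per column."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    # argsort of argsort gives 0-based ranks; add 1 for 1-based ranks
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1
    return ranks / (n + 1.0)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 2))
u = pseudo_observations(x)
print(u)
```

Each column of the result is a permutation of 1/(n+1), …, n/(n+1), so the margins are (discretely) uniform on (0, 1) by construction.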
The proposed tests for checking the homogeneity of the two samples are based on functionals of the empirical process

$$\kappa_{n,m}(\mathbf{u}) = \frac{C_{1,n}(\mathbf{u}) - C_{2,m}(\mathbf{u})}{\sqrt{\frac{1}{n} + \frac{1}{m}}},$$

where the asymptotic properties of the statistic can be based on the limit of the empirical copula processes. Two different kinds of approaches have been investigated: the Cramér–von Mises type statistic $S_{n,m} = \int_{[0,1]^d} \kappa_{n,m}(\mathbf{u})^2 \, d\mathbf{u}$ and the Kolmogorov–Smirnov type statistic $T_{n,m} = \sup_{\mathbf{u}\in[0,1]^d} |\kappa_{n,m}(\mathbf{u})|$. As the second approach is generally considered less powerful, we based our inference on the statistic $K^* = \frac{1}{N^d} \sum_{i_1,\ldots,i_d} \kappa_{n,m}(t_{i_1},\ldots,t_{i_d})^2$, where $(t_{i_j})_{i_j=1}^{N}$, $j=1,\ldots,d$, are appropriately fine divisions of the interval (0,1). After some calculations, the Cramér–von Mises test statistic can be written in the following form:

$$S_{n,m} = \left(\frac{1}{n}+\frac{1}{m}\right)^{-1} \Bigg[ \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\prod_{s=1}^{d}\big(1-U_{is,n}\vee U_{js,n}\big) + \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m}\prod_{s=1}^{d}\big(1-V_{is,m}\vee V_{js,m}\big) - \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\prod_{s=1}^{d}\big(1-U_{is,n}\vee V_{js,m}\big) \Bigg],$$

where u∨v=max(u,v).
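The closed form of $S_{n,m}$ lends itself to a direct vectorised implementation. The sketch below (numpy; the function names are ours, not from any existing package) computes the statistic from two samples via their pseudo-observations.

```python
import numpy as np

def pseudo_obs(x):
    """Rank-based pseudo-observations, rank / (n + 1), per column."""
    n = x.shape[0]
    return (np.argsort(np.argsort(x, axis=0), axis=0) + 1) / (n + 1.0)

def cvm_two_sample(x, y):
    """Cramer-von Mises homogeneity statistic for the copulas of two samples."""
    u = pseudo_obs(np.asarray(x, dtype=float))
    v = pseudo_obs(np.asarray(y, dtype=float))
    n, m = u.shape[0], v.shape[0]

    def cross(a, b):
        # mean over pairs (i, j) of prod_s (1 - max(a_is, b_js))
        t = np.prod(1.0 - np.maximum(a[:, None, :], b[None, :, :]), axis=2)
        return t.mean()

    return (cross(u, u) + cross(v, v) - 2.0 * cross(u, v)) / (1.0 / n + 1.0 / m)

rng = np.random.default_rng(2)
x = rng.standard_normal((200, 2))
y = rng.standard_normal((200, 2))
print(cvm_two_sample(x, y))
```

Since the statistic is an integral of a squared process, it is non-negative, and it vanishes when both samples are identical.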
Bootstrap methods
The bootstrap is a usually computer-intensive resampling method for estimating the distribution of a statistic of interest. The concept of the bootstrap was introduced in a classical article by Bradley Efron and, since then, it has become one of the most widely used Monte Carlo methods in a number of areas of applied science.
Bootstrap for i.i.d. data
Let Xn=(X1,…,Xn)T be a sequence of independent,
identically distributed (i.i.d.) random variables with unknown common
univariate distribution F and let Tn=tn(Xn;F) be a
statistic (like the sample mean X‾). As F is unknown, the
distribution of the statistic Tn is also unknown. Our main purpose is to
approximate the distribution of Tn or its function of interest – for
example the standard deviation of Tn (the standard error) or some of its
quantiles for estimating p values. The basic bootstrap method (usually
referred to as the i.i.d. bootstrap) is the following. For a given Xn,
we draw a random sample Xm*={X1*,…,Xm*} of size m
(usually m=n) with replacement from Xn. Therefore, the
common distribution of the Xi* is given by the empirical distribution
F^n=n-1∑i=1nδXi, where δz is the
probability measure having unit mass at z. In the next step, we define the
bootstrap version of the statistic Tn:
Tm,n*=tm(Xm*;F^n). By repeating this
procedure, we can approximate the unknown distribution Gn of Tn by its
bootstrap counterpart Gn*. In most cases Gn* cannot be determined explicitly, but it can be approximated by simulation.
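The i.i.d. bootstrap just described can be sketched in a few lines, here estimating the standard error of the sample mean (illustrative numpy code, not part of the paper).

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(size=100)          # observed sample; F is treated as unknown
B = 2000                               # number of bootstrap replicates

# i.i.d. bootstrap: resample with replacement, recompute the statistic
boot_means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                       for _ in range(B)])

se_boot = boot_means.std(ddof=1)       # bootstrap standard error of the mean
se_theory = x.std(ddof=1) / np.sqrt(len(x))
print(se_boot, se_theory)              # the two should be close
```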
Block bootstrap methods
In our case we are interested in the effect of serial dependence on the
homogeneity tests and on modelling in general, for example on the covariance
matrix of our estimators. If the data are dependent then the estimates based
on i.i.d. bootstrap methods may not be consistent.
In the presence of serial dependence, one of the most commonly used methods is the block bootstrap. In this paper we generalise the circular block bootstrap (CBB), which can be defined as follows. First, we wrap the data X1,…,Xn around a circle, i.e. we define the series $\tilde{X}_t = X_{t \bmod n}$ (t∈N), where mod n denotes division modulo n. This means that $X_k = \tilde{X}_k = \tilde{X}_{k+n} = \tilde{X}_{k+2n} = \ldots$ for all k∈{1,2,…,n}. For some m, let i1,…,im be a uniform sample from the set {1,2,…,n}. Then, for a given block size b, we construct n′=m⋅b (n′≈n) pseudo-data: $\tilde{X}^*_{(k-1)b+j} = \tilde{X}_{i_k+j-1}$, where j=1,2,…,b and k=1,2,…,m.
Finally, let us calculate the function of interest, for example the bootstrap sample mean

$$\overline{\tilde{X}}{}^*_{n'} = \frac{\tilde{X}^*_1 + \ldots + \tilde{X}^*_{n'}}{n'}.$$
For the sake of simplicity we do not use this notation in the sequel; the asterisk simply denotes that the sample is a bootstrap sample.
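The construction above can be sketched as follows (numpy; `circular_block_bootstrap` is our illustrative name, not from an existing library).

```python
import numpy as np

def circular_block_bootstrap(x, b, rng):
    """One circular block bootstrap resample of (roughly) the original length."""
    x = np.asarray(x)
    n = len(x)
    m = int(np.ceil(n / b))                  # number of blocks
    starts = rng.integers(0, n, size=m)      # uniform block start indices
    idx = (starts[:, None] + np.arange(b)[None, :]) % n   # wrap around the circle
    return x[idx.ravel()][:n]                # trim to the original length

rng = np.random.default_rng(4)
x = np.sin(np.arange(200) / 5.0) + rng.standard_normal(200)
xb = circular_block_bootstrap(x, b=10, rng=rng)
print(xb[:5])
```

Each resampled value is an element of the original series; only the block structure is random.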
Block length plays an important role in the process, and it is not trivial to determine its optimal value. An "automatic" block length selection algorithm has been suggested (and later corrected), but its practical application is far from obvious, as it contains parameters that have to be chosen.
We used a similar approach in our previous paper. Our idea was to find the best block size by fitting a VAR model to the data and then checking the variance of X‾ with the help of the block bootstrap. For the sake of simplicity, we restrict our attention to VAR(p) models; the methodology should be general enough to be compatible with other, more complex classes of statistical models as well. There, the block size was determined as the b̂ for which the estimated trace of the covariance matrix was closest to the one derived from the fitted VAR model:

$$\hat{b} = \operatorname*{arg\,min}_{1 \le b \in \mathbb{Z}} \left| \mathrm{tr}\,\mathrm{Cov}\big(\overline{X}_{\mathrm{VAR}}\big) - \mathrm{tr}\,\mathrm{Cov}^*\big(\overline{X}^*_b\big) \right|,$$

where $\mathrm{Cov}^*(\overline{X}^*_b) = \mathrm{Cov}(\overline{X}^*_b \mid \mathbf{X}_n)$.
In the literature, simulations are naturally based on integer block sizes. But with the integer block length b̂ defined above, the estimated trace of covariance may not be close enough to the theoretical one. The same is true for other block size determination methods. This may cause substantial bias, as in our case the relative difference between subsequent values of $\mathrm{tr}\,\mathrm{Cov}^*(\overline{X}^*_b)$ can be quite large, especially for small b. This can be overcome by the following generalisation of the block bootstrap methodology.
Generalised block bootstrap
In the case of a real-valued block size b>1, let the generalised block bootstrap sample be defined as follows. Let k be a uniform random integer between 1 and the sample size n and, again, let us wrap the sample around the circle. The bootstrap blocks are of length ⌊b⌋ or ⌈b⌉:

$$\big(X_k, X_{k+1}, \ldots, X_{k+\lfloor b\rfloor - 1}\big) \quad \text{with probability } 1 - (b - \lfloor b\rfloor),$$
$$\big(X_k, X_{k+1}, \ldots, X_{k+\lceil b\rceil - 1}\big) \quad \text{with probability } b - \lfloor b\rfloor,$$

where ⌈b⌉ denotes the upper and ⌊b⌋ the lower integer part of b, so the expected block length is exactly b. Finally, we put the blocks together. This procedure ensures that for integer-valued b the new definition coincides with the traditional one, so it is indeed a generalisation. In the applications (Sect. 6) we show the clear advantages of this approach. Virtually all relevant algorithms for finding the optimal block size can easily be adapted to find a solution in this generalised sense. In our case, instead of the minimisation above, we simply solve, in b, the equation

$$\mathrm{tr}\,\mathrm{Cov}\big(\overline{X}_{\mathrm{VAR}}\big) = \mathrm{tr}\,\mathrm{Cov}^*\big(\overline{X}^*_b\big).$$
Like the circular block bootstrap sample, our generalised bootstrap sample is not a stationary process, conditional on the original sample. It is an important theoretical result that the stationary bootstrap sample is the only stationary block bootstrap sample; there, the block lengths follow a geometric distribution, independently of each other.
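The generalised block bootstrap can be sketched directly from the definition (illustrative numpy code, not the authors' implementation).

```python
import numpy as np

def generalised_block_bootstrap(x, b, rng):
    """Circular block bootstrap with real-valued expected block length b > 1.

    Each block has length floor(b) with probability 1 - (b - floor(b))
    and length ceil(b) with probability b - floor(b), so E(length) = b.
    """
    x = np.asarray(x)
    n = len(x)
    lo, frac = int(np.floor(b)), b - np.floor(b)
    out = []
    while len(out) < n:
        k = rng.integers(0, n)                         # uniform start index
        length = lo + (rng.random() < frac)            # floor(b) or ceil(b)
        out.extend(x[(k + np.arange(length)) % n])     # wrap around the circle
    return np.array(out[:n])

rng = np.random.default_rng(5)
x = np.cumsum(rng.standard_normal(300)) * 0.1
xb = generalised_block_bootstrap(x, b=8.71, rng=rng)
print(xb[:5])
```

For integer b the fractional part is zero and every block has length b, so the sketch reduces to the circular block bootstrap, mirroring the claim that the definition is a genuine generalisation.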
The covariance matrix $\mathrm{Cov}^*(\overline{X}^*_b)$ can be calculated explicitly. Henceforth, P* and E* denote the conditional distribution and the conditional expectation given the sample $\mathbf{X}_n$; therefore, $P^*(L_1 = \lceil b\rceil) = P(L_1 = \lceil b\rceil \mid \mathbf{X}_n)$ and $E^*(L_1) = E(L_1 \mid \mathbf{X}_n)$. Let us denote by $L_1, L_2, \ldots$ the block sizes; they are random variables, independent of each other, with common conditional distribution $P^*(L_1 = \lceil b\rceil) = 1 - P^*(L_1 = \lfloor b\rfloor) = b - \lfloor b\rfloor$. We can also write $L_i = \lfloor b\rfloor + J_i$, where $J_i \mid \mathbf{X}_n$ follows a Bernoulli distribution with parameter $p = b - \lfloor b\rfloor$. Let N be the random variable giving the number of blocks with block size ⌊b⌋. Given N, we can calculate the number of blocks with block size ⌈b⌉, denoted by g(N):

$$g(N) = \left\lfloor \frac{n - N\lfloor b\rfloor}{\lceil b\rceil} \right\rfloor.$$

Let us denote the remainder block size by r(N); it can be calculated from the others: $r(N) = n - N\lfloor b\rfloor - g(N)\lceil b\rceil$. It can be seen that the conditional covariance matrix of the bootstrap mean can be calculated in the following way:

$$\begin{aligned} \mathrm{Cov}^*\big(\overline{X}^*_b\big) ={}& \frac{\lfloor b\rfloor^2}{n^2}\Big[\mathrm{Cov}^*\big(\overline{X}^*_{\lfloor b\rfloor,i}\big)\,E^*(N) + \mathrm{Cov}^*(N)\,\overline{X}_n\overline{X}_n^T\Big] \\ &+ \frac{\lceil b\rceil^2}{n^2}\Big[\mathrm{Cov}^*\big(\overline{X}^*_{\lceil b\rceil,i}\big)\,E^*\big(g(N)\big) + \mathrm{Cov}^*\big(g(N)\big)\,\overline{X}_n\overline{X}_n^T\Big] \\ &+ \frac{1}{n^2}\Big[\sum_{i=0}^{\lfloor b\rfloor} i^2\,P^*\big(r(N)=i\big)\,\mathrm{Cov}^*\big(\overline{X}^*_{i,1}\big) + \mathrm{Cov}^*\big(r(N)\big)\,\overline{X}_n\overline{X}_n^T\Big]. \end{aligned}$$
We have to mention that the Politis and White algorithm actually gives a real number, not an integer, as the optimal block size; this could be used without any rounding by our proposed method. In the original paper, the algorithm was tested on data simulated from an AR(1) process and gave fair values for the optimal block length. However, when we tried to use this method to estimate the optimal block length for some meteorological data (wind speed and precipitation), the algorithm gave excessively large optimal block lengths.
Note as well that the type of block length best suited to the block bootstrap method depends on the inference problem (e.g. variance estimation or testing). There are two general strategies for block selection that can be applied to problems such as the homogeneity testing problem we consider here, based on either subsampling or a non-parametric plug-in approach. We think that these approaches can be modified for the proposed generalised block bootstrap, and we plan to investigate this possibility in a separate, more theoretical paper.
Algorithm for calculating p value for homogeneity test of copulas
As the limit distribution of the statistic Sn,m is not distribution free, a simulation algorithm is needed to obtain critical values. The algorithm is the following:

1. Compute the statistic Sn,m, based on the original samples.
2. Generate B generalised bootstrap samples $\mathbf{X}^{*(i)}_n = (X^{*(i)}_1, \ldots, X^{*(i)}_n)^T$, $i = 1, 2, \ldots, B$, from the first observation vector of size n.
3. Compute the statistics $S^{*(i)}_{n,m}$, $i = 1, 2, \ldots, B$, based on the bootstrap samples and the original second sample of length m.
4. Compute the p value: $\dfrac{1 + \#\{S^{*(i)}_{n,m} \ge S_{n,m}\}}{B + 1}$.
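The p-value step with the usual "+1" correction can be written compactly; the following is an illustrative numpy sketch with toy inputs (the function name is ours).

```python
import numpy as np

def bootstrap_p_value(s_obs, s_boot):
    """p value with the +1 correction: (1 + #{S* >= S_obs}) / (B + 1)."""
    s_boot = np.asarray(s_boot)
    return (1 + np.sum(s_boot >= s_obs)) / (len(s_boot) + 1.0)

# toy example: an observed statistic in the far upper tail of the
# bootstrap distribution yields a small p value
rng = np.random.default_rng(6)
s_boot = rng.chisquare(3, size=999)
print(bootstrap_p_value(20.0, s_boot))
```

The correction guarantees that the p value is strictly positive and never exceeds 1, even when no bootstrap replicate reaches the observed statistic.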
Guaranteeing the homogeneity of copulas in the context of bootstrap approaches is still an unsolved problem in general. Traditional bootstrap approaches have been claimed to be inconsistent for Cramér–von Mises statistics. The multiplier method, which addresses the latter problem, is consistent but numerically costly, and thus practically not applicable to sample sizes of order 1000, as considered in the present work. A general bootstrap algorithm leading (under weak assumptions) to a consistent bootstrap statistic when testing the homogeneity of the marginal distributions of a k-dimensional random variable has been introduced, and a special bootstrap method for a Cramér–von Mises test of the homogeneity of two distributions, with a consistent limit distribution, has also been suggested. However, these two recent developments are not directly transferable to the problem of testing the homogeneity of two copulas. In the case of the VAR(1) models studied in the present work, numerical results indicate the homogeneity of the copulas under the employed generalised block bootstrap.
Simulations
In this section we present some properties of the copula homogeneity test
obtained via simulations, strongly focusing on a specific VAR(p) process,
arising in Sect. 6.
By using the bootstrap methodology, we can investigate the significance level of the homogeneity test. Our simulations indicate that the test is consistent for each block size and each relevant time series model. However, we find that for VAR(p) processes the distribution of the test statistic is different if the first sample is generated via block bootstrap simulations. We illustrate this with the VAR(1) process with parameters

$$A = \begin{pmatrix} 0.097 & 0.216 \\ -0.103 & 0.403 \end{pmatrix} \qquad \text{and} \qquad C = \begin{pmatrix} 0.449 & 0.406 \\ 0.406 & 0.436 \end{pmatrix};$$

let us call the VAR(1) process simulated with these parameters the "Budapest and Apatovac process", as these A and C matrices are the VAR coefficient estimates for the temperature data pair of Budapest and Apatovac, as described in Sect. 6. Table shows that the empirical mean and the 0.9, 0.95 and 0.99 empirical quantiles of the test statistic are substantially greater if the first sample is bootstrapped, but essentially unaffected by the block size. Thus we found that the distribution of the bootstrapped test statistic is not the same as that of the statistic without bootstrapping. This is an interesting result, worthy of further investigation. Therefore, the reference distribution for our hypothesis testing was the empirical distribution obtained from the generalised block bootstrap procedure, which depends on the underlying stochastic process and also on the sample size.
Simulated quantiles of the homogeneity test statistic for the VAR(1)
process approximating the temperature data close to Budapest and Apatovac with
sample size n=1000. The first row presents the simulation results without
using the bootstrap.
Now we estimate the power of the proposed homogeneity test. We take samples from the VAR(1) process of Budapest and Apatovac; these are the fixed, H0 samples. As the alternative (H1) hypothesis, we chose several models: VAR(p) models with other parameters, representing stronger dependence; i.i.d. bivariate normal samples; MA(4) models; and mixtures of MA(4) models. In each case, the power of the test seems to converge to 100% as the sample size increases. For example, Table shows the power for different block sizes when the alternative hypothesis is a VAR(1) process with parameters

$$\tilde{A} = \begin{pmatrix} -0.1 & 0.3 \\ -0.8 & 0.9 \end{pmatrix} \qquad \text{and} \qquad \tilde{C} = \begin{pmatrix} 0.449 & 0.406 \\ 0.406 & 0.436 \end{pmatrix}.$$

The block size does not have a big effect on the power. We have to mention that, although the numbers indicate consistency, the last digit of the powers in the table may not be accurate. We took 100 samples with 100 bootstrap replicates each (let us call these 100×100 simulations), and when we ran another 100×100 simulations the powers sometimes differed by several percentage points. This was especially true at the 99% significance level; therefore, in the following we do not show the simulations for this level.
The power of the homogeneity test (%). Null hypothesis: the
VAR(1) process of Budapest and Apatovac for different block sizes and sample
sizes (n). Alternative hypothesis: also a VAR(1) process, but with other
parameters, representing stronger dependence.
As we shall see in the next section, for the data pair Budapest and Apatovac the block size of 8.71 will be a good choice. Table shows that the tests are consistent for this block length as well, using 100×100 simulations. It is important to note that these results are power values for a specific null and alternative hypothesis (as is typical for non-parametric statistical tests).
The power of the homogeneity test (%). Null hypothesis: the VAR(1)
process of Budapest and Apatovac (H0) for block size 8.71; against the
VAR(1) process simulated with coefficient matrices Ã
and C̃ as alternative hypothesis.
We also simulated the test statistic for other models: if the reference H0 sample is
i.i.d. bivariate normal distributed;
an MA(4) process;
a stationary GARCH(1,1) process.
In the latter two cases, we simulated samples for two settings, in which the coordinates were independent and correlated, respectively. The alternative hypothesis varied from test to test; for example, if the H0 sample was i.i.d. bivariate normal, we chose as H1 another i.i.d. bivariate normal process and a VAR(1) process. Each homogeneity test proved to be consistent. Table shows the power of the homogeneity test when the coordinates of the reference sample are a realisation of a GARCH(1,1) process with parameters 0.001, 0.028 and 0.97, typical values when exchange rates are modelled with a GARCH(1,1) process. The alternative hypothesis was a VAR(1) process. In this case the fractional block size had a somewhat stronger effect on the power of the test, and the block size 11.35 proved to be the most powerful. This was the only example in which we found a specific block size that essentially maximised the power of our homogeneity test.
The power of the homogeneity test (%). Null hypothesis: GARCH(1,1)
process with different block sizes, with sample size n=100. Alternative
hypothesis: VAR(1) process simulated with coefficient matrices A
and C.
The observations comprise 63 years of daily temperature data from the European Climate Assessment (E-OBS; http://www.ecad.eu). The methodology for deriving the data at the grid points has been published previously, and this database has been used extensively for climate analysis. We have worked with the part of the 0.5° grid (available for the whole of Europe and northern Africa) that lies in the Carpathian Basin. Figure depicts the grid points used. For later reference, we chose the grid point next to Budapest, one grid point in the neighbourhood of the Hungarian capital and four further grid points lying far from Budapest, in different directions. The quality of the data has been evaluated elsewhere and turned out to be reliable for most of central Europe. As the grid points we used belong to the Carpathian Basin, this validates our data.
The map of the Carpathian Basin with the used grid points.
As we intend to use models suitable for stationary data, stationarity first had to be ensured. We first subtracted the smoothed daily averages from the observations. The smoothing was carried out by loess regression; Fig. a and b depict the daily averages and standard deviations of the 63-year data and the smoothing regression line for the grid point near Budapest. It turned out that the second-order stationarity assumption was still far from being satisfied (in winter the variances were substantially larger than in summer), so we divided the observations by the smoothed estimated standard deviation for the given day:

$$\tilde{x}_{t,n} = \frac{x_{t,n} - m_t}{s_t},$$

where $\tilde{x}_{t,n}$ is the standardised value for day t in year n, based on the original observation $x_{t,n}$ for the same day, the smoothed average $m_t$ and the smoothed standard deviation $s_t$. Figure c shows the original daily observations and the standardised data between 1 January 2010 and 31 December 2012 for the grid point near Budapest.
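The standardisation step can be illustrated on synthetic data. In the sketch below (numpy), a centred moving average on the periodic daily climatology stands in for the loess smoother used in the paper; this substitution is an assumption of the example, not the paper's method.

```python
import numpy as np

def smooth_circular(y, window=31):
    """Centred moving average on a periodic (annual) sequence; a simple
    stand-in for the loess smoother used in the paper."""
    k = window // 2
    ext = np.concatenate([y[-k:], y, y[:k]])   # periodic extension
    kernel = np.ones(window) / window
    return np.convolve(ext, kernel, mode="valid")

rng = np.random.default_rng(7)
doy = np.arange(3650) % 365                    # day of year for 10 toy "years"
temps = 10 + 12 * np.sin(2 * np.pi * doy / 365) + rng.standard_normal(3650)

# daily climatological mean and standard deviation, then smooth both
m = np.array([temps[doy == d].mean() for d in range(365)])
s = np.array([temps[doy == d].std(ddof=1) for d in range(365)])
m_s, s_s = smooth_circular(m), smooth_circular(s)

# standardise: x_tilde = (x - m_t) / s_t
x_tilde = (temps - m_s[doy]) / s_s[doy]
print(x_tilde.mean(), x_tilde.std())
```

After standardisation the series should have mean near 0 and standard deviation near 1, with the annual cycle in both mean and variance removed.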
(a) The daily averages (annual cycle) and the smoothing
loess regression; (b) the daily standard deviations (annual cycle)
and the smoothing loess regression; (c) the original and the
standardised data for the grid point near Budapest.
Results of the tests checking for serial dependence among the estimated residuals (p values), together with the order of the VAR model chosen by the Akaike criterion.
Pairs of selected grid points | Optimal order | Ljung–Box | Breusch–Godfrey | Genest–Rémillard–Kojadinovic–Yan
Budapest and Sopron | 3 | 0.271 | 0.324 | 0.259
Budapest and Apatovac | 1 | 0.080 | 0.463 | 0.120
Budapest and Zaránd Mountains | 9 | 0.174 | 0.017 | 0.149
Budapest and Nyíregyháza | 4 | 0.174 | 0.474 | 0.258
Budapest and Püspökhatvan | 4 | 0.155 | 0.539 | 0.276
In order to reduce the strong serial dependence, we finally computed the 10-day averages of the x̃ values. As there are no outliers in the temperature data (see Fig. c) and the series is nearly normal, the mean was chosen as the most suitable summary function. There is a slight but significant upward linear trend in the data, but we did not remove it, as one of our main aims was to detect changes in the dependencies between the investigated sites, and these should be based on the original (standardised) deviations, as constructed above.
(a) The trace of the covariance matrix of the mean for two
selected grid points near Budapest and Sopron; (b) the trace of the
covariance matrix of the mean for two selected grid points near Budapest and
Apatovac (first half of the sample).
Optimal block length for the first half of the samples for the five selected pairs of grid points.
Pairs of selected grid points | n⋅tr Cov(X̄_VAR) | Optimal block size | Number of iterations
Budapest and Sopron | 1.759 | 13.61 | 2
Budapest and Apatovac | 1.805 | 8.71 | 2
Budapest and Zaránd Mountains | 4.427 | 55.42 | 7
Budapest and Nyíregyháza | 1.879 | 9.88 | 3
Budapest and Püspökhatvan | 2.140 | 15.68 | 3
p values of the copula homogeneity test if the first/second half of the sample is bootstrapped, based on 10^4 simulations.

Pairs of selected grid points | First half bootstrapped | Second half bootstrapped
Budapest and Sopron | 0.064 | 0.047
Budapest and Apatovac | 0.028 | 0.012
Budapest and Zaránd Mountains | 0.034 | 0.042
Budapest and Nyíregyháza | 0.116 | 0.062
Budapest and Püspökhatvan | 0.848 | 0.751
The pseudo-observations of Budapest and Apatovac for the first and
the second half of the 10-day averages of the standardised observations.
In the next step, we examined the fixed grid point near Budapest paired with
other grid points of the database. Using the Akaike information criterion, we
chose the orders of the most appropriate vector autoregression to model our
data pairs. Despite the adjusted R^2 values being rather low (around
10 %), the Ljung–Box Q test, the Breusch–Godfrey test and the test of
could not detect any remaining serial dependence beyond that captured by
the VAR model. Table contains these results.
Our main goal is to detect if there is a significant change in the dependence
structure of the data. We separated the data for each pair of grid points
into two parts: the first part corresponds to the first 31.5 years of
observations and the second part to the last 31.5 years. For five selected
pairs of grid points, we wanted to test the null hypothesis that the copula
of the first half of the sample equals the copula of the second half. We
have tested independence by the Cramér–von Mises type test of
, included in the R package copula, and the result was
clear – the test could not detect any dependence (neither for the
complete halves, nor for shorter adjacent subsequences with sample sizes
from 100 to 1000).
Table and Fig. depict the optimal block lengths
obtained from solving Eq. (). The second column of Table
and the red line of Fig. show the trace of the covariance
matrices of the mean, calculated from the fitted VAR models, multiplied by
1186, half of the original sample size. In Fig. a it can be
seen that the traces of the covariance matrix of the mean at the integer
block sizes 13 and 14 are 17.8 and 18.05 (green dashed lines), which are
quite far from the red line. Figure b shows that the trace is not
monotonic in the neighbourhood of the optimum, so we have to be cautious when
we search for the optimal block size, because Eq. () can have
multiple solutions. The trace function (black line in Fig. ) is
always continuous but not necessarily differentiable; this follows from the
construction of our generalised block bootstrap method, since the formula in
Eq. () involves the lower and upper integer parts of b.
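One natural way to realise a real-valued expected block size b, consistent with the appearance of the lower and upper integer parts of b, is to draw each block with length ⌊b⌋ or ⌈b⌉, with probabilities chosen so that the expected length is exactly b. The following minimal sketch (our own circular-scheme illustration with a hypothetical function name, not the paper's exact algorithm) shows the idea:

```python
import numpy as np

def generalised_block_bootstrap(x, b, seed=None):
    """Circular block bootstrap with real-valued expected block size b >= 1:
    each block has length floor(b) or ceil(b), drawn at random so that the
    expected block length equals b exactly."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    lo, hi = int(np.floor(b)), int(np.ceil(b))
    p_hi = b - lo                      # P(block length = hi); 0 if b is an integer
    out = []
    while len(out) < n:
        length = hi if rng.random() < p_hi else lo
        start = rng.integers(n)        # circular (wrap-around) block start
        out.extend(x[(start + np.arange(length)) % n])
    return np.asarray(out[:n])         # trim the last, possibly partial block
```

For an integer b this reduces to the ordinary circular block bootstrap, since the two candidate lengths coincide.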
Generally, we noticed that the smaller the block size, the closer the trace
function is to being monotonic. This phenomenon can be explained by the
expansion of Cov*(X‾b*) in formula
(): if the block size is small relative to the sample size, then the
first and second terms dominate the third one, which contains the effect of
the remainder block size. We obtained a rather small optimal block size of
8.71 for the pair Budapest and Apatovac, and a much larger value of 55.42
for Budapest and Zaránd Mountains. This reflects the fact that the optimal
lag order of the fitted VAR model was much higher for the pair Budapest and
Zaránd Mountains. For the five selected pairs, at most seven iterations were
needed to solve Eq. (). We have to mention that there exist some
pairs of grid points – especially in the southern part of the Carpathian
Basin – for which Eq. () is not solvable.
The last step was conducting the copula homogeneity test described in
Sect. . Using the optimal block size, we generated bootstrap
samples via the generalised block bootstrap method for the first half of the
sample. With the empirical copulas of these bootstrap samples and the
empirical copula of the second part of the original sample, we can calculate
the test statistic Sn,m. In our case n=m=1186. In order to get
accurate p values, we used 10^4 repetitions. This procedure can also be
conducted in reverse: we may generate bootstrap samples from
the second half of the sample and fix the first half. Table
contains the results of the homogeneity test for both versions. The
dependence structure proved to be significantly different between the two
halves for the pairs Budapest and Apatovac and Budapest and Zaránd Mountains.
Figure depicts the 10-day averages of the standardised
observations and their copula for the pair Budapest and Apatovac. We can see
that the pseudo-observations of the two halves visibly differ, and the test
also detected a significant difference between the two copulas.
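The Cramér–von Mises-type comparison of two empirical copulas can be sketched as follows. This is our own simplified Python illustration: the function names, the rank-based pseudo-observations and the evaluation on the pooled points are standard choices, but the exact normalisation of Sn,m used in the paper may differ.

```python
import numpy as np

def pseudo_obs(x):
    """Rank-based pseudo-observations: ranks rescaled to (0, 1).
    Ties are broken arbitrarily by argsort."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ranks = np.empty_like(x)
    for j in range(x.shape[1]):
        ranks[np.argsort(x[:, j]), j] = np.arange(1, n + 1)
    return ranks / (n + 1)

def empirical_copula(u, points):
    """Empirical copula of pseudo-observations u (n x d), evaluated at `points`."""
    return np.mean(np.all(u[:, None, :] <= points[None, :, :], axis=2), axis=0)

def cvm_copula_statistic(x, y):
    """Cramér–von Mises-type distance between the empirical copulas of two
    samples, evaluated on the pooled pseudo-observations."""
    u, v = pseudo_obs(x), pseudo_obs(y)
    pooled = np.vstack([u, v])
    n, m = len(u), len(v)
    diff = empirical_copula(u, pooled) - empirical_copula(v, pooled)
    return n * m / (n + m) * np.mean(diff ** 2)
```

The bootstrap p value is then obtained as the proportion of bootstrap replicates of one half whose statistic (against the fixed other half) exceeds the observed value.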
Conclusions
In this paper, we have used the bootstrap for determining the p values of a
homogeneity test for copulas, but the block bootstrap approach is applicable
far more generally. Our results underline that block size determination is
not yet a completely solved question, despite the available asymptotic
results: for finite samples and for different statistical inference or
testing problems, the optimal choices may differ substantially.
We can summarise our findings as follows.
First, we proposed a simple generalisation of the block bootstrap
methodology, which fits naturally into the existing algorithms and which
overcomes the problem of discreteness in the usual block size. The proposed
generalised block bootstrap can easily be applied to any other problem where
the block size plays an important role, since all block-length selection
algorithms return a real number as the estimated block size.
Second, we have found some significant changes in the dependence structure
between the standardised temperature values of pairs of stations within the
Carpathian Basin. The direction of this change may be worth further
investigation, as this may lead to a better understanding of the recent
changes in our climate.
It is an interesting open question to which models and inference problems
the proposed block size selection method – based on functions of the
variances – can be successfully applied. We have checked by simulation that
the method can be applied to the specific VAR models described in our
article. For non-linear time series we might need more observations to
obtain a similarly reliable fit. As a general comment on the use of
bootstrap methods, we have seen cases where the block size did not play an
important role, but in our opinion this is the exception rather than the
rule. Choosing a suboptimal block size may somewhat decrease the accuracy of
the applied method, but not using any type of block bootstrap may distort
the results completely, as quite a few of the references have already
demonstrated.
The E-OBS dataset is regularly refreshed. The most
up-to-date version is 15.0, dated June 2017. However, at the time of
writing, version 12.0 was the most recent; it is still available from
the webpage http://www.ecad.eu/download/ensembles/oldversions.php.
The authors declare that they have no conflict of
interest.
Acknowledgements
We acknowledge the E-OBS data set from the EU-FP6 project ENSEMBLES
(http://ensembles-eu.metoffice.com) and the data providers in the
ECA&D project
(http://www.ecad.eu).
Edited by: R. Donner
Reviewed by: three anonymous referees
References
Brockwell, P. J. and Davis, R. A.: Time series: theory and methods, Springer
Science & Business Media, 2013.
Bücher, A. and Volgushev, S.: Empirical and sequential empirical copula
processes under serial dependence, J. Multivariate Anal., 119,
61–70, 2013.
Chernick, M. R.: Bootstrap methods: A guide for practitioners and
researchers, vol. 619, John Wiley & Sons, 2011.
Cong, R.-G. and Brady, M.: The interdependence between rainfall and
temperature: copula analyses, The Scientific World Journal, 405675,
10.1100/2012/405675, 2012.
Efron, B.: Bootstrap methods: another look at the jackknife, Ann.
Stat., 7, 1–26, 1979.
Farook, A. J. and Kannan, K. S.: Climate Change Impact on Rice Yield in
India–Vector Autoregression Approach, Sri Lankan Journal of Applied
Statistics, 16, 161–178, 2016.
Genest, C. and Rémillard, B.: Test of independence and randomness based
on
the empirical copula process, Test, 13, 335–369, 2004.
Genest, C., Quessy, J.-F., and Rémillard, B.: Goodness-of-fit procedures
for copula models based on the probability integral transformation,
Scand. J. Stat., 33, 337–366, 2006.
Hall, P., Horowitz, J. L., and Jing, B.-Y.: On blocking rules for the
bootstrap with dependent data, Biometrika, 82, 561–574, 1995.
Haylock, M., Hofstra, N., Klein Tank, A. M. G., Klok, E. J., Jones, P. D.,
and New, M.: A European daily high-resolution gridded data set of surface
temperature and precipitation for 1950–2006, J. Geophys.
Res.-Atmos., 113, D20, 10.1029/2008JD010201, 2008.
Hill, D., Bell, K. R. W., McMillan, D., and Infield, D.: A vector
auto-regressive model for onshore and offshore wind synthesis incorporating
meteorological model information, Adv. Sci. Res., 11, 35–39,
10.5194/asr-11-35-2014, 2014.
Hoeffding, W.: Massstabinvariante Korrelationstheorie, Teubner, 1940.
Hofstra, N., Haylock, M., New, M., and Jones, P. D.: Testing E-OBS European
high-resolution gridded data set of daily precipitation and surface
temperature, J. Geophys. Res.-Atmos., 114, D21, 10.1029/2009JD011799,
2009.
Huang, Q. and Jing, P.: Cramer-Von Mises Statistics for Testing the Equality
of
Two Distributions, in: Frontier and Future Development of Information
Technology in Medicine and Education, Springer, 93–101, 2014.
Kojadinovic, I. and Holmes, M.: Tests of independence among continuous random
vectors based on Cramér–von Mises functionals of the empirical copula
process, J. Multivariate Anal., 100, 1137–1154, 2009.
Kojadinovic, I. and Yan, J.: Tests of serial independence for continuous
multivariate time series based on a Möbius decomposition of the
independence empirical copula process, Ann. I. Stati.
Math., 63, 347–373, 2011a.
Kojadinovic, I. and Yan, J.: A goodness-of-fit test for multivariate
multiparameter copulas based on multiplier central limit theorems, Stat.
Comput., 21, 17–30, 2011b.
Lahiri, S. N.: Resampling methods for dependent data, Springer Science &
Business Media, 2003.
Lahiri, S. N., Furukawa, K., and Lee, Y.-D.: A nonparametric plug-in rule for
selecting optimal block lengths for block bootstrap methods, Statistical
Methodology, 4, 292–321, 2007.
Martínez-Camblor, P., Carleos, C., and Corral, N.: Cramér-von mises
statistic for repeated measures, Revista Colombiana de Estadística, 37,
45–67, 2014.
Nelsen, R. B.: An introduction to copulas, Springer Science & Business
Media,
2007.
Nordman, D. J. and Lahiri, S. N.: Convergence rates of empirical block length
selectors for block bootstrap, Bernoulli, 20, 958–978, 2014.
Norrulashikin, S. M., Yusof, F., and Kane, I. L.: An Investigation towards
the Suitability of Vector Autoregressive Approach on Modeling Meteorological
Data, Modern Applied Science, 9, 89–100, 2015.
Patton, A., Politis, D. N., and White, H.: Correction to “Automatic
block-length selection for the dependent bootstrap” by D. Politis and H.
White, Economet. Rev., 28, 372–375, 2009.
Politis, D. N. and Romano, J. P.: The stationary bootstrap, J.
Am. Stat. Assoc., 89, 1303–1313, 1994.
Politis, D. N. and White, H.: Automatic block-length selection for the
dependent bootstrap, Economet. Rev., 23, 53–70, 2004.
Rakonczai, P., Varga, L., and Zempléni, A.: Copula fitting to
autocorrelated
data with applications to wind speed modelling, Annales Universitatis
Scientarium de Rolando Eotvos Nominatae, Sectio Computatorica, 43, 3–20,
2014.
Rémillard, B. and Scaillet, O.: Testing for equality between two copulas,
J. Multivariate Anal., 100, 377–386, 2009.
Schölzel, C. and Friederichs, P.: Multivariate non-normally distributed
random variables in climate research – introduction to the copula approach,
Nonlin. Processes Geophys., 15, 761–772, 10.5194/npg-15-761-2008, 2008.
Shumway, R. H. and Stoffer, D. S.: Time series analysis and its applications
with R examples, Springer Science & Business Media, 2011.
Sims, C. A.: Macroeconomics and reality, Econometrica: Journal of the
Econometric Society, 1–48, 1980.