Estimating trends in the global mean temperature record

Given uncertainties in physical theory and numerical climate simulations, the historical temperature record is often used as a source of empirical information about climate change. Many historical trend analyses appear to deemphasize physical and statistical assumptions: examples include regression models that treat time rather than radiative forcing as the relevant covariate and time series methods that account for internal variability nonparametrically. However, given a limited record and the presence of internal variability, estimating radiatively forced historical temperature trends necessarily requires assumptions. Ostensibly empirical methods can involve an inherent conflict in assumptions: they require data records that are short enough for naive trend models to apply but long enough for internal variability to be accounted for. In the context of global mean temperatures, methods that deemphasize assumptions can therefore produce misleading inferences, because the twentieth century trend is complex and the scale of correlation is long relative to the data length. We illustrate how a simple but physically motivated trend model can provide better-fitting and more broadly applicable trend estimates and can address a wider array of questions. The model allows one to distinguish, within a single framework, between uncertainties in the shorter-term versus longer-term response to radiative forcing, with implications not only on historical trends but also on uncertainties in future projections. We also investigate the consequence on inferred uncertainties of the choice of a statistical description of internal variability. While nonparametric methods may seem to avoid making explicit assumptions, we demonstrate how even misspecified parametric methods, if attuned to important characteristics of internal variability, can result in more accurate statements about trend uncertainty.


Introduction
The physical basis of climate change is understood through a combination of theory, numerical simulations, and analyses of historical data. Climate change is driven by radiative forcing, a change in net radiation (downwelling minus upwelling, often specified at the top of atmosphere) resulting from an imposed perturbation of a climate in equilibrium, for example by increasing the atmospheric concentration of a greenhouse gas. The Earth's response to forcing is complex and not fully understood, in part due to physical uncertainties in important feedbacks such as cloud responses (see the assessment reports of the Intergovernmental Panel on Climate Change (IPCC), e.g., IPCC (2013)).
Given the physical uncertainties inherent in all climate simulations, the observed temperature record since the late nineteenth century is often used as a source of empirical information about the Earth's systematic response to forcing. (Figure 1 shows one estimate of annually averaged global mean surface temperatures from the past 136 years, along with estimates of radiative forcings from various constituents during that period, with the data sources described in Section 2.) Analysis of the observed temperature record is complicated, however, by the short available record of direct measurements, by uncertainties in the historical radiative forcings themselves, and by the internal temperature variability that exists even in the absence of forcing. Statistical methods are therefore required to quantify the information in the historical record about the response to forcing: given the data, what do we know about how global temperatures have warmed in response to forcing, how much warming can we expect in plausible future forcing scenarios, and how do we expect uncertainties to change as we continue to observe the Earth's temperatures?
It can be helpful to divide approaches to using the observed temperature record to understand aspects of mean climate change into two categories. One common approach involves assuming a physical model of the system. (Here the term "model" encompasses anything from a simple energy balance model to a very complicated atmosphere-ocean general circulation model (GCM)). Analysis then may involve estimating statistical parameters in order to best fit the observed record. Estimated parameters have statistical uncertainty because of the finite observational record and the internal variability inherent in the climate system, even were the model a perfect representation of reality. Analyses of observed temperatures using simple or moderately complex physical models include Wigley et al. (1997), Gregory et al. (2002), Knutti et al. (2002), Forest et al. (2002), Forest et al. (2006), Gregory and Forster (2008), Padilla et al. (2011), Aldrin et al. (2012), Otto et al. (2013), Rypdal and Rypdal (2014), Masters (2014), Lewis and Curry (2015), Zeng and Geil (2016), and many others. Analyses using GCMs are summarized in IPCC (2013) Chapter 10 and references therein.
Another category of approaches involves analyses that are more empirical and appear to deemphasize assumptions about the underlying physics generating the observed temperatures. Many studies use regression models that treat time rather than radiative forcing as the covariate. This practice is often used, for example, to test for significant warming (e.g., Bloomfield (1992), Smith and Chen (1996), Løvsletten and Rypdal (2016), and others) or for changes in warming trends (e.g., Foster and Rahmstorf (2011), Cahill et al. (2015), Rajaratnam et al. (2015), and others); in general, regressions in time are widespread in the literature (see also IPCC (2013), especially Chapter 2 Box 2.2, and references therein).
In both categories, analyses require characterizing internal variability for the purpose of quantifying uncertainty, a task that also involves assumptions. A typical approach is to Bottom left, estimates of (top-of-atmosphere) effective radiative forcings from different constituents over this time period (the "other" category includes O 3 , H 2 O, black carbon, contrails, and land use changes). Right, temperature anomalies vs. net radiative forcings (top), vs. anthropogenic forcings (middle), and vs. natural forcings (bottom). Data sources are described in Section 2. Despite the correlation in the plots of temperature vs. radiative forcing, temperatures will depend on the full past trajectory of radiative forcings in a potentially complex way, as we discuss in Section 3. assume a statistical model for the dependence structure of the noise, such as assuming an autoregressive moving average (ARMA) noise model with a small number of parameters, which is fit to the residuals of whatever trend model is being used. Some authors, however, argue for nonparametric (resampling or subsampling) methods for time series rather than parametric approaches (e.g., Gluhovsky (2011) and Rajaratnam et al. (2015)) 1 . The argument is again that these approaches are advantageous because they are ostensibly objective and require fewer assumptions. However, methods that deemphasize assumptions, be they physical or statistical, can be problematic in the climate setting. While regressions in time are simple to apply and do not appear to make explicit assumptions about how temperatures should respond to forcing, these models both limit what can be learned from the data and can result in misleading inferences. Regressions in time are sensitive to arbitrary choices (such as the start and end date of the data analyzed), cannot be expected to apply over even modestly long timeframes, and cannot in general reliably separate forced trends from internal variability. Furthermore, in accounting for internal variability, nonparametric methods for time series often require long data records to work well, and can be seriously uncalibrated in data-limited settings with strong temporal correlation, such as the setting we are discussing.
In the following, we illustrate two primary points. First, we show that targeted parametric mean models that incorporate even limited physical information can provide better fitting, more interpretable, and more illuminating descriptions of the systematic response of interest compared to approaches that deemphasize assumptions. Second, we show that parametric models for residual (i.e., internal) variation can provide for safer and more accurate uncertainty quantifications in this setting than do approaches that deemphasize assumptions, even if the parametric model is misspecified, as long as the parametric modeling is done with particular attention towards the representation of low-frequency internal variability. We believe that the analysis that we present is informative, even if not maximally so, and we attempt to highlight both complications with our analysis as well as important sources of information about global warming that are ignored in our approach. Parts of our analysis share similarities with others listed above, especially with Rypdal and Rypdal (2014) and Zeng and Geil (2016). In distinction to previous papers, here we are primarily interested in contrasting what can be learned using physically motivated models versus with those that de-emphasize assumptions, as well as in emphasizing the role that accounting for internal variability plays in inferring uncertainties in mean trends. Our goals are to indicate directions in which statisticians can incorporate explicit modeling to positive effect and to highlight what we view are some of the important sources of uncertainty and information in this problem.
This article is organized as follows. In Section 2, we introduce the data sources used in our analysis. In Section 3, we provide some background on modeling the historical global mean temperature record and contrast a simple, minimally physically informed model with a more empirical approach. In Section 4, we highlight insights that can be gained from the more informed approach, with an emphasis on probing different aspects of uncertainty in trends. In Section 5, through synthetic simulations, we compare the performance of various 1 In this work we consider a parametric method to be any that makes explicit assumptions about the functional form of the probability distribution that the data come from, described with a finite number of statistical parameters. In particular, for the purposes of this work, we consider the class of low-order ARMA models to be a parametric class. We consider a nonparametric method to be one that attempts to make fewer distributional assumptions about the data and does not involve a parametrized statistical model. parametric and nonparametric methods of uncertainty quantification in the presence of temporal correlation in settings similar to that of the historical temperature record. In Section 6, we give some concluding remarks.

Background and data
This analysis requires estimates of historical global mean temperatures and radiative forcings. To the extent that we are interested in how temperatures may evolve in the future (and how uncertainty in the response to radiative forcing evolves as more data is observed), we also need radiative forcings associated with a plausible future scenario.
Temperature. We use the Land-Ocean Temperature Index from the NASA Goddard Institute for Space Studies (GISS) (Hansen et al., 2010) in our primary analysis. The index combines land and sea surface temperature measurements to estimate annual average global mean surface temperature anomalies (relative to a base period from 1951-1980), extending from the year 1880 to the present (comprising 136 years in total). Any dataset of global mean temperature anomalies represents an estimate of that quantity and is subject to some uncertainty. Sources of uncertainty include the spatial coverage of the network of measurements, interpolation schemes used to estimate temperatures at unobserved locations, methods used to incorporate different sources of data (e.g., land-vs. satellite-vs. surface buoy-vs. ship-based data), and instrumental errors. Sources of uncertainty in the GISS dataset are discussed in Hansen et al. (2010). NASA GISS has made some attempt to provide pointwise uncertainty estimates for their data (e.g., Figure 9(a) of Hansen et al. (2010)), but it is important to realize that errors will be correlated in time. That said, uncertainties in the global mean temperature record are relatively small compared to the changes in temperatures observed over the 20th century.
To evaluate the effects of uncertainty in the temperature record, we repeat a small portion of our analysis using the HadCRUT4 global annual temperature ensemble (Morice et al., 2012), designed for that purpose (Section 4.5). (Some sources of uncertainty are, however, common to both data sources; a brief comparison of these two sources, as well as one from the National Oceanographic and Atmospheric Administration (NOAA), can be found in IPCC (2013), Section 2.4.3.) Radiative forcing. The primary driver of climate change during the historical period is changing atmospheric CO 2 concentrations; radiative forcing due to these changes scales approximately with the logarithm of the ratio of the CO 2 concentration at any given time to its preindustrial level (e.g., Arrhenius (1896) and many others). However, variations in other agents also have non-negligible forcing effects that must be taken into account to interpret the historical record (e.g., aerosols from human sources or volcanoes, other greenhouse gases, etc.). In this work, we aggregate the effects of different forcing agents by using their estimated effective radiative forcing, the radiative imbalance after rapid atmospheric adjustments. These adjustments are intended to partially compensate for differing efficacies of forcing agents. In practice, effective forcings are often treated as though they may be combined additively. For effective radiative forcings from 1750-2011, we use the estimates in IPCC (2013) Table AII.1.2 (Figure 1 only shows the forcings after 1880, but the full available record is used in our analysis). From 2011-2015, we use the global CO 2 concentrations from NOAA and treat radiative forcing from other sources as constant during this period.
While historical concentrations of CO 2 are relatively well known, since CO 2 is wellmixed and long-lived, those of other forcing agents, such as tropospheric aerosols, are more difficult to measure because they are spatially heterogenous and short-lived. Uncertainties in forcings associated with tropospheric aerosols are important because aerosol effects can be negatively confounded with greenhouse gas effects (see Figure 1, bottom left). Generally, uncertainties vary by constituent, as does the extent to which the estimates are derived from model output versus observations; see IPCC (2013) Chapter 8. The focus of this paper is on the information content of the observed temperature record assuming known forcings, but we make a limited attempt to discuss the effect of uncertainty in the forcings (Section 4.5).
For a plausible future radiative forcing scenario, we use the extended Representative Concentration Pathway scenario 8.5 (RCP8.5) Meinshausen et al., 2011), where the change in radiative forcing from the preindustrial level is 8.5 W/m 2 by the year 2100 and levels off at around 12 W/m 2 in the year 2250. In our simulations, we slightly increase (by about 0.07 W/m 2 ) the radiative forcings from the RCP8.5 scenario in the 21st century to match what we take to be the historical value in 2015, and we assume that natural forcings remain constant after 2015.
Ocean heat uptake. The analysis here focuses on information provided solely by the global mean temperature record and assumed known forcings. We do not use additional potential sources of empirical information, including estimates of ocean heat uptake (discussed in, e.g., Forest et al. (2002) and Knutti et al. (2002)). Many empirical analyses of the historical record do incorporate information about ocean heat content (e.g., Gregory et al. (2002), Otto et al. (2013), Masters (2014), and Lewis and Curry (2015)). We compare results with these studies in Section A1.

Modeling trends in the observed global mean temperature record
Evaluating the systematic response of global mean surface temperatures to forcing is complicated by the long timescales for warming of the Earth system. Because the Earth's climate takes time to equilibrate, the near-term (transient or centennial-scale 2 ) climate response will be less than the long-term (equilibrium or millennial-scale) response. The evaluation is also complicated by the fact that historical radiative forcings are not constant but rather evolve in time (e.g., atmospheric CO 2 increases). The physical lags in response imply that the Earth's global mean temperature at any given time depends on the past trajectory of radiative forcings (because the climate does not instantly equilibrate to the present forcing). A common framework is to decompose observed temperatures into two components: a systematic component changing in response to past forcings and a residual component representing sources of internal variability. That is, for global mean temperatures T (t) at time t, where f is an unknown functional of {F (t ); t ≤ t}, the collection of past radiative forcings associated with each forcing agent, and (t) is a residual process that has mean zero and is correlated in time. Here we emphasize in our notation that the systematic response of interest is the response conditional on a given forcing trajectory. The problem, then, is how to estimate the systematic response f using the historical temperatures and forcings.

Regression models in time
Estimating a model like (1) is intractable without additional assumptions. As discussed above, one approach is to resort to physical models. But if instead a more empirical analysis of the observed data is desired, it is common practice to consider a surrogate regression on time itself, as stated in Section 1. The implicit assumption here is that, when viewed as a function of time, the systematic response to historical radiative forcings is approximately linear in time, at least over the considered timeframe: where α and β are unknown parameters and the trend is considered over the interval [t 0 , t 1 ]. The linear time trend approach arguably involves assumptions about the forcing history in addition to the systematic response f : that the forcing itself evolves approximately linearly in time (else the approximation would not be appropriate).
The linear time trend model is widely used, and the general sense is that such a model offers a way of testing for statistically significant changes in mean temperature without having to make physical assumptions and without having to believe that the true forced response is linear in time (e.g. Bloomfield (1992) and Løvsletten and Rypdal (2016)). The IPCC, accounting for the apparent appeal of the linear time trend model, writes that it "is relatively simple, transparent and easily comprehended, and is frequently used in the published research assessed here," (IPCC (2013) Chapter 2, Box 2.2), but suggests that linearity in time can at best be viewed as an approximation expected to hold over a relatively short period of time. It is well understood that neither the observed temperature record nor the forcing history appear to evolve linearly over the full range of the data record ( Figure 1, left), and most users of the linear time-trend approach confine their analysis to only the past few decades.
While the time trend model may be routine to apply, appear objective, and provide a good fit to the data, its use can be precarious. A proper accounting of uncertainty in mean temperature changes relies on distinguishing internal variability from systematic responses. The time trend model is problematic in this respect. If the chosen time interval is short, it can be difficult to distinguish between trends and sources of internal variability that are correlated over longer timescales than the chosen interval (implicitly recognized in, e.g., Easterling andWehner (2009) andSanter et al. (2011) and explicitly discussed in a broader context in Wunsch (1999)). If the chosen interval is long and the systematic trend is actually nonlinear in time, then assuming a linear model in time will shift part of the systematic response to the residual process and can therefore give the impression of excessive internal variability over long timescales (and hence excessive uncertainty in trends).
Because the time trend model cannot be applied over long time intervals for arbitrary forcing scenarios, it also does not have a property that may be considered important for making inferences: that we can learn more about the systematic trend of interest by collecting more observations. There will be only a finite amount of information about the systematic response within the interval [t 0 , t 1 ] (this because sources of internal variability will be positively correlated in time). While this on its own does not invalidate the use of such a model over some narrow time frame, it does mean that what can be learned from the linear time trend model is necessarily limited. More broadly, since the linear time trend model does not map to a physical understanding of the relationship between radiative forcing and global mean temperatures, either during the time interval [t 0 , t 1 ] or extending beyond it, the questions that can be asked with this model are narrow.
Some argue that many of these problems may be overcome by using a model that is nonlinear in time, such as a spline or other nonparametric regression method. (The IPCC, for example, appears to view nonparametric extensions as more generically appropriate than the linear model.) Nonparametric regressions in time will appear to provide an even better fit to the data than the linear trend model, but many of the above arguments carry over to this setting. Such models have limited interpretational value or ability to capture systematic (non-internal) trends, since they cannot generically be expected to distinguish between the systematic trends of interest and other, internal sources of long-timescale variation in the data. Collectively, these arguments suggest that it is advisable to seek better motivated models if one is interested in understanding the systematic response of global temperatures to forcing.

A simple, physically-based model for the centennial-scale response to forcing
A typical approach is to use more complex models, including full GCMs, to explain the systematic response of interest. (Model output is also used in concert with observations in the context of "detection and attribution" studies; see, e.g., Chapter 10 of IPCC (2013).) Some may object to this approach, however, out of a worry that the climate model has already been tuned to match the observed historical temperature trend or is otherwise conditioned on past temperature observations (Knutti, 2008;Huybers, 2010;Mauritsen et al., 2012). There is therefore value in a compromise approach between the linear time trend model and very complex numerical simulations. In this work, we discuss a statistical model that is easy to apply but that encodes some physical intuition for the problem that makes the model interpretable and hopefully applicable over longer time periods. The goal is to show that even simple models incorporating limited physical information can provide more insight about temperature trends and their uncertainties given the observed data than can regression models in time.
A commonly used, very simplified physical model for the response to an instantaneous change in radiative forcing is that temperatures approach their new equilibrium in exponential decay. That is, writing F inst (t) for a step function that changes at time t = 0, where λ is the change in equilibrium temperature, µ 0 is the mean temperature in the baseline state, and ρ controls the rate at which the changes in temperatures approach λ, taking values between zero (instantaneous response time) and one (infinite response time). Equation (3) represents the response function for a linear model of temperature change, so is a natural first approximation for the evolution of temperature in the case of small perturbations from steady state (e.g., MacKay and Ko (1997)). In particular, equation (3) can be interpreted as the solution to a simple energy balance model that makes two assumptions: first, that the equilibrium temperature change is linear in the forcing (i.e., that the linear forcing feedback model holds), and second, that the rate of warming is approximately proportional to the heat uptake. The model is overly simplified because the Earth shows responses at multiple timescales (e.g., Held et al. (2010); Olivié et al. (2012); Geoffroy et al. (2013) and others) but can provide a reasonable approximation for the response on timescales shorter than those associated with full equilibration. In any case, when convolved with a time-varying forcing trajectory, the resulting model for the systematic response is an infinite distributed lag model in the forcing trajectory with weights decaying exponentially (e.g., Caldeira and Myhrvold (2013); Castruccio et al. (2014)). Models based off of (3) have previously been considered for analyses of the observed global mean temperature record (e.g., Rypdal and Rypdal (2014) and Zeng and Geil (2016)).
We also use a model based off of (3) for the systematic temperature response in the observed data: where In model (4), λ A and λ N represent "sensitivities" to anthropogenic and natural forcings, F A and F N , respectively, and have units of degrees Celsius temperature change per forcing change F 2× (the forcing associated with a doubling of atmospheric CO 2 , approximately 3.7 W/m 2 ). The parameter λ A is similar to the equilibrium climate sensitivity 3 , but will be estimated as somewhat lower than that quantity, in part because the proposed model contains only a single timescale of response to anthropogenic forcing (see the Appendix). Caldeira and Myhrvold (2013) and Castruccio et al. (2014) used multiple timescales in modeling longer series from GCM output, but we cannot distinguish these with only 136 years of data and given a smooth past trajectory of anthropogenic forcings. Response timescales to anthropogenic and natural forcings are set by the parameters ρ A and ρ N (taking values between zero and one). Model (4) should approximate temperature trends reasonably well up to centennial timescales, but not at the millennial timescales at which the deep ocean mixes. Our approach differs from those in Rypdal and Rypdal (2014) and Zeng and Geil (2016) in some important ways. Zeng and Geil (2016) assume that net radiative forcing is a constant multiple of CO 2 forcing, treat the ρ parameter as known, and do not account for uncertainty due to internal variations that are correlated in time. Rypdal and Rypdal (2014) suggest replacing the standard exponential decay response used in model (4) with a power-law decay response. The basis for this suggestion is an implicit assumption that we do not make: that the same simplified model for the climate's systematic response to radiative forcing should be used to model residual, internal variability as a response to white noise forcing. More details on our uncertainty quantification for model (4), including our model for internal variability, are given in below in Section 4.
In building model (4), we chose to separate natural and anthropogenic forcings because they seem not strictly comparable (Figure 1, right). First, there may be some evidence that aerosol forcing from volcanic eruptions is less efficacious than CO 2 forcing (Sokolov, 2006;Hansen et al., 2005;Marvel et al., 2015). While we use estimates of effective radiative forcing, which should compensate for efficacy, these estimates do not include an adjustment for the volcanic forcing. Moreover, the timescales of response associated with these forcings may also be different, possibly because ocean heat content responds differently to sudden and/or negative changes in forcing (as produced by volcanic eruptions) compared to more gradual and/or positive changes (as in continued anthropogenic emissions of CO 2 ) (e.g., Gregory and Forster (2008); Padilla et al. (2011)). We combine solar and volcanic forcings out of convenience; the solar forcings do change more rapidly than the anthropogenic forcings, and in any case the magnitude of the changes in solar forcing is small.
To illustrate the use of model (4), we fit it to different segments of the observed global mean surface temperature record and we compare to the linear fits estimated over the same timeframes. Figure 2 shows the resulting fitted models using the data from 1970-2015, 1950-2015, and 1880-2015. The estimated trend from model (4) is relatively insensitive to the timeframe used. The estimated linear time trends, on the other hand, differ markedly using different timeframes, and agree with model (4) only after around 1970, during the time period over which net radiative forcing was evolving approximately linearly in time (see again Figure 1). The sensitivity of the inferred linear trend in global mean temperature to the starting date has been previously discussed (e.g., Liebmann et al. (2010)). These results suggest both that model (4) does indeed capture important aspects of the underlying physical processes driving temperature trends and that it therefore may be used to answer more interesting questions than can the linear time trend model. 4 Trend and uncertainty: what can we learn from applying our simple model to data?
In this section, we illustrate what can be learned by applying the simple model (4) to observed temperatures. To do this, we must introduce an additional model to capture internal variability ( (t) in the assumed true model (1)). We then use our full model to infer the parameters in model (4), to evaluate their uncertainties given the data, and to explore the implications for understanding temperature trends.
To diagnose features of internal variability, spectral analysis is an intuitive framework, since the frequency properties of internal variability are tied to uncertainties in trends: uncertainty in smooth trends is more strongly affected by low-frequency than high-frequency internal variability. Figure 3, left, shows the raw periodogram associated with the residuals from model (4) and a smoothed estimate of the spectrum. We also show the spectra associated with fitting the residuals to two models for the internal variability. Both are standard autoregressive-moving average (ARMA) models (Appendix Section A4 includes a definition), allowing for dependence between the noise at each time step with the noise for past values. The two models we compare are an ARMA(4,1) and an AR(1) model, which were chosen according to common information criteria used for selecting time series models: respectively, the small-sample corrected Akaike information criterion (AICc) (Hurvich and Tsai, 1989) and the Bayesian information criterion (BIC) (Schwarz, 1978) (See Table A2 for the coefficient estimates, innovation standard deviations, and AICc and BIC associated with these two models.) Both models appear to fit the data reasonably well; the ARMA(4,1) model arguably overfits at the higher frequencies, but the AR(1) model may be underestimating variability at the lowest frequencies. Since low frequency variability is most important to uncertainties in smooth trends, we adopt the more conservative choice  (4) trends are extended back to 1880 regardless of the timeframe used to fit the model.) The linear model appears in agreement with model (4) roughly after 1970, but not before. By contrast, model (4) produces fairly stable estimates of the mean response during the 20th century, although we note that the apparent fit to the data may be slightly poorer in the earliest part of the record.  (4), a filtered periodogram (twice applying a moving average of width 5), and the power spectra associated with fitted ARMA(4,1) and AR(1) models. The dashed lines are at ±2 standard errors associated with the filtered periodogram. The ARMA(4,1) model appears to more realistically represent low-frequency variation in the residuals, which is crucial for inferences about trends. Right, normal quantile-quantile plot for the ARMA(4,1) sample innovations. There is some evidence that the innovations are light-tailed compared to the normal distribution. of using the ARMA(4,1) model. Figure 3, right, shows the normal quantile-quantile (Q-Q) plot for the sample innovations from this model; there is evidence that the innovations are somewhat more light-tailed than Gaussian, so standard errors based on a Gaussian assumption should not be overoptimistic.
Using the fully parametric model, combining (4) and a Gaussian ARMA(4,1) noise model, we proceed with uncertainty quantification through a parametric bootstrap 4 . When applied to our model of the historical temperature record, the parametric bootstrap distribution shows, unsurprisingly, that in a relatively short time series and given a smooth past trajectory of forcings, it is difficult to distinguish between a climate with both a high sensitivity (large value of λ) and slow response (large value of ρ) versus one with a lower sensitivity (smaller value of λ) but a faster response (smaller value of ρ). The estimatesλ A andρ A are therefore strongly dependent, withλ A increasing explosively asρ A → 1 (Figure 4 shows their bivariate parametric bootstrap distribution). The strong nonlinear relationship between these two parameters, and the high degree of skewness in the marginal distribution ofλ A , are the reasons that we rely on bootstrapping for uncertainty quantification, as the typical appeals to asymptotic normality are not viable in this setting.
In the following, we will represent uncertainties using the simple bootstrap percentile method. The percentile method is subject to criticism (e.g., Hall (1988)). We have found, however, that adjusting the percentile method using a nested bootstrap results in narrower It is difficult to distinguish between the rate of response and the sensitivity using only global mean temperatures from recent history.  (4), λ A and λ N , to anthropogenic and natural sources, respectively (in units • C per doubling). The data appear to provide a lower bound for λ A , but cannot rule out even implausibly large values; the very large values are associated with slow responses (see Figure 4). The data cannot rule out λ N = 0, likely because there are few prominent volcanic eruptions in the historical record analyzed and the response to volcanic aerosols may be small. confidence limits. For this reason, we believe that the raw percentile-based intervals may be conservative in this setting, so we choose to report the apparently conservative intervals.
Our point estimates are based on a two-step procedure wherein model (4) is estimated via least squares and the ARMA(4,1) model is estimated via maximum likelihood on the residuals. While this procedure may be somewhat suboptimal compared to jointly estimating the mean and covariance structure, the two-step procedure is substantially faster, which is important for carrying out the nested bootstrap.

Uncertainties in the sensitivity parameters
When using the full 1880-2015 global mean surface temperature record, the point estimate for the centennial-scale sensitivity to anthropogenic forcing isλ A = 1.8 • C per doubling, withρ A = 0.80 (which implies mixing on decadal timescales). The estimated sensitivity to natural forcing is much smaller,λ N = 0.21 • C per doubling withρ N = 0.58. In this section, we discuss uncertainties in these estimates. Using our statistical model, the historical data appear to provide a lower bound for λ A (assuming for now that the forcings are known) but cannot rule out extremely large and implausible values on the order of tens or hundreds of degrees per doubling (the 2.5-97.5th bootstrap percentile interval is 1.5 to 690 • C per doubling). These very large values are not supported by evidence from the paleoclimate record (e.g., IPCC (2013) Chapter 5) and the approximation of the linear forcing feedback model, on which (4) implicitly relies, breaks down under high sensitivity (Bloch-Johnson et al., 2015). The large upper bound should therefore be interpreted only as a statement about the information in the historical global mean temperature data under the strict assumption that model (4) holds exactly. For the sensitivity to natural forcings, the data cannot rule out λ N = 0 (the 2.5-97.5th bootstrap percentile interval is -1.1 to 4.1 • C per doubling). Table 1 gives some intervals at different percentiles for the parameters λ A and λ N . (The parameters ρ A and ρ N are essentially unconstrained by the data.) The IPCC's own 66% "likely" interval for equilibrium sensitivity is 1.5 • C to 4.5 • C per doubling, which subjectively combines estimates from various sources using multiple lines of evidence including from ensembles of models with different physics, and accounts for other sources of uncertainty that we have so far ignored, such as uncertainty in the forcings themselves (see IPCC (2013) 10.8.2 and Box 12.2). The bulk of the distribution of our estimate is somewhat narrower than the IPCC's, likely because so far we have not accounted for uncertainty in radiative forcings; however, the IPCC rules out the very large and unphysical values in the right tail of our intervals. We here stress again that since we are estimating the centennial-scale response, the estimates of the sensitivity that we provide will tend to be lower than the equilibrium sensitivity estimated in the IPCC's interval (see the Appendix, which also compares our estimates to other observationally based estimates of a sensitivity parameter). Individual estimates of the equilibrium climate sensitivity and associated uncertainty in the literature are discussed in IPCC (2013) Section 10.8.2 (see also again the Appendix).
The main source of uncertainty in the upper bound for λ A is due to uncertainty in the "equilibration time" of the climate associated with smoothly increasing anthropogenic forcing, controlled by ρ A . If we restrict ρ A to, say, centennial scales or smaller, then the uncertainty is substantially decreased (see again Figure 4). One may argue through other lines of evidence and reasoning that extremely large values of ρ A and λ A are implausible, but the statistical model is being used to quantify the information content of the historical temperature record. The fact that additional sources of information are needed to exclude unphysical values suggests that even our minimally informed model may be overly empirical for some purposes. The inability of the data to rule out λ N = 0 is, on the other hand, probably due to the fact that there are few prominent volcanic eruptions in the historical record analyzed and that the response to volcanic aerosols may be small for the reasons discussed above.

Uncertainties in near-and long-term trends
The uncertainties in the sensitivity and rate of response parameters imply greater uncertainties in projected longer-term future trends in global mean temperature than in the historical and near-term projected trends. To illustrate this, we examine the implied future trends under the hypothetical (extended) RCP8.5 scenario, in which radiative forcing increases and then stabilizes in the year 2150. We simulate new time series using our estimates of model (4) and the ARMA(4,1) noise model, given radiative forcings from this scenario. The projected trend and associated pointwise uncertainties are shown in Figure 5.
Projected mean temperatures, and especially their associated uncertainties, continue to increase even after stabilization of forcing. This is a consequence of the joint uncertainty in λ A and ρ A , and in particular of the inability to rule out implausible values of these parameters. If the goal, then, is to provide a long-term projection of mean temperatures given only the historical temperature record, these estimates will unsurprisingly be quite uncertain (even assuming known past and future forcings).
On the other hand, trends in the historical and near-term response are much more certain. The observations strongly suggest that mean temperatures increased in the 20th century: for example, the (2.5-97.  (4) and assuming ARMA(4,1) noise. The black curve shows the observed temperatures. Intervals are pointwise (2.5-97.5)-percentile intervals. Radiative forcing stabilizes in the year 2250, but mean temperatures and especially their uncertainties continue to increase. While uncertainties in the long-term response are quite large, due largely to the inability to rule out implausible values of λ A , the historical and near-term response is much more certain.

Decreasing uncertainty in the sensitivity parameter as more data is observed
We have shown that the short historical temperature record alone produces fairly uncertain estimates of the sensitivity parameter, λ A in model (4) (Figure 4, Table 1), and therefore of longer-term temperature trends ( Figure 5). We now examine how these uncertainties decrease as the temperature record increases (as in, e.g., Kelly and Kolstad (1999) and others). To do this, we artificially extend the temperature record by generating new synthetic time series using the mean and noise models estimated from the historical data and forcings from the same RCP8.5 scenario described above. We then reestimate model (4) for each synthetic series, using successively longer synthetic datasets. The results suggest that the data will not constrain the upper bound on the sensitivity parameter until another ∼50 years, by which time (under our estimated model and the RCP8.5 scenario) temperatures will have already risen by about 3 • C from the preindustrial climate. A summary of the evolution of uncertainties is given in Table 2. These estimates could be more strongly constrained by using additional physical information. As discussed previously, the very high sensitivity estimates in the bootstrap distribution are cases where the estimated response time is unphysically long. Without external information about this timescale, however, long data records are required to rule out the large values of ρ A and λ A that the model entertains.

Is there evidence of long memory internal variability in global mean temperatures?
One of the complicating factors in estimating trends in climate time series is the question of whether global mean temperatures exhibit long memory. Long memory processes have power spectra that behave like (2 sin(ω/2)) −2d as the frequency ω → 0, with d > 0 (i.e., infinite power at the frequency zero). When d < 1 the process has finite total variance, as would be expected for a variable like global mean temperature. For more details on long memory processes, see Beran et al. (2013). By contrast, short-memory processes (like the ARMA(4,1) model we assume), have finite power at the origin. Many authors have suggested that internal temperature variability is well-modeled by processes with long memory but finite variance (e.g., Frankignoul and Hasselmann (1977), Bloomfield (1992), Smith and Chen (1996), Gil-Alana (2005), Lennartz andBunde (2009), Fraedrich et al. (2009), Rypdal and Rypdal (2014), Løvsletten and Rypdal (2016), and many others). Some authors have moreover claimed that global mean temperatures are well-modeled by a random walk (e.g., Gordon (1991)), as has at least one standard time series textbook (Shumway and Stoffer, 2013), which would imply that global mean temperatures do not have a finite variance over time. In either case, if the Earth's temperatures exhibited long memory, it would be more difficult to estimate trends than in the short memory case, since low-frequency variability would then be more difficult to distinguish from trends. The evidence for long memory, however, strongly depends on the assumed trend model. Many of the aforementioned authors draw their conclusions by assuming a linear time trend model and applying that model to the temperature record on durations of decades to over a century. (One notable exception is Rypdal and Rypdal (2014), but they only compare long memory noise models to AR(1) models.) As discussed previously, a linear trend model applied to a time series with a nonlinear trend will imply excessive low-frequency noise. Figure 6 shows the periodograms of the residual global mean temperatures after removing either a linear time trend or a trend of the form of (4). While the high-frequency behavior of the residuals is not much affected by the choice of trend model, the low-frequency behavior is very much affected. Apparent low-frequency variability is made more severe by assuming that mean temperatures increase linearly in time.
The question of long memory cannot be definitively settled using a dataset of only 136 observations, and other analyses make use of longer climate model runs or the paleoclimate record (e.g., Mann (2011)). Nevertheless, it should be clear that the linear time trend model is especially problematic for this purpose. In general, regression models in time, linear or otherwise, have a danger either of mistaking systematic trend for apparent low frequency variability (as just described), or of mistaking low-frequency variability for systematic trend (as would occur, for example, when using a nonparametric regression with too small of a smoothing bandwidth). Either can lead to misstated uncertainties, and therefore can be problematic even if the claim is that the trend model is only being used to test for significant warming and that the model is not believed to be true.

Implications of uncertain inputs: radiative forcing and temperatures
The analysis thus far has assumed that both radiative forcings and temperatures are known exactly, but uncertainty in the sensitivity and in trends also propagates from uncertainty in these quantities. We therefore discuss at least roughly the potential implications of imperfect knowledge of these inputs.
Of the two factors, uncertainty in radiative forcings, particularly from aerosols, is more consequential, especially for the inferred lower bound of the sensitivity parameter from model (4). The importance of radiative forcing uncertainty for uncertainties in climate sensitivity has been widely noted (e.g., Gregory and Forster (2008); Padilla et al. (2011);Otto et al. (2013); Masters (2014); Lewis and Curry (2015); Myhre et al. (2015)). The trajectory of net effective radiative forcing from anthropogenic sources is poorly known; the IPCC states that the difference in net effective radiative forcing from anthropogenic  (2) (dashed) and (4) (solid) fit to the full data record. There is substantially more low-frequency variability in the residuals from the linear trend model than in the residuals from model (4). Misspecified mean models will give a misleading impression of low-frequency variability, and therefore misleading uncertainties associated with the mean trend.
sources between the years 2011 and 1750 was about 2.29 W/m 2 with a 95% confidence interval of 1.13 to 3.33 W/m 2 . We explore the implications of this uncertainty by simply scaling the entire trajectory of net radiative forcings from 1750 to the present such that the uncertainty in 2011 is as stated. Adjusting aerosol forcings to the high or low ends, respectively, produces sensitivity estimates from (4) varying by over a factor of two, from 1.2 to 3.7 • C per doubling (vs. the original estimate 1.8 • C). The aerosol uncertainty appears important: the value 1.2 • C per doubling is lower than all of the bootstrap values of λ A generated assuming known forcings ( Figure 4 and Table 1). Uncertainties in the global mean surface temperature record are comparatively less important. To partially address this issue, we re-estimate model (4) using each of the 100 ensemble members of the HadCRUT4 global mean temperature ensemble. The point estimates of λ A range from 1.5 to 2.1 • C per doubling, with estimates of ρ A ranging from 0.79 to 0.90. The point estimates of λ N range from 0.71 to 20 • C degrees per doubling, with estimates of ρ N ranging from 0.86 to 0.996. The additional uncertainty induced by observational uncertainty in the temperature record is smaller than either the uncertainty induced by internal temperature variability or the uncertainty in radiative forcings.

Bayesian methods
Some of the uncertainties discussed so far could be addressed in a Bayesian framework (as in, e.g., Forest et al. (2002), Forest et al. (2006), Padilla et al. (2011), andAldrin et al. (2012)) although we do not pursue that approach here. For example, the effects of uncertainties in temperatures and radiative forcings could be modeled hierarchically, and outside information could be used to constrain long-timescale responses not informed by the data. We have instead chosen here to focus on the information contained in global mean temperature observations about trends to illustrate some important sources of uncertainty in empirical analyses even assuming known inputs.

Parametric vs. nonparametric uncertainty quantification
In Sections 3 and 4, we showed that analyses of the Earth's systematic temperature response are better informed by making physical assumptions than by an ostensibly more empirical approach. In this section, we ask a similar question about characterizing internal variability, in settings similar to that discussed, where data are temporally correlated and limited in length. We used a Gaussian ARMA(4,1) noise model in Section 4; while we do not claim a physical motivation for this model, the model does make relatively strong statistical assumptions. Is it more robust to assume a low-order parametric model for the noise, as we have done, or to adopt a nonparametric approach? The answer to this question depends on the length of the data record and the nature of the internal variability.
For the purposes of this illustration, we will use not the actual temperature record but rather some simple synthetic examples. We consider several artificial, trendless time series (the true mean of the process is constant) with temporally correlated noise, and evaluate the results of testing for a linear time trend (i.e., fitting model (2) and testing against the null hypothesis that β = 0) using different parametric (both correctly specified and not) and nonparametric methods for estimating uncertainties. The ordinary least squares standard errors, which assume uncorrelated noise, tend to be anticonservative (too small) and time series methods are supposed to ameliorate this overconfidence, but may or may not be successful in this regard depending on the context. The illustrations are simple but our conclusions should be relevant to actual data analysis and to more informed models.
We consider a few parametric approaches common in time series analysis. The typical practice is to assume that the noise follows a low order model, such as an ARMA model. Ideally, the noise model would be chosen (as in Section 4) in consultation with diagnostic plots and an information criterion like the AICc that balances model fit with the number of statistical parameters. However, it is common practice to automatically select a model either solely by minimizing the AICc or another criterion, or just to assume a simple time series model, often the AR(1) model. Uncertainty in the trend parameter(s) can then be estimated in several ways. Usually, this is done assuming asymptotic normality for maximum likelihood estimators, and a test might be carried out using a t-statistic. If asymptotic normality is not viable, as for model (4), an alternative is to resort to using a parametric bootstrap (again as we do in Section 4) and to perform the test using bootstrap p-values.
We also evaluate the perhaps most typical nonparametric method for accounting for dependence in a time series, the block bootstrap (Kunsch, 1989). Here, residuals are resampled in blocks to generate bootstrap samples that retain much of the dependence structure in the original data. A popular variant is the circular block bootstrap (Politis and Romano, 1992), in which blocks can be overlapping and blocks starting at the end of the time series wrap back to the beginning. The block bootstrap has been a common method in the climate and atmospheric sciences for at least two decades (Wilks, 1997) and has been applied previously to test for time trends (or changes thereof) in the global mean surface temperature record.
For example, Rajaratnam et al. (2015) argue that the circular block bootstrap gives better uncertainty estimates than does a parametric analysis in the setting of testing for a trend in a 16-year segment of the global mean temperature record.
While the block bootstrap works very well in some settings, the procedure is not free of assumptions. Like other variants of the nonparametric bootstrap, its justification is based on an asymptotic argument: for the block bootstrap to work well, the size of the block has to be small compared to the overall length of the data but large compared to the scale of the temporal correlation in the data. When the overall data record is short and internal variability is substantially positively correlated in time, as for the historical global mean temperature record, these dual assumptions may not both be met and we should not expect the block bootstrap to perform well.
In the following, we compare five methods (four parametric and one nonparametric) for generating nominal p-values testing for a significant linear time trend in trendless synthetic data: (a) a t-test assuming independent noise, (b) a t-test assuming Gaussian AR(1) noise, (c) bootstrap p-values from a parametric bootstrap assuming Gaussian AR(1) noise, (d) a t-test assuming Gaussian ARMA(p,q) noise with the order chosen by minimizing AICc, and (e) bootstrap p-values from a circular block bootstrap. We generate synthetic data from three different time series models, but in cases (b) and (c), the assumed parametric model is always AR(1). (Table A3 summarizes the models from which we simulate, which were chosen in part to share a similar strength of correlation with the observed global mean temperature record.) In cases (a), (b), and (d), the t-test degrees of freedom are approximated by n−2 minus the number of parameters in the noise model. The parametric models in cases (a) through (d) are all estimated via maximum likelihood. In case (e), we estimate p-values using a few different block lengths and show the results most favorable to the block bootstrap.
After generating nominal p-values, we evaluate the performance of the different methods. Since the null hypothesis is true in this artificial setting (the synthetic series are trendless), a correct p-value would be uniformly distributed. Deviations from the uniform distribution would then be a sign that the procedure generating the nominal p-value is uncalibrated. We therefore use Q-Q plots to compare the distribution of the simulated nominal p-values with the theoretical uniform distribution. Uncertainties are underestimated when the nominal p-value quantiles are smaller than the theoretical quantiles (i.e., when the Q-Q plot is below the one-to-one line). In this situation, inferences are anticonservative and the tests using the selected method will have Type I error rates that are larger than the nominal rate.

Parametric vs. nonparametric methods under a correctly specified noise model
We first compare the performance of the five methods for generating nominal p-values in the setting where the assumed AR(1) model is correctly specified (Figure 7). Unsurprisingly, pre-specified parametric time series methods give reasonably calibrated inferences when the parametric model is correctly specified (Figure 7, rows 2 and 3), although maximum likelihood gives somewhat anticonservative estimates of uncertainty with small sample sizes. The anticonservative bias of the maximum likelihood estimator can be reduced by instead using restricted maximum likelihood (REML) (e.g., McGilchrist (1989)) (see Appendix A5), but we focus here on the performance of maximum likelihood because it is more usually the procedure employed. It is also unsurprising that p-values under assumed independence are quite uncalibrated and anticonservative in this setting (Figure 7, top row), because standard errors are underestimated when positive temporal correlation is ignored. It may, on the other hand, be surprising that automatically chosen parametric methods and nonparametric methods (blind selection via AICc and the block bootstrap, Figure 7, rows 4 and 5) can perform even more poorly than assuming independence if the sample size is very small. The AICc does improve on the AIC (not shown) by attempting to account for small sample biases, but still performs poorly in very small samples. The (nonparametric) block bootstrap is the worst-performing method at the smallest sample sizes while ostensibly based on the weakest assumptions. For either of these methods to perform comparably to the pre-specified parametric methods, sample sizes must be quite large, indeed in this illustration larger than the available global mean temperature record. This should serve as a warning against using these methods for time series of modest length.
In actual practice, it can be advantageous, as we already discussed, to choose a noise model not automatically but in consultation with diagnostics (such as by comparing theoretical spectral densities with the empirical periodogram). In Section 4, we chose a noise model with consideration for the model's representation of low-frequency variability. In that example, the chosen model did minimize the AICc, but gave more conservative inferences than the AR(1) model (which was chosen by BIC). The illustration above shows that this behavior is not generically true and that for small sample sizes it is dangerous to blindly select models by minimizing an information criterion alone.

Parametric vs. nonparametric methods under a misspecified model
The comparisons in the previous section were too favorable to the pre-specified parametric methods because the order of the specified noise model (an AR(1) model) was known to be correct. Now we compare these methods when the assumed noise model is misspecified. The performance of misspecified methods will depend in particular on how the misspecified model represents low-versus high-frequency variations in the noise process. Models that underestimate low-frequency variability will tend to be anticonservative for estimating uncertainties in smooth trends, whereas those that over-estimate low-frequency variability will tend to be conservative. We therefore repeat the illustrations in the previous section generating the synthetic time series from two different noise models (but still using the pre-specified AR(1) model to generate nominal p-values). The two models are chosen so that their best AR(1) approximations either under-or over-represent low-frequency variability. Figure 8 shows the spectra corresponding to these two noise models and the best AR(1) approximation to each, and Figures 9 and 10 show Q-Q plots corresponding to how the various time series methods perform in these two settings. See Table A3 for the model parameters and Section A3 for a comparison under two additional models, including when the true model is the ARMA(4,1) model we assume in our main analysis of the global mean temperature record.
First, we consider an ARMA(1,1) process whose best AR(1) approximation over-represents low-frequency variability (Figure 9). The results are similar to those when the model was correctly specified, except that at the largest sample sizes the pre-specified parametric methods slightly overestimate uncertainties, for the reasons discussed above. As before, both blind selection via AICc and the block bootstrap perform well for large sample sizes but very poorly for the smallest sample sizes.
Second, we consider a fractionally integrated AR (1) Figure 7: Quantile-quantile plots comparing the distribution of nominal p-values to the theoretical uniform distribution. Simulations are of mean zero, Gaussian AR(1) time series and p-values correspond to a two-sided test for a linear time trend. The length of the time series is given above each column. In the first row, the p-values are from the OLS t-test assuming independent noise; the second row is the t-test assuming AR(1) noise; the third row uses a parametric bootstrap (again assuming AR(1) noise); the fourth row uses a t-test assuming ARMA(p,q) noise with the order of the model selected by AICc; and the last row uses a circular block bootstrap with block size chosen to be favorable to this method. The p-values calculated assuming the correct parametric model appear approximately correct for modest sample sizes and always outperform both blind selection by AICc and the block bootstrap. The latter two methods can be worse than assuming independence when sample sizes are very small. True Spectrum Best AR(1) Approximation Figure 8: The power spectra associated with the two models from which we generate synthetic time series in Section 5.2, along with spectra of the best AR(1) approximations (in Kullback-Leibler divergence) to both models; see Table A3 for noise model parameters. The corresponding illustrations of the performance of the various time series methods are shown in Figures 9 and 10. The AR(1) model will tend to overestimate low-frequency variability in the first case and underestimate it in the second. memory process, the best AR(1) approximation (and indeed any ARMA model) will severely under-represent low-frequency variability ( Figure 10). All of the methods struggle in this setting and produce anticonservative estimates, but the pre-specified parametric methods still typically perform better than the ostensibly more flexible methods.

Frequency
These results confirm that approaches to representing noise that appear to weaken assumptions are not guaranteed to outperform even misspecified parametric models. Misspecified parametric models are most dangerous when low-frequency variability is underrepresented, but methods like the block bootstrap will also have the most trouble when low-frequency variability is strong because very long blocks will be required to adequately capture the scale of dependence in the data. While it is crucial for the data analyst to scrutinize any assumed parametric model, we believe that in many settings when the time series is not very long relative to the scale of correlation, one will be better served by carefully choosing a low-order parametric model rather than resorting to nonparametric methods.
Note that this illustration uses synthetic simulations that are relatively strongly correlated in time, a feature of the global mean temperature record. Nonparametric methods can work better than illustrated here in settings where correlations are weaker. For example, local (rather than global) temperatures, at least over land, tend to be more weakly correlated in time. In general, it is useful to evaluate the performance of of statistical methods with simulations that share characteristics with the relevant real data.

Discussion
We have sought to show here that targeted parametric modeling of global mean temperature trends and internal variability can provide more informative and accurate analyses of the global mean temperature record than can more empirical methods. Since all analyses involve assumptions, it is important to consider the role that assumptions play in resulting conclusions. In the setting of analyzing historical global mean temperatures, where the data record is relatively short and temporal correlation is relatively strong, ostensibly more empirical methods can fail to distinguish between systematic trends and internal variability, and can give seriously uncalibrated estimates of uncertainty. While linear-in-time models can be used for some purposes when applied to moderately narrow time frames (and with careful uncertainty quantification), the demonstrations shown here suggest that they do not have an intrinsic advantage over more targeted analyses. Targeted analyses can be used over longer time frames -allowing for better estimates of both trends and noise characteristics -and can address a broader range of questions within a single framework. The model we use in our analysis provides insights about the information contained in the historical temperature record relevant to both shorter-term and longer-term trend projections. The limited historical record of global mean temperature can provide information about shorter-term trends but unsurprisingly cannot constrain long-term projections very well. The past 136 years of temperatures simply do not alone contain the relevant information about equilibration timescales that would be required to constrain long-term projections, especially when aggregated to a single global value. (Use of spatially disaggregated data may provide additional information.) The distinction between uncertainties in shorter-term and longer-term projections, itself not easily made using a time trend model, serves to further illustrate that while the historical data record is an important source of information, it alone cannot be expected to answer the most important questions about climate change without bringing more scientific information to bear on the problem.
We believe that our discussion is illustrative of broader issues that arise in applied statistical practice, and will have particular relevance to problems involving trend estimation in the presence of temporally correlated data and in relatively data-limited settings, common in climate applications. We suspect that many applied statisticians have personally felt the tension between targeted modeling on the one hand and more empirical analyses on the other. One lucid discussion of the broader issues surrounding this tension can be found in the discussion of model formulation in Cox and Donnelly (2011). Empirical approaches have very successfully generated new insights and predictions in important areas, but there is a wide range of scientific problems where these approaches do not perform well and where targeted, domain-specific modeling is required. Statisticians can bring important insights to scientific problems; a crucial role for the statistician is to consider the modeling choices that will be both the most illuminating and the most reliable given the scientific questions and data at hand.

Data availability
The NASA GISS Land-Ocean Temperature index is updated periodically; the data we analyze was accessed on the date 2016-02-03. The current version is available at http://data.giss.nasa.gov/gistemp/. The HADCRUT4 data, used in Section 4.5, is available at http://www.metoffice.gov.uk/hadobs/hadcrut4/data/ Historical radiative forcings until 2011 are available in IPCC (2013)  Table A1: Comparison of estimates of a sensitivity parameter from studies that use observational data and a simple energy balance approach. The large best (median) estimate from Gregory et al. (2002) is due to a very fat right tail in their analysis; the mode of their distribution is 2.1 • C per doubling. The estimates given in these studies are similar to our estimates of λ A (which should be smaller than the equilibrium sensitivity).

Study
Best We can also compare our estimate of λ A to prior estimates of a sensitivity parameter that also use the historical temperature record. The most typical approach in the literature shares some commonalities with our method, beginning with the same linear forcing-feedback model that implicitly underlies our analysis, but estimating a sensitivity parameter by using an additional observational estimate of global heat uptake. That is, studies use estimates of changes in forcing, F , heat uptake Q , and historical temperature change, ∆T , between a base and a final period to computeL = (F −Q )/∆T and thereforê λ = F 2× ∆T /(F − Q ). Studies using this method include Gregory et al. (2002), Otto et al. (2013), Masters (2014), and Lewis and Curry (2015). The resulting sensitivity parameter estimate should be similar to our λ A . Table A1 shows the results from these analyses; the central estimates are indeed similar to our estimate of λ A . The uncertainty ranges given in these studies, however, also tend to be slightly larger than the intervals we give for λ A , again because these authors attempt to account for radiative forcing uncertainty in their analysis.
A distinguishing feature of this other, common approach is that it includes data about historical heat uptake, in addition to temperature and radiative forcing data. Ocean heat content is an additional, albeit uncertain, source of information that may improve estimates. On the other hand, since these methods do not involve an explicit trend model and require averaging the inputs over decadal or longer timespans, they cannot use the historical temperature record to estimate internal temperature variability. Most studies therefore estimate internal variability using climate model output, but climate models do not perfectly realistically represent even present-day variability in global annual mean temperature.  Table A2: Coefficient estimates (with standard errors in parentheses), innovation standard deviations, and AICc and BIC associated with ARMA(4,1) and AR(1) fits to the residuals from model (4). These two models minimize AICc and BIC, respectively. We use the ARMA(4,1) model because this model appears to better represent the low-frequency variation in the residuals (see Figure 3), which is crucial for inferences about trends.
By contrast, an advantage of our approach is that it allows one to use the historical data to understand internal variability. Additionally, our approach allows one to answer questions about both historical trends and longer-term projections in the framework of one statistical model, whereas the approaches discussed above do not allow one to infer trends in increasing-in-time forcing scenarios. A disadvantage of our approach is that, as discussed above, we rely on the historical global mean temperature record to estimate the "equilibration" timescales (ρ A and ρ N ), but the data contain little information about these quantities.
Regardless of the different advantages and disadvantages just discussed, both approaches to using the historical temperature record give similar results concerning the sensitivity parameter, and uncertainties in this parameter remain high. This demonstrates the limitations of the information content of the historical global mean temperature record alone for estimating longer-term projections of mean temperature changes. As noted in Section 6, spatially disaggregated data may contain more information. Table A2 gives information about the ARMA(4,1) and AR(1) models fit to the residuals of model (4) used in Section 4. The ARMA(4,1) model give more conservative inferences about the systematic trend, and is the model we adopt in Section 4 (see Figure 3).

A3 Performance of misspecified methods under AR(2) and
ARMA(4,1) simulations In Section 5.2, we compared the five methods for generating nominal p-values in two settings where the pre-specified AR(1) model was incorrect, where the AR(1) model either over-or under-represented low-frequency variability. Here we present two more comparisons. For these, Table A3 gives the model parameters and Figure A1 shows the two true spectra and the corresponding best AR(1) approximations. The first comparison is to an AR(2) noise model under which the AR(1) approximation over-represents variability at the lowest frequencies but under-represents variability at intermediate frequencies ( Figure A1, left).  (2) (left) and ARMA(4,1) (right) models considered in Section A3. See Table A3 for noise model parameters.
The second comparison is to the ARMA(4,1) model used in our main analysis, under which the AR(1) approximation under-represents variability at the lowest frequencies ( Figure A1, right).
First, we show results for the AR(2) true noise model model ( Figure A2). In this setting, all methods perform better than in the setting of Figure 9, but the relative performance of the different methods is largely the same as there, except that selection by AICc now performs even more poorly than the block bootstrap at the smallest sample sizes.
Second, we show results for the ARMA(4,1) model used in our main analysis (Figure A3). In this setting, the results are similar in nature to but less extreme than those from the fractionally differenced AR(1) model shown in Figure 10. All methods perform poorly but the parametric methods tend to perform better, especially when the time series length is short.
The behavior from both of these simulations again serves to emphasize that when making inferences about smooth trends, it is most crucial to represent low-frequency variability well.

A4 Model coefficients for simulations in Sections 5 and A3
In Sections 5 and A3, we generate synthetic, mean zero time series with different correlation structures. The models that we simulate from are summarized in Table A3. The general form for an autoregressive fractionally integrated moving average model (ARFIMA) of order (p, d, q) (of which all the simulated models are special cases) is where Y t is the time series at time t, the φ's are the AR parameters, the θ's are the MA parameters, d is the (fractional) differencing parameter, B is the backshift operator    Figure A4: Comparison of maximum likelihood and REML in the same context as Figure 7. The first row is the same as the second row of Figure 7. In the second row here, the estimation is instead done using REML. The REML standard errors give better calibrated inferences in small sample sizes.
(i.e., B k Y t = Y t−k ), and (t) are uncorrelated innovations with constant variance. (The convention is that the acronym is shortened to account for parameters that are set to zero, so for example an ARFIMA(1,0,0) model is called an AR(1) model.)

A5 Performance of maximum likelihood vs. REML
In Section 5, parametric inference was done using maximum likelihood estimators. As shown there, the MLE for covariance parameters can give anticonservative estimates of standard errors for trend parameters in small sample sizes. This problem can be substantially ameliorated using restricted maximum likelihood (REML) instead. Figure A4 repeats the tests in Section 5.1 (where both the true and assumed models are AR(1)) and compares the performance of maximum likelihood and REML. For larger sample sizes, the two procedures are comparable, but in small sample sizes REML is much better calibrated (although still slightly anticonservative in the smallest sample sizes).