Article Text

PDF

A bootstrap method to avoid the effect of concurvity in generalised additive models in time series studies of air pollution
  1. Adolfo Figueiras1,
  2. Javier Roca-Pardiñas2,
  3. Carmen Cadarso-Suárez3
  1. 1Department of Preventive Medicine, University of Santiago de Compostela, Spain
  2. 2Department of Statistics and Operations Research, University of Vigo, Spain
  3. 3Unit of Biostatistics, Department of Statistics and Operations Research, University of Santiago de Compostela
  1. Correspondence to:
 Dr A Figueiras-Guzmán
 Dto de Medicina Preventiva y Salud Pública, Facultad de Medicina, c/San Francisco s/n, 15705 Santiago de Compostela (A Coruña), Spain; aldolfo.figueirasusc.es

Abstract

Background: In recent years a great number of studies have applied generalised additive models (GAMs) to time series data to estimate the short term health effects of air pollution. Lately, however, it has been found that concurvity—the non-parametric analogue of multicollinearity—might lead to underestimation of standard errors of the effects of independent variables. Underestimation of standard errors means that for concurvity levels commonly present in the data, the risk of committing type I error rises by over threefold.

Methods: This study developed a conditional bootstrap methology that consists of assuming that the outcome in any observation is conditional upon the values of the set of independent variables used. It then tested this procedure by means of a simulation study using a Poisson additive model. The response variable of this model is a function of an unobserved confounding variable (that introduces trend and seasonality), real black smoke data, and temperature. Scenarios were created with different coefficients and degrees of concurvity.

Results: Conditional bootstrap provides confidence intervals with coverages close to nominal (95%), irrespective of the degree of concurvity, number of variables in the model or magnitude of the coefficient to be estimated (for example, for a concurvity of 0.85, bootstrap confidence interval coverage is 95% compared with 71% in the case of the asymptotic interval obtained directly with S-plus gam function).

Conclusions: The bootstrap method avoids the problem of concurvity in time series studies of air pollution, and is easily generalised to non-linear dose-risk effects. All bootstrap calculations described in this paper can be performed using S-Plus gam.boot software.

  • GAM, generalised additive models
  • BS, black smoke
  • air pollutants
  • computing methodologies
  • epidemiological research design
  • risk assessment

Statistics from Altmetric.com

In recent years, time series studies with Poisson regression using generalised additive models (GAMs)1–4 have been the reference method for analysing the short term health effects of air pollution. Lately, however, these models have been shown to suffer from an important limitation, namely, that there is underestimation of the standard error (SE) of the estimated effect of any given pollutant in those cases where concurvity is present in the data.5–8

Briefly, concurvity is the non-parametric analogue of multicollinearity, in which a function of one of the independent variables can be approximated by a linear combination of functions of the remaining variables, with these functions being estimated in the same way as the corresponding functions in the original model.6 It has been seen that, in the presence of concurvity in GAMs, confidence intervals (CI) are too narrow, p values are understated, and type I error rate is greater than that established a priori6,8 (for example, for a concurvity of 0.6—a value based on real data—type I error has shown a rise of almost threefold6). One way of eliminating the problem of underestimation of SE in the presence of concurvity is to have recourse to bootstrap resampling techniques. Accordingly, in this paper we propose the use of a conditional bootstrap method in GAMs to calculate valid CI for the estimated effect of any given pollutant. For the purpose of performing such calculations, we have developed a routine, termed gam.boot.

METHODS

Conditional bootstrap

In this type of bootstrap,9–13 B bootstrap replicates are generated. In each of these, the values of the independent variables are the same as those of the observed data, with only the values of the response variable being varied from replicate to replicate. The value assumed by the outcome in each observation is conditional (hence the technique’s name) upon the values of the set of independent variables in said observation. To this end, a Poisson model is first applied to the data for the sample, and the coefficient and predictions of the model are then obtained for each observation (also known as pilot estimates, as these are taken as reference).

To illustrate this procedure, it is as well to start with a simple model (see expression 1) having only two covariates X = (X1,X2), one of which, X1, is linearly related to the response via parameter β1 (the parameter of interest), and the other, X2, is related to the response via an unknown smooth function, f. Based on n observations (X1,Y1),…, (Xn,Yn), with Xi = (Xi1,Xi2), the following model is then fitted

Embedded Image

and thereafter the estimated coefficient β̂1 and predictions μ̂1,…μ̂n are obtained for each of the observations.

In a second step, the B conditional bootstrap replicates from b = 1 to B (for example, B = 1000) are generated, so that the values of the dependent variable, Yi, in each observation follow a random Poisson distribution with a mean of μ̂i, that is to say, Yi*bPoisson (μ̂i). In each of these B replicates, an estimate of the coefficient β̂1*b is obtained, using a GAM with the same number of degrees of freedom as the original model.

Finally, the 100percent×(1−α) limits for the confidence interval of β1 are given by (2β̂1−β̂11−α/2, 2β̂1−β̂1α/2) where β̂p represents the percentile p of the bootstrapped estimates β̂1*1,…,β̂1*B.

Bootstrap validation: a simulation study

Scenario

A three year (1096 day) series was constructed in which the number of events per day followed a Poisson distribution, in line with model (2) below,

Embedded Image

where Yt denotes the number of events per day, BSt is the black smoke, and tempt is the temperature recorded on day t. The coefficient β1 denotes the log relative rate of Y associated with increase in black smoke (BS). The BS is genuine and was drawn from data recorded for the city of Vigo (north west Spain) over the period 1996–1998 (see fig 1A). By default, the true β1 had a value of 0.001, although other values for this parameter (from 0.001 to 0.008) were also considered to ascertain the effect of coefficient magnitude on the 95% CI coverage and bias in the estimated coefficient, β̂1.

Figure 1

 Basis of the simulation. (A) Daily mean black smoke (BS) levels in Vigo for 1096 days, in the period 1996–1998. (B) Effect of unobserved confounding variable that has a seasonal and trend component. (C) Smoothed temperature effect (ftemp).

In model (2), trend is a function that introduces trend and seasonality into the simulated data to simulate an unobserved confounding variable in the data. In ecological time series studies, smooth functions (smoothing splines or LOWESS) of the time variable 1–4 are used to control for the confounding effect of these unobserved confounding variables. The trend function is a sinusoidal function (see fig 1B) that had already been used in other simulation studies.14–16 In (2), as in such studies, β2 assumes the value of 0.1. The function ftemp represents the functional form of the effect of temperature on mortality in Vigo (see fig 1C).

Concurvity generation

In our simulation study, the effect of interest was BS. The concurvity measure proposed by Ramsay et al6 was based on the correlation between the fitted values obtained from the GAM with pollution, BSt, as the response, and time and temperature as the predictors. Specifically, to assess the degree of concurvity in our data, one should: (a) fit the GAM

Embedded Image

where the partial functions g1 and g2 are estimated using smoothing splines with 7 and 4 degrees of freedom respectively, and εt is a zero mean error variable; and (b) then compute the squared correlation (R2) between BSt and, using the fitted values from (3), namely:

Embedded Image

In our real data, concurvity was 0.29. To vary the concurvity and thereby assess its influence on CI coverage and bias in the parameter estimate, β̂1, in model (2), the original BS was replaced by a new variable

Embedded Image

where ε̂t = BSt▒BSt, k1 and k2 were constants, such that 0⩽k1,k2⩽1 and 0⩽k1+k2⩽1. Note that BSst is a combination of the original BSt of the estimate ▒BSt and the errors ε̂t = BSt−▒BSt obtained from the model (3). Thus, by varying the constants k1 and k2, the degree of concurvity between BSs and the remaining independent variables may easily be modified, from 0 (k1 = k2 = 0) through 1 (k1 = 0, k2 = 1).

Data analysis

In our study, a number of scenarios were defined using different values for the coefficient β1 (from 0.001 to 0.009) and different values for k1 and k2, so that the degree of concurvity varied from 0.05 to 0.65. In each of the scenarios, 1000 samples {t,tempt, Yt} were generated with 1096 points in time (days) each (t = 1,…,1096) based on the model (2). In each sample, β̂1 was obtained on the basis of the estimated model (4)

Embedded Image

Lastly, the corresponding bootstrap and asymptotic 95% CI were calculated for β1. The CI coverages were calculated as the proportion of samples in which these included the true β1.

The estimates in (4) were obtained by using smoothing splines with 7 degrees of freedom per year for the estimated trend effect, f̂trend (following Dominici et al),5 and 4 degrees of freedom for estimated temperature effect, f̂temp.

RESULTS

Figure 1A shows the BS sequence for the city of Vigo, which was used as the independent variable to generate the replicates; fig 1B shows the unobserved confounding variable having trend and seasonality; fig 1C shows the functional form of the relation between temperature and mortality, which was obtained from previous estimates in a relation observed for Vigo.

In table 1, the CI coverages are compared for different degrees of concurvity using asymptotic GAM and GAM bootstrap. It will be seen that, with the asymptotic approach, increases in concurvity were accompanied by a progressive decline in coverage (see fig 2A), which decreased to as low as 80% when concurvity was 0.6. With the bootstrap method, however, coverage remained above 94% throughout, irrespective of the degree of concurvity. The coverage was also evaluated for different values of the true coefficient. With regard to the coefficient’s influence on coverage (see fig 2B), coverage proved seemingly independent of the magnitude of the coefficient, for the asymptotic and bootstrap methods alike.

Table 1

 Effect of concurvity on confidence interval coverage*. Results of the simulation study using asymptotic (standard method in gam software) and conditional bootstrap. True coefficient = 0.001

Figure 2

 Effect of concurvity (A) and magnitude of coefficient associated with black smoke (β1) on 95% CI coverage* (B). *Coverage = % of replicates that include the true value of the coefficient within the 95% CI of the true coefficient 0.001. Nominal coverage is 0.95.

DISCUSSION

The results of our study show that application of conditional bootstrap techniques could enable the CI for the effect of pollutants in time series studies to be calculated with optimal coverage, regardless of the degree of concurvity, number of covariates in the model or magnitude of the coefficient to be estimated.

The publication of two scientific papers a few months ago, the first by Dominici et al5 and the second by Ramsay et al,6 showed that standard errors are considerably underestimated in the presence of concurvity.7,8 To eliminate the problem of concurvity, flexible regression methods can be used, such as parametric cubic splines6 or penalised splines, which do not require the use of backfitting to estimate model components. However, parametric cubic splines might not provide sufficient flexibility to capture the trend or seasonality of the series, as the knots have to be established a priori. In this respect, penalised splines17,18 might afford greater flexibility, although, as far as we are aware, no assessment has yet been made of the performance of such smoothers in the context of time series studies.

To eradicate the problem of concurvity, consideration has been given to the application of alternative designs, such as that of case-crossover, which control trend and seasonality by means of design and thereby eliminate the effect of concurvity on the estimation of standard errors.16 Dominici et al19 have introduced a closed form estimate of the asymptotically exact covariance matrix of the linear component of a GAM, and offer a development of the gam.exact package, which is an extended version of gam.20 However, it has also been observed that, in given conditions (with many smoothing functions included in the model), the routine gam.exact can give rise to standard errors very much greater than those that would result from application of parametric techniques.

In this paper, we propose an alternative method based on conditional bootstrap resampling techniques, which provides optimal coverages and affords the following additional advantages: (1) unlike the gam.exact method, which requires the smooth functions to be of the smoothing-spline type, gam.boot is general, in that it is applicable to models with any type of smoother, such as smoothing splines, LOWESS, natural splines, and any combination of same; (2) although our bootstrap method was initially developed to construct CI for coefficients, this approach is easily generalised to non-linear dose-risk relations. Thus, non-linear dose-response relations can be evaluated for pollutants or climatological variables; and lastly, (3) our method could also be extended to the calculation of CI for possible interactions between pollutants and/or climatological variables.

What this paper adds

Concurvity—the non-parametric analogue of multicollinearity—leads to an important underestimation of standard errors, which means that the risk of committing type I error rises by over threefold for concurvity levels commonly present in the data. We propose a method based on conditional bootstrap resampling techniques, which provides optimal coverages of the confidence intervals of the estimated effect in generalised additive models (GAMs) applied to time series data. This methodology affords the following additional advantages over a previous method: (1) our method is applicable to models with any type of smoother (smoothing splines, LOWESS, natural splines, penalised splines) and any combination of same; (2) this approach is easily generalised to non-linear dose-risk relations; and (3) our method could also be extended to the calculation of CI for possible interactions between pollutants and/or climatological variables.

Furthermore, a recent study seems to show that, in cases where the results for a number of cities are combined using hierarchical models, underestimation of standard errors in each of the cities seems to have little effect on the overall results under most conditions.21 In this regard, our method could be generalised to a hierarchical bootstrap method that would start on the basis of standard errors correctly calculated at the first level (city), and then proceed to estimate the overall effects at a multicity level.

While the main limitation of our method might possibly lie in the computational cost entailed, we do not regard this as too high. Furthermore, there is no need to apply the bootstrap method at each stage of building the model: suffice to apply it once the basal model has been constructed and the final effect of each pollutant is to be ascertained.

Indeed, the conditional bootstrap method proposed in this paper could enable the community of air pollution researchers to make a satisfactory calculation of CI in dose-response relations, whether linear or non-linear. To this end, the S-plus gam.boot software, a user friendly application, will be made freely available to readers on request (rocauvigo.es).

Policy implications

  • Up until 2002, most studies that assessed the effects of pollution on health were time series studies that used generalised additive models.

  • In 2002, however, it was shown that, because of concurvity (the analogue of collinearity), false statistically significant associations could be found at a frequency more than three times higher than that established a priori (for example, type I error of 18% compared with 5%).

  • This could imply that some of the findings reported to date on the health related effects of pollution might be erroneous. Accordingly, in time series studies that use generalised additive models, it would be advisable for such data to be assessed for the presence of concurvity and for methods such as that developed in this paper to be applied, to prevent false associations being found between pollution and health related effects.

Acknowledgments

The authors are grateful to Dr Marc Saez for his advice and helpful comments, and to Michael Benedict for his help with the translation of the manuscript.

REFERENCES

View Abstract

Footnotes

  • Funding: Dr Adolfo Figueiras’ work on this project was funded by Health Research Fund (Fondo de Investigación Sanitaria) grants 00/0010-05 and 99/1189 from the Spanish Ministry of Health, Javier Roca-Pardiñas’ work was funded by a grant from the University of Vigo (Vigo, Spain), and Dr Carmen Cadarso-Suárez’s work was funded by grant BMF2002-03213 from the Spanish Ministry of Science and Technology.

  • Conflicts of interest: none.

Request permissions

If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.

Linked Articles