Article Text

Download PDFPDF

Hazard ratio funnel plots for survival comparisons
  1. P Silcocks
  1. Trent Cancer Registry, and the NIHR Research Design Service for the East Midlands, Nottingham, UK
  1. Correspondence to Dr P Silcocks, Medical Advisor, Trent Cancer Registry, 5 Old Fulwood Road, Sheffield, S10 3TG, UK; paul.silcocks{at}


Background: Funnel plots are a form of control chart that give a snapshot of many institutions at a particular moment in time. This paper describes how funnel plots may be constructed for survival analyses based on hazard ratios obtained from a Cox regression model with adjustment for covariates and allowance for overdispersion.

Method: Analysis of simulated and real survival data.

Results: It describes how centred hazard ratio estimates adjusted for covariates can be obtained from a Cox regression and gives details of the necessary programming in Stata. Allowance for overdispersion can be made by multiplying the standard errors by a factor based on either the model or the log-rank χ2 statistics. Simulated results and a real example are presented.

Conclusion: Funnel plots based on hazard ratios are easier to interpret than multiple Kaplan–Meier survival plots, and in contrast to funnel plots based on survival at, say, 5 years, are less open to accusations of bias and use more information. The interpretation of such plots may be enhanced by using standard meta-analysis methods. Hazard ratio comparisons may now be added to the repertoire of techniques used by Cancer Registries, Primary Care Trusts, and other commissioners of healthcare.

View Full Text

Statistics from

A range of statistical procedures exists to ensure quality control of manufacturing processes. One of the simplest is a control chart, in which a measure of some critical aspect of quality for each item manufactured is plotted against time, with boundaries to indicate when quality has become unacceptable—that is, when the process is “out of control”. Control charts are closely related to hypothesis tests, in that the control chart tests the null hypothesis that the process is in statistical control.

Although funnel plots have been used as evidence in assessing publication bias since the mid-1980s,1 their value as a form of control chart for graphically comparing institutional performance was suggested by Spiegelhalter2 only relatively recently. The difference between this method and the usual control chart is that the funnel plot gives a snapshot of many institutions at a particular moment in time, as opposed to a measuring performance of a single institution at many moments in time over a defined period. In his paper Spiegelhalter described how control limits could be set for binomial proportions (including various functions of proportions including changes, ratios and ORs), standardised rates, ratios of standardised mortality ratios (SMRs) and continuous response data, but not for survival data.

When survival is to be compared between institutions, comparison at only one time point involves picking an arbitrary time (eg, at 5 years) which may hide a real difference that would be obvious if the whole survival curves were compared.3 4 Moreover the proportion surviving at a particular time has a relatively large standard error, whereas an overall comparison that reflects the whole survival curve, such as one based on a proportional hazards model, might detect a significant difference.3 A point mentioned by one of the reviewers is that some restriction of follow-up, for example to 5 years after diagnosis (or less than this for some purposes), is advisable, to avoid potential biased (or diluted/attenuated) comparisons if later follow-up is available for some groups. While a proportional hazards model may not be suitable for detecting every kind of discrepancy between survival curves, it is reasonable for a control chart in which the null assumption is that the process is “in control” and therefore the survival curves for each institution should differ simply by chance (in addition with a restricted follow-up the proportional hazards model is more likely to hold at least approximately).

While a Cox regression could be used to estimate and formally compare the hazard ratios, two points need to be considered. First, such a comparison uses one of the institutions as a reference category, so to obtain a hazard ratio for each institution the hazard ratios (or rather their logarithms) need to be centred—that is, estimated with reference to a common value. This transformation will also require a transformation of the covariance matrix of the regression coefficients corresponding to the log hazard ratios for each institution. Second the control limits correspond to confidence limits assuming the null hypothesis is true (ie, that survival curves differ only by chance), and therefore the reported covariance matrix of the regression coefficients is inappropriate.


If there are three centres each represented by a 0/1 indicator variable, the proportional hazards model can be represented as:

ln(hazard ratio) = β1D12D23D3

In which D3the indicator for the third (and reference) centre—can effectively be ignored provided its corresponding regression coefficient β3 is constrained to be zero.

The log hazard ratios are to be shifted so that they sum to a value of zero (which, if the process is in control, should also be close to their common mean). The principle involved in shifting them is as follows:


Embedded Image

Then the shifted coefficients are:

Embedded Image

(the last one because β3 is assumed to be constrained to be zero). However, it is also necessary to make a corresponding transformation to the variance–covariance matrix, and in particular the variance–covariance matrix evaluated under the null hypothesis of no between-centre differences. This is because the funnel plot can be thought of as a way of displaying significance test results.

These processes are explained at greater length in the appendix, using matrix algebra together with some additional tricks required to extract the required values from computer output.


Typically allowance needs to be made for covariates that affect case mix—age, stage and treatment being likely candidates. While in the Cox regression these variables can be adjusted for by inclusion as covariates, in the log-rank test it is necessary to adjust for these variables by stratification. Because the covariance matrix under the null hypothesis is obtained using the log-rank test, for consistency the same stratification is employed in the Cox regression to estimate the log hazard ratios.

Estimating and handling overdispersion

If overdispersion is present, the variance of the hazard ratios will be greater than predicted by the above model. The consequence is that at first sight a higher proportion of centres may appear to be out of control than is really the case. Spiegelhalter6 suggested a number of approaches to handling overdispersion—the simplest being estimation of a dispersion factor that multiplies the variance of the quantity for which control limits are to be found. This dispersion factor φ (a single value common for all k institutions) may be estimated as the value of the log-rank χ2 calculated in the usual way, divided by its degrees of freedom. Note that the overdispersion factor modifies the control limits, not the log hazard ratios actually observed and plotted.

The standard approach is to regard values of φ less than 1 as equal to 1 (that is, underdispersion is ignored)—the same assumption as is made in the Der Simonian and Laird7 estimate of between-centre variance employed in meta-analysis, so that the basic estimate of overdispersion is given by:

Embedded Image

where υ is the degrees of freedom for the χ2test.

One approach to guard against assuming overdispersion when this is just a chance finding is to set the estimate

Embedded Image

only if the χ2 is “statistically significant”. However “absence of evidence is not evidence of absence” and in addition the statistical significance will reflect both the extent of overdispersion and the precision of the hazard ratio estimates within each centre.

A better approach suggested by Spiegelhalter6 is to use a Winsorised estimate of φ. This is a method of robust estimation that is less sensitive to outliers. For this the upper and lower most extreme x% of the values in an observed set of values are replaced by the xth and 100-xth percentiles to produce a 2x% Winsorised estimate. The estimated overdispersion parameter

Embedded Image

can be written as:

Embedded Image

where k–1 is the degrees of freedom and the

Embedded Image

denote the squared standard Normal deviates for each regression coefficient (excluding the kth, reference, value). Spiegelhalter suggested Winsorising the z-scores, so that the Winsorised estimate of φ is given by:

Embedded Image

where the asterisk indicates that the Winsorised values are used.

Clearly this process must be used cautiously and typically 10% (5th and 95th percentiles) or at most 20% (10th and 90th percentiles) Winsorising is applied. If only the two extreme observations are Winsorised then 20% Winsorising implies that there must be at least 10 centres, which is close to the advice on the minimum number of degrees of freedom (around 12) needed to estimate a variance reliably.8

Note that the Winsorising is only needed to estimate φ and the un-Winsorised but centred log hazard ratio estimates are displayed in the funnel plot.

It is arguable that a Winsorised estimate of φ should be used routinely because one cause of overdispersion is the existence of unmeasured prognostic variables that are consequently not adjusted for in the analysis. Such an inadequate case mix adjustment is probably the norm in the context of routinely available statistics. One prognostic variable that it is typically possible to allow for is age at diagnosis. When stratifying for age, however, the definition of the age bands used should be declared in advance because, for instance, the resulting funnel plot based on equal width age bands may be very different from one in which age bands are defined on the basis of equal numbers of events.

The main distinction to be made on the basis of the funnel plot is between (a) general variation caused by factors common to all centres such as case mix (and which may be addressed by allowing for overdispersion), (b) the presence of one or two extreme outliers that may require investigation and which may be detected by Winsorising so that such observations don’t influence the estimated standard error, (c) skewness in the distribution of random effects, for which Winsorising may be ineffective as the effects are spread across many centres. For the last one, one possibility would be to use tests for asymmetry in funnel plots devised for meta-analysis.9 10

Practical implementation

In Stata, the covariance matrix of the score statistic can be accessed by using sts test, with the log-rank option.

  1. Assuming that the institutions being compared are labelled 1 to k, with the reference category being the first, delete the first row and column (this is necessary because the covariance matrix is singular).

  2. Invert the resulting matrix.

  3. Augment this matrix with a new first row and column of zeroes, to give the covariance matrix V0 of the regression coefficients under the null hypothesis.

  4. Transform V0 as described above.

  5. Take the square root of each diagonal element of V0 to give the standard error of each regression coefficient under the null hypothesis.

  6. Identify extent of overdispersion and adjust standard errors if necessary.

  7. Calculate control limits.

  8. Apply tests for bias in funnel plot.

Code fragments to perform these steps are given in the appendix.

Once the control limits have been calculated it is easy to plot these with the centre-specific log hazard ratios, using the reciprocal of the standard error under the null hypothesis as a measure of precision,11 and incorporating Stata meta-analysis add-on commands.


Table 1 displays the centred log hazard ratios for simulated survival data for 10 centres, roughly mimicking “old” health districts, with stratification for a case mix covariate in seven bands. The 95% and 99% control limits are also given, with and without adjustment for overdispersion which was induced by including the effect of a centre-level random variable not allowed for in the stratified analysis. For this small data set the Winsorisation did not alter φ. An average of 10 deaths per centre was used in accordance with the 10:1 number of events/variable rule.8

Table 1

Centred log hazard ratios for simulated survival data

Figure 1A shows corresponding conventional survival curves by centre, while in figure 1B these have been adjusted for case mix. The adjusted curves are closer together than the unadjusted ones, but in fact the χ2test is actually more significant.

Figure 1

(A) Cumulative survival by centre, unadjusted for case mix. (B) Cumulative survival by centre, adjusted for case mix.

Figures 2A,B displays funnel plots for the data in table 1 (without and with allowance for overdispersion). Figure 3 on the other hand displays results allowing for overdispersion when this has arisen from a positively skewed distribution for the between-centre random effect. While figure 3 superficially resembles figure 2B, the statistics for the asymmetry tests are non-significant for the data in table 1 (continuity corrected Begg’s test p = 0.118; Egger’s test p = 0.129) but are both significant for figure 3 (continuity corrected Begg’s test p = 0.029; Egger’s test p = 0.001). Figure 4 displays the Egger plot corresponding to figure 3.

Figure 2

(A) Funnel plot of centred hazard rations (no allowance for overdispersion). (B) Funnel plot of centred hazard ratios (with allowance for overdispersion).

Figure 3

Funnel plot with asymmetry (allowing for overdispersion).

Figure 4

Egger plot for asymmetry (data as in fig 3).

Figure 5A,B give a real example, using breast cancer survival data from local authorities (LAs) in the East Midlands. In both cases the results are age adjusted with a Winsorised allowance for overdispersion. In figure 5A the five age bands used have equal numbers of cases, whereas in figure 5B they have equal numbers of deaths. When age bands have equal numbers of deaths the position of LAs 8 and 27 become slightly more extreme, while LA 16 moves in to lie just on the lower 99% limit, when compared with age bands with equal numbers of case. Although the differences between the plots in this example are marginal, this need not be so in general. For figures 5A,B asymmetry tests show conflicting evidence of skewness of the random effect (continuity corrected Begg’s test; p = 0.147; whereas for Egger’s test p = 0.04)—this with 20% Winsorising and allowance for overdispersion after omitting the LA with centred coefficient closest to zero to ensure independence of the estimates.

Figure 5

(A) Local authority funnel plot with Winsorised allowance for overdispersion—equal cases/age band. (B) Local authority funnel plot with Winsorised allowance for overdispersion—equal deaths/age band.


It is not easy to detect outliers in survival by inspection of cumulative survival plots as is clear from figure 1, let alone in the presence of overdispersion. A funnel plot, by converting the survival data into a summary value (the centred hazard ratio) together with defined boundaries, does make this task easier and much less subjective. Covariates that affect survival if unevenly distributed across centres may induce additional variation between institutions or centres; their effect can be allowed for directly by performing a stratified analysis, but if these are not measured, then indirectly by incorporating overdispersion in the significance limits. If both analyses are available then it is possible to see to what extent the stratified analysis has accounted for all the overdispersion (and if not, it implies that still other, unmeasured, variables are having an effect).

Care is necessary not only in choosing the stratification variables for which adjustment is to be made but also in how these are defined. For example, although age bands defined on equal numbers of deaths might be preferred as having equal precision, this choice is debatable and the method of constructing age bands and other strata must be made before performing the analysis and inspecting the results to avoid accusations of massaging the data.

The funnel plot allowing for overdispersion should also be assessed as to whether the factor(s) inducing the overdispersion are common to all centres (symmetric plot), whether the distribution of the factors is uneven (asymmetric plot) or whether just one or two centres are outliers—which may be detected by employing a Winsorised estimate for the boundaries. Assessment of asymmetry can employ the same methods as used in meta-analysis to assess publication bias. The results for figure 5 are consistent with the fact that the Egger regression asymmetry test is said to suggest the presence of publication bias more frequently than the Begg rank correlation test, that is, the former test is more sensitive. For meta-analyses these tests lack power, but it is likely that for institutional comparisons there will be more data available than in many meta-analyses.

Regardless of covariates used for adjustment or stratification, case mix for institutional comparisons may still be problematic if selection criteria of cases for treatment (such as surgery) vary or if there are variations in referral patterns, treatment plans and so forth which may or may not be detectable through overdispersion or funnel plot asymmetry, depending on the size and frequency of the effect in question and the size of the sample being studied. If planning a comparative study, as with any formal investigation, consideration should also be given to the eligibility criteria for cases—for instance whether cases diagnosed elsewhere but referred on would be excluded, which would help increase between-institution homogeneity, but also possibly negate the value of the comparison in highlighting such differences.

The method described here requires relatively little programming to implement, and is capable of being incorporated into more sophisticated routines or batch files for routine analyses. The method can incorporate a routine default adjustment for overdispersion in the estimated hazard ratios where necessary.

Stata code to perform the calculations is available.

What is already known on this subject

  • Funnel plots are increasingly becoming a standard tool for comparing institutional performance and for comparisons of survival.

  • A summary measure based on the hazard ratio that reflects the whole survival experience is preferable.

What this study adds

  • This paper explains the theory of how centred hazard ratio estimates can be obtained from a Cox regression, with funnel plot control limits obtained from the log-rank test, with or without adjustment for overdispersion.

  • The method for obtaining robust (Winsorised) estimates of the overdispersion parameter is also explained and advice is given on stratification for covariates.

  • Stata code is given for practical implementation of the methods, and it is suggested that standard meta-analysis tools be used to assess asymmetry as an aid to interpretation of outliers in the funnel plot.

Appendix A Algebraic details

In order that the log hazard ratios sum to zero (corresponding to the null hazard ratio of 1), the mean of the regression coefficients must be subtracted from each one:

With k institutions, the shifted ith regression coefficient is given by:

Embedded Image

With βk (for the reference category) being zero.

This transformation can be written in matrix form as β* = Tβ where the k×k matrix T has the form:

Embedded Image

If the variance–covariance matrix of the untransformed regression coefficients under the null hypothesis is V0, then the transformed coefficients have covariance matrix given by:

Embedded Image

Where T′ is the transpose of T.

Under the null hypothesis, the covariance matrix of the test statistic has elements of the form:5

Embedded Image

on the leading diagonal,

with off-diagonal elements of the form:

Embedded Image

where the Nij, Nkj are the number of subjects alive at time j in groups i and k; dj is the number of deaths occurring at time j, and Nj is the total number of subjects alive at time j.

The subscript j runs from1 to t, reflecting each death time, while subscripts i, k label different institutions. The Cox model assumes there are no ties on death times, so that all the dj are equal to 1.

It can be shown that the covariance matrix for the regression coefficients is given by minus the inverse of the covariance matrix of the score statistic after deleting the row and column corresponding to the reference category (because the covariance matrix of the score statistic is singular). After inversion, the dropped row and column can be “added back” with entries of zero, because the regression coefficients for the reference category—being a constant (zero)—have variances equal to zero also. After this the covariance matrix of the regression coefficients (under H0) can be transformed as described above.


View Abstract

Supplementary materials


  • All views expressed are personal and do not necessarily reflect Registry policy.

  • Competing interests None declared.

  • All views expressed are personal and do not necessarily reflect Registry policy.

  • Provenance and Peer review Not commissioned; externally peer reviewed.

Request Permissions

If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.