## Abstract

**Background** In non-randomised evaluations of public-health interventions, statistical methods to control confounding will usually be required. We review approaches to the control of confounding and discuss issues in drawing causal inference from these studies.

**Methods** Non-systematic review of literature and mathematical data-simulation.

**Results** Standard stratification and regression techniques will often be appropriate, but propensity scores may be useful where many confounders need to be controlled, and data are limited. All these techniques require that key putative confounders are measured accurately. Instrumental variables offer, in theory, a solution to the problem of unknown or unmeasured confounders, but identifying an instrument which meets the required conditions will often be challenging. Obtaining measurements of the outcome variable in both intervention and control groups before the intervention is introduced allows balance to be assessed, and these data may be used to help control confounding. However, imbalance in outcome measures at baseline poses challenges for the analysis and interpretation of the evaluation, highlighting the value of adopting a design strategy that maximises the likelihood of achieving balance. Finally, when it is not possible to have any concurrent control group, making multiple measures of outcome pre- and postintervention can enable the estimation of intervention effects with appropriate statistical models.

**Conclusion** For non-randomised designs, careful statistical analysis can help reduce bias by confounding in estimating intervention effects. However, investigators must report their methods thoroughly and be conscious and critical of the assumptions they must make whenever they adopt these designs.

- Evaluation studies
- statistics
- confounding
- propensity scores
- instrumental variables
- public health
- randomised trials

## Introduction

In our previous paper, we discussed barriers to randomised controlled trials (RCTs) of public-health interventions and suggested alternative design strategies.1 In this paper, we discuss the statistical analysis of data from non-randomised evaluations, dealing particularly with *confounding*. We outline key issues and options available when planning analyses of these data. In practice, the most appropriate statistical approach will differ from case to case, but transparent description of the design and analysis process is essential,2 3 as for RCTs.4

## Causal effects and confounding in non-randomised evaluations of public-health interventions

Evaluations of public-health interventions aim to estimate the *causal effect* of an intervention, by which we mean a quantitative measure of the difference between the level of the outcome had everybody in the population of interest been exposed and the level of the outcome had everybody been unexposed. Readers are referred to Hernan5 for a fuller discussion of causal effects in epidemiology. Our emphasis on public-health interventions also implies we are most concerned with estimating overall (population-level) effects of intervention strategies combining direct and indirect effects.6

RCTs aim to achieve *balance* between treatment and non-treatment groups, meaning that these groups are alike with regard to all factors that might influence the outcome measure (both known and unknown), other than exposure to the intervention of interest. RCTs achieve balance by using chance to determine which people or units receive an intervention, and by applying this process many times over (ie, allocating many units to the two groups).7

Most non-RCT designs also seek to achieve balance.1 Nevertheless, *imbalance* is more likely with these designs, and consequently they risk generating a *confounded* effect estimate: one that mixes the effect of the intervention with other causal effects.8

Consider a study to evaluate the effect of a radio soap-opera designed to encourage contraceptive use in Nepal using data collected from over 8000 women in a cross-sectional survey (see box 1 in our previous paper).1 9 The authors compared contraceptive use among women who reported listening to the programme in the previous 6 months with that among women who did not. An *unadjusted* analysis found that the prevalence of contraceptive use was 12% higher among listeners than in non-listeners (table 1). However, it would be premature to conclude that this reflects an intervention effect. While this study suffered from many potential biases (as do most evaluations) we highlight here the issue of *confounding*. For example, level of education might differ between those who did and did not listen to the soap-opera, and might also, independently, influence contraceptive use (figure 1). Of course, educational level is only one of many potential confounders, and a more complex causal diagram than figure 1 could be drawn to help identify what variables should be controlled.10

### Statistical methods for controlling confounding

#### Stratification and regression

In the traditional approach to controlling confounding, women would be grouped (‘stratified’) according to their educational level—for example ‘none’, ‘attended primary school’ and ‘completed primary school.’ The association between intervention and outcome is examined within each group. Any association within groups cannot be due to confounding by educational status because women in each group have the same level of education, assuming this is correctly measured. If the intervention effect is approximately the same in all subgroups, a weighted average of the stratum-specific estimates provides an adjusted effect estimate free of confounding by that variable.
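As a minimal sketch of this weighted-average idea, the Python below compares a crude risk difference with a stratified one. The counts are hypothetical (they are not taken from the Nepal study), the function names are illustrative, and the Mantel-Haenszel-type weights are one common choice rather than the only one.

```python
# Hypothetical counts, NOT data from the Nepal study.
# Each stratum: (users among listeners, listeners,
#                users among non-listeners, non-listeners)
STRATA = {
    "none": (30, 100, 80, 400),
    "attended primary school": (80, 200, 60, 200),
    "completed primary school": (150, 300, 40, 100),
}


def risk_difference(a, n1, c, n0):
    """Prevalence among the exposed minus prevalence among the unexposed."""
    return a / n1 - c / n0


def crude_risk_difference(strata):
    """Ignore education: pool all strata before comparing the two groups."""
    a = sum(s[0] for s in strata.values())
    n1 = sum(s[1] for s in strata.values())
    c = sum(s[2] for s in strata.values())
    n0 = sum(s[3] for s in strata.values())
    return risk_difference(a, n1, c, n0)


def adjusted_risk_difference(strata):
    """Weighted average of the stratum-specific risk differences.

    Weights are n1 * n0 / (n1 + n0), a Mantel-Haenszel-type choice
    (an assumption of this sketch).
    """
    numerator = denominator = 0.0
    for a, n1, c, n0 in strata.values():
        weight = n1 * n0 / (n1 + n0)
        numerator += weight * risk_difference(a, n1, c, n0)
        denominator += weight
    return numerator / denominator


crude = crude_risk_difference(STRATA)
adjusted = adjusted_risk_difference(STRATA)
```

With these made-up counts, the crude difference is about 18 percentage points, whereas every education stratum shows a difference of 10 points, so the adjusted estimate is 10 points. The gap between the two is the part of the crude association attributable to education in this illustration.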

Regression modelling can include multiple confounding factors as explanatory variables,11 and was used in the Nepal study to control 11 potential confounders, assuming no effect modification. This *adjustment* reduced the estimated effect from +12% to +6% (table 1), consistent with the *unadjusted* estimate being partly confounded (overestimating the true effect). However, this adjusted estimate is unconfounded only if all important confounders are identified and measured accurately.12 13 Since we can rarely, if ever, verify this, residual confounding remains a concern.

#### Propensity scores

Propensity scores are another, more recent approach to controlling confounding.14 The following steps are taken:

1. A regression model is used to identify factors ‘predicting’ exposure to the intervention. The model is used to calculate each individual's predicted probability of exposure to the intervention (eg, ‘listened to the soap opera’), rather than their actual exposure.
2. Individuals with similar propensity scores are grouped. Within each group, some individuals will actually have been exposed to the intervention and some not. Since individuals in each group had the same propensity to be exposed, the method assumes that actual exposure within these groups was random.
3. Stratified analysis can compare outcomes between exposed and unexposed individuals within each propensity-score group. By including the propensity-score group in the regression analysis, it is possible to obtain an unconfounded estimate of the intervention effect. Alternatively, each exposed individual may be matched with an unexposed individual with the same or similar propensity score and the analysis restricted to these pairs.
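The three steps above can be sketched in a few lines of Python. This is an illustration under strong simplifying assumptions: the counts are hypothetical, there are only two categorical covariates, and the exposure model is 'saturated', so each predicted probability is simply the observed proportion exposed within a covariate pattern. A real analysis would more usually fit a logistic regression to individual-level data.

```python
from collections import defaultdict

# Hypothetical covariate patterns, NOT data from the Nepal study.
# (education, residence, exposed, unexposed,
#  users among exposed, users among unexposed)
CELLS = [
    ("none", "rural", 200, 800, 32, 80),
    ("none", "urban", 100, 400, 26, 80),
    ("attended primary", "rural", 500, 500, 130, 100),
    ("attended primary", "urban", 300, 300, 108, 90),
    ("completed primary", "rural", 400, 100, 144, 30),
    ("completed primary", "urban", 800, 200, 368, 80),
]


def propensity(cell):
    """Step 1: predicted probability of exposure for one covariate pattern."""
    _, _, n1, n0, _, _ = cell
    return n1 / (n1 + n0)


def group_by_propensity(cells, digits=1):
    """Step 2: pool patterns whose propensity scores round to the same value."""
    groups = defaultdict(lambda: [0, 0, 0, 0])  # n1, n0, a, c
    for cell in cells:
        key = round(propensity(cell), digits)
        for i, value in enumerate(cell[2:]):
            groups[key][i] += value
    return dict(groups)


def propensity_adjusted_risk_difference(cells):
    """Step 3: risk difference within each group, averaged by group size."""
    groups = group_by_propensity(cells)
    total = sum(n1 + n0 for n1, n0, _, _ in groups.values())
    return sum(
        (n1 + n0) / total * (a / n1 - c / n0)
        for n1, n0, a, c in groups.values()
    )


def crude_risk_difference(cells):
    """Comparison that ignores the covariates altogether."""
    n1 = sum(c[2] for c in cells)
    n0 = sum(c[3] for c in cells)
    a = sum(c[4] for c in cells)
    c_events = sum(c[5] for c in cells)
    return a / n1 - c_events / n0


crude = crude_risk_difference(CELLS)
adjusted = propensity_adjusted_risk_difference(CELLS)
```

In this constructed example, six covariate patterns collapse into three propensity-score groups. The crude difference is about 15 percentage points and the propensity-adjusted difference is 6 points, which is the effect built into the counts.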

When used in the Nepal study, this approach yielded an effect-size estimate of a 9% increase in contraceptive use. Unfortunately, no details were provided of the variables used to estimate the propensity scores; this choice may affect the estimate obtained. Future evaluations adopting this approach should provide these details.2 3

Propensity scores are increasingly popular.15 One advantage is that their use can reduce the number of parameters (ie, variables and categories within these) to be estimated in a regression model. When the number of parameters is large relative to the data available, estimates can become biased and CIs unreliable. In the current example, using propensity scores rather than standard regression reduced the parameters from 25 to 8. In this case, a model with 25 parameters should not have introduced bias,16 since over 2500 women reported using contraceptives. However, in smaller evaluations or those focused on rare outcomes, propensity-score approaches can reduce such problems,14 17 and may be more robust than standard regression when the number of events-per-confounder is small (<8).18

Problems remain, however. In practice, comparisons between regression and propensity-score methods suggest they usually yield similar results.15 19 20 Like standard regression methods, a propensity-score analysis faces the problem of unmeasured or poorly measured variables, since all important predictors of exposure that are causally associated with the outcome must be included. Furthermore, using statistical methods to identify ‘predictors’ of exposure without considering underlying causal relationships might be problematic. Investigators should not include ‘predictors’ which are in fact consequences of exposure, since this will lead to ‘over-adjusted’ models and biased effect estimates.10 21 Alternatively, if multiple predictors of exposure that are not causally associated with the outcome are included, then power may be unnecessarily sacrificed.22

#### Instrumental variables

Instrumental variables are also increasingly popular, purportedly removing the need to identify and measure all potential confounders.23 24 This approach requires an ‘instrument’ that meets the following conditions (figure 2):

1. it is a cause, or a proxy for a cause, of exposure to the intervention;
2. it is not a cause of the outcome other than through the intervention; and
3. it is not associated with any unmeasured confounders of concern in the study population.

Identifying an instrument that satisfies these conditions allows us to generate an unconfounded estimate of the intervention effect (‘effect A’ in figure 2), by comparing the magnitude of the association between the instrument and the outcome (‘effect C’) with that between the instrument and exposure to the intervention (‘effect B’). In the case of the Nepal example, the intervention effect is estimated by dividing the effect (risk difference) of the instrument on the outcome by the effect (risk difference) of the instrument on the exposure. The precise manner in which estimates are calculated differs depending on the situation.23 25
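The ratio described above can be written out as a short Python sketch. The counts are hypothetical and chosen for easy arithmetic (they are not the Nepal data), and the function name and arguments are illustrative. The calculation is only meaningful if the three conditions for a valid instrument are assumed to hold.

```python
def iv_ratio_estimate(outcome_z1, n_z1, outcome_z0, n_z0,
                      exposed_z1, exposed_z0):
    """Ratio ('Wald-type') instrumental-variable estimate on the risk-difference scale.

    z1 / z0 denote the groups with and without the instrument
    (eg, women who do / do not listen to the radio weekly).
    """
    # Effect C: risk difference of the instrument on the outcome.
    effect_c = outcome_z1 / n_z1 - outcome_z0 / n_z0
    # Effect B: risk difference of the instrument on exposure.
    effect_b = exposed_z1 / n_z1 - exposed_z0 / n_z0
    if effect_b == 0:
        raise ValueError("instrument is unrelated to exposure; ratio undefined")
    # Effect A: estimated effect of the intervention on the outcome.
    return effect_c / effect_b


# Hypothetical counts: 4000 women in each instrument group.
estimate = iv_ratio_estimate(
    outcome_z1=1440, n_z1=4000,   # 36% use contraception
    outcome_z0=1280, n_z0=4000,   # 32% use contraception
    exposed_z1=2000,              # 50% exposed to the intervention
    exposed_z0=400,               # 10% exposed to the intervention
)
```

Here effect C is 4 percentage points and effect B is 40 points, giving an estimated intervention effect of 10 points. Because effect B sits in the denominator, an instrument that is only weakly related to exposure magnifies any sampling error or bias in effect C.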

Instrumental-variable analysis may provide different estimates of treatment effect to standard or propensity-score methods.20 In the Nepal example, ‘listens to radio weekly’ was used as an instrument (see figure 2). This analysis suggested an 8.5% increase in contraceptive use associated with the intervention. However, we must question whether conditions 2 and 3 were met. First, listening to the radio weekly (including other programmes) could conceivably have a direct effect on contraceptive practices (violating 2). Second, important unmeasured confounders (eg, social and cultural factors) might also be associated with this instrument (violating 3). This illustrates the difficulty in identifying an instrument. Furthermore, we can never empirically verify the conditions required for a valid instrument. Consequently, ‘the fundamental problem of causal inference from observational data—the reliance on assumptions that cannot be empirically verified—is not solved but simply shifted to another realm’.23 Finally, even if all three conditions above are met, if the correlation between the instrument and the intervention is not strong (the instrument is ‘weak’), the standard error of the intervention effect will be large, and the CI for the effect will be wide.26

#### Using preintervention measures of the outcome variable in analysis

In both randomised and non-randomised evaluations, obtaining measures of the outcome variable prior to the introduction of the intervention can be useful to explore *balance*. These data can also be used to control potential confounding from multiple factors other than the intervention. Such data can be used in two ways:

1. Treat them in the same way as other potential confounders, and fit a regression model in which the preintervention measure of the outcome is included alongside other potential confounders.
2. Calculate the change in the outcome, and base the analysis on the difference in the changes in the two groups.

If intervention and control groups are similar with respect to baseline measures of the outcome (ie, there is *balance*), or where small differences arise by chance (as could also happen in an RCT), both approaches provide unbiased estimates of the intervention effect, but regression will provide a more precise estimate and is therefore preferred.27

However, when non-random allocation results in two groups which are drawn from two different populations, and hence are unbalanced at baseline, the two approaches can give contradictory results. This has been described as ‘Lord's paradox’ and was first identified in the context of individual-level data.27 28 29 To illustrate this paradox with an example relevant to our purposes, consider a hypothetical cluster-allocated, non-randomised study of an intervention aiming to lower mean systolic blood pressure (MSBP) among individuals in workplaces by influencing exercise and smoking (100 intervention, 100 control sites). Intervention allocation was non-random: stakeholder meetings determined which workplaces received the intervention.1 This resulted in a systematic imbalance in MSBP between the two arms at baseline (intervention 120 mm Hg, control 115 mm Hg). Following intervention, measures of MSBP were taken from individuals in all workplaces. MSBP did not change between baseline and follow-up in either the control or intervention workplaces.

In this situation, because blood pressure differed between the groups (was *unbalanced*) to begin with, regression analysis can incorrectly suggest that there was an effect of the intervention in some situations, while an analysis of change in the outcome does not.27 28 29 Here we offer one suggestion for how this paradox might occur and reflect on guidance for evaluators in this situation. Figure 3 shows the results of two simulations of the experiment with preintervention MSBP plotted against MSBP postintervention for all intervention and control workplaces. In figure 3A, we assume that MSBP was measured ‘perfectly.’ Over 1000 simulations, there was no evidence for a difference between the groups from either regression analysis or analysis of changes. However, when we allowed ‘noise’ in the baseline measurements of MSBP (figure 3B), there was evidence of a difference between the groups in regression analysis but not change-scores. Our simulation assumed that each individual had the same true underlying MSBP at both baseline and follow-up, that is, we assume no real change in either group, but that at each time-point, there was independent random measurement error. Our simple simulation suggests that one possible explanation for paradoxical findings when comparing an analysis of changes with a regression analysis is that ‘noise’ results in dilution of the association between pre- and postintervention measures, as identified by linear regression, known as regression-dilution bias.30 This phenomenon reduces the gradient of the regression lines in figure 3B compared with figure 3A. The regression lines for each group are consequently shallower, but are also centred on different means (because of the baseline imbalance) resulting in a vertical gap between the two lines. In regression analysis, this gap is equivalent to the estimated parameter for intervention effect, and might then be incorrectly interpreted in this way.
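A simulation in the spirit of the one described above can be sketched in Python. This is not the authors' code. It works at workplace level, and the between-site SD, the noise SD, the number of repetitions and the seed are all assumptions made for illustration. The regression-adjusted difference is computed as the group coefficient from a model with a common slope on the baseline measure.

```python
import random
from statistics import fmean


def simulate_once(rng, noise_sd, n_sites=100, between_site_sd=5.0):
    """One simulated study with no true change in either arm.

    Returns (change-score difference, regression-adjusted difference).
    """
    arms = []
    for true_mean in (115.0, 120.0):  # control, intervention
        pre, post = [], []
        for _ in range(n_sites):
            true_msbp = rng.gauss(true_mean, between_site_sd)
            # Same underlying value at both times; independent noise each time.
            pre.append(true_msbp + rng.gauss(0.0, noise_sd))
            post.append(true_msbp + rng.gauss(0.0, noise_sd))
        arms.append((pre, post))
    (pre_c, post_c), (pre_i, post_i) = arms

    # Analysis of changes: difference between arms in mean change.
    change_diff = (fmean(post_i) - fmean(pre_i)) - (fmean(post_c) - fmean(pre_c))

    # Regression with a common slope: pooled within-arm slope of post on pre.
    sxy = sxx = 0.0
    for pre, post in arms:
        mx, my = fmean(pre), fmean(post)
        sxy += sum((x - mx) * (y - my) for x, y in zip(pre, post))
        sxx += sum((x - mx) * (x - mx) for x in pre)
    slope = sxy / sxx
    regression_diff = (fmean(post_i) - fmean(post_c)) - slope * (
        fmean(pre_i) - fmean(pre_c)
    )
    return change_diff, regression_diff


def average_over_simulations(noise_sd, n_sims=200, seed=1):
    """Average both estimates over repeated simulated studies."""
    rng = random.Random(seed)
    results = [simulate_once(rng, noise_sd) for _ in range(n_sims)]
    return fmean(r[0] for r in results), fmean(r[1] for r in results)


perfect = average_over_simulations(noise_sd=0.0)
noisy = average_over_simulations(noise_sd=5.0)
```

With perfect measurement both analyses return zero. With noise, the within-arm slope falls below 1, so the regression subtracts only part of the 5 mm Hg baseline gap. Under the assumed SDs the slope should be close to 0.5, leaving a spurious regression-adjusted difference of roughly 2.5 mm Hg while the change-score difference stays near zero.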

Further research is necessary to characterise the statistical properties associated with this phenomenon; we have offered only a simplified illustration. Such work is necessary because this situation might plausibly arise in non-randomised evaluations of public-health interventions. For example, we identified an ongoing evaluation of the impact of introducing youth-centres and ‘adolescent-friendly’ clinics on HIV prevalence among 15–24-year-olds in South-African communities. The centres were purposively placed in disadvantaged areas for strategic purposes.31 The evaluation design aims to compare future HIV prevalence through surveys in 11 youth-centres, 11 clinics and 11 control sites, but baseline HIV prevalence was higher in areas where youth centres were placed (15.7%) than controls (13.8%: adjusted OR=1.41 95% CI 0.96 to 2.07).31 It is inevitable that there is imprecision in these site-specific estimates, since they are based on a sample of the population. Given this imbalance at baseline, a future regression analysis using data from a new sample postintervention is likely to be biased by the regression-dilution bias illustrated above. However, for change-scores to produce a valid estimate of the intervention effect, we must assume that the intervention has a constant additive effect regardless of the level at baseline on whatever scale the analysis is performed (for example, log (odds) in the case of logistic regression). The key message of Lord's paradox is, therefore, that when non-randomised intervention and control groups are unbalanced at baseline, any attempt at causal inference will be fraught with difficulty.27 29

#### Imbalance and inference when the number of study units is small

Public-health interventions are often delivered to ‘clusters’ of people, and for practical reasons the number of clusters included in an evaluation is sometimes small.1 Cluster allocation must be appropriately taken into account in analyses, and this is relevant to both randomised and non-randomised designs.32 Low statistical power in such studies, including those where large numbers of individuals but only a small number of clusters are enrolled, is a major barrier to statistical inference. We do not seek to review this issue here other than to identify that allocation of a small number of units may also be a reason why imbalance might arise in non-randomised evaluations and thus indicate the need for control of confounding. However, many relevant statistical methods require additional assumptions when cluster numbers are low. We thus reiterate the more general point that while in-depth studies of interventions delivered in a few study units can provide valuable information on process and provide some forms of evidence to inform public-health decision-making,33 they are highly constrained in their capacity to provide quantitative estimates of intervention effect.

#### Time-series studies

When it is not possible to recruit a concurrent comparison group, it may instead be possible to compare each unit pre- and postintervention. However, once again, a ‘fair comparison’ should be made. The before/after approach, and more sophisticated variants of this in which multiple measures of outcome are made over time, controls for sources of confounding that are static over time but not time-varying factors such as maturational, seasonal or secular trends.

‘Interrupted time-series analysis’ requires data on multiple measures of the outcome pre- and postintervention.34 The following steps are taken:

1. The extent of variation in the outcome over time due to factors other than the intervention (eg, seasonal trends)35 is estimated statistically.
2. A statistical model is used to predict the ‘expected’ outcome at the end of the intervention period had the intervention not been delivered.
3. This ‘expected value’ is compared with the ‘observed’ postintervention level to determine the intervention ‘effect.’
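The steps above can be sketched in Python using a deliberately simple model: a linear trend plus a separate intercept for each calendar month, fitted by least squares to the preintervention data only. The monthly series is hypothetical and noise-free, with a 30% reduction built in after the intervention. A real analysis would also need to handle random variation, uncertainty and autocorrelation.

```python
# Hypothetical seasonal offsets for the 12 calendar months (sum to zero).
SEASONAL = [8, 6, 3, 0, -3, -6, -8, -6, -3, 0, 3, 6]


def underlying_rate(t):
    """Made-up monthly rate without intervention: declining trend plus seasonality."""
    return 100.0 - 0.2 * t + SEASONAL[t % 12]


# 48 preintervention months, then 24 postintervention months at 70% of expected.
PRE = [(t, underlying_rate(t)) for t in range(48)]
POST = [(t, 0.7 * underlying_rate(t)) for t in range(48, 72)]


def fit_trend_with_seasonality(series):
    """Least-squares fit of y = intercept[month] + slope * t.

    The slope is estimated from within-calendar-month variation, which is
    equivalent to including month indicators in the regression. Needs at
    least two observations for some calendar month.
    """
    by_month = {}
    for t, y in series:
        by_month.setdefault(t % 12, []).append((t, y))
    sxy = sxx = 0.0
    means = {}
    for month, points in by_month.items():
        t_bar = sum(t for t, _ in points) / len(points)
        y_bar = sum(y for _, y in points) / len(points)
        means[month] = (t_bar, y_bar)
        sxy += sum((t - t_bar) * (y - y_bar) for t, y in points)
        sxx += sum((t - t_bar) * (t - t_bar) for t, _ in points)
    slope = sxy / sxx
    intercepts = {m: y_bar - slope * t_bar for m, (t_bar, y_bar) in means.items()}
    return slope, intercepts


def expected_value(t, slope, intercepts):
    """Predicted outcome at month t had the preintervention pattern continued."""
    return intercepts[t % 12] + slope * t


slope, intercepts = fit_trend_with_seasonality(PRE)
final_t, observed = POST[-1]
expected = expected_value(final_t, slope, intercepts)
relative_reduction = 1 - observed / expected
```

Because the made-up series has no noise, the fitted slope recovers the built-in trend and the relative reduction at the final month equals the built-in 30%.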

For example, the effect of introducing pneumococcal conjugate vaccine (PCV-7) for infants in the USA was evaluated by examining monthly pneumonia admissions (see previous paper, box 1).1 36 There was significant seasonal variation in trends in admission (figure 4). An expected admission rate at the end of the intervention period was obtained by extrapolating available trend data after modelling seasonal fluctuations. The analysis found that the seasonally corrected admission rate by December 2004 was 39% lower than the expected rate (95% CI 22% to 52%), providing an estimate of the effect of PCV-7 (figure 4A).

Interrupted time-series analysis provides better estimates than simple before/after studies as long as putative time-varying confounding factors are measured and modelled.37 Acute effects of rapid introduction of an intervention are generally easiest to differentiate from other sources of variation in time-series analyses.34 Difficulties arise if an intervention is implemented gradually or has a long latent interval before exerting an effect (eg, the effect of antismoking campaigns on lung-cancer rates). A challenge also arises in deciding how complicated a trend to allow for in estimating expected values postintervention. Non-linear trends can be modelled and provide better control of confounding, but require data for many time-points and may result in less-precise effect estimates. Furthermore, even after modelling trends and allowing for seasonality, there may be autocorrelation between outcome levels at adjacent time-points.38 This autocorrelation can lead to an overestimation of the precision of intervention effects. For continuous outcomes, there are techniques available to take account of this; for counts, a little ingenuity is required.39
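As a rough illustration of why autocorrelation matters for precision, the Python sketch below computes the lag-1 autocorrelation of model residuals and a commonly used large-sample approximation to the 'effective' number of independent observations. The approximation assumes a first-order autoregressive process, and the function names and example residuals are hypothetical.

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and the one before it."""
    n = len(residuals)
    mean = sum(residuals) / n
    centred = [r - mean for r in residuals]
    denominator = sum(c * c for c in centred)
    numerator = sum(centred[i] * centred[i - 1] for i in range(1, n))
    return numerator / denominator


def effective_sample_size(n, rho):
    """Approximate number of independent observations for a mean,
    assuming first-order autoregressive residuals with lag-1 correlation rho."""
    return n * (1 - rho) / (1 + rho)


# Hypothetical residuals that stay above or below zero for three months at a time.
runs = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1, -1, -1]
rho = lag1_autocorrelation(runs)
```

Under this approximation, 60 monthly observations with a lag-1 autocorrelation of 0.5 carry about as much information about a mean level as 20 independent observations, so standard errors that ignore the autocorrelation would be too small.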

## Conclusion

Non-randomised evaluations are essential to inform public-health decision-making where there are clear barriers to the conduct of RCTs. Over two papers, we have discussed design and analysis choices in order to ensure a ‘fair comparison’ is made. Confounding, however, remains a major concern in these studies, and investigators will face more complex problems even than those we have discussed here such as dealing with covariates that change over time.40 Evaluators and analysts have various options to consider but must make careful, informed choices that fit their context. We hope to have aided these choices. Most importantly, as we and others have stressed, investigators should transparently outline the steps taken in design and analysis so that others can judge the value of the estimates produced.

## Acknowledgments

We would like to thank D Elbourne, for her contribution to the development of this paper, and C Grijalva and Elsevier, for sharing the figures reproduced in figure 4. We would also like to thank those who attended a symposium on evaluating public health interventions convened by the London School of Hygiene and Tropical Medicine on 6 November 2006 for contributing insights and thus informing the development of this paper.

## Footnotes

**See Commentary, p 596**

Funding JH was supported by an MRC/ESRC interdisciplinary postdoctoral fellowship.

Competing interests None.

Provenance and peer review Not commissioned; externally peer reviewed.
