Criteria for evaluating evidence on public health interventions
- 1Effective Healthcare Australia, School of Population Health and Health Services Research, University of Sydney, Australia
- 2Department of Community Health Sciences, University of Calgary, Canada and School of Public Health, LaTrobe University, Australia
- Correspondence to: Lucie Rychetnik, Effective Healthcare Australia, Victor Coppleson Building, DO2, University of Sydney, NSW 2006, Australia
- Accepted 30 July 2001
Public health interventions tend to be complex, programmatic, and context dependent. The evidence for their effectiveness must be sufficiently comprehensive to encompass that complexity. This paper asks whether and to what extent evaluative research on public health interventions can be adequately appraised by applying well established criteria for judging the quality of evidence in clinical practice. It is adduced that these criteria are useful in evaluating some aspects of evidence. However, there are other important aspects of evidence on public health interventions that are not covered by the established criteria. The evaluation of evidence must distinguish between the fidelity of the evaluation process in detecting the success or failure of an intervention, and the success or failure of the intervention itself. Moreover, if an intervention is unsuccessful, the evidence should help to determine whether the intervention was inherently faulty (that is, failure of intervention concept or theory), or just badly delivered (failure of implementation). Furthermore, proper interpretation of the evidence depends upon the availability of descriptive information on the intervention and its context, so that the transferability of the evidence can be determined. Study design alone is an inadequate marker of evidence quality in public health intervention evaluation.
Appraisal of evaluative research used in evidence-based health care centres on three major questions. Firstly, is the research good enough to support a decision on whether or not to implement an intervention? Secondly, what are the research outcomes? Thirdly, is the research transferable to the potential recipients of the intervention (individuals or populations)?1
In this paper we ask whether (or to what extent) evaluative research on public health interventions can be adequately appraised by applying well established criteria for appraising evidence about prevention and treatment in clinical practice.2–5 We adduce that these criteria are very useful in evaluating some important aspects of evidence. However, there are other important aspects of evidence relevant to public health interventions that are not covered by the established criteria. We draw attention to these additional aspects of evidence and explain their importance in the assessment of public health interventions. We emphasise the distinction between the appraisal of evidence and the process of making policy or operational decisions on the implementation of interventions. Research-based evidence is only one of several factors to be taken into account in these decisions.
Public health interventions tend to be complex, programmatic, and context dependent. The evidence for their effectiveness must be sufficiently comprehensive to encompass that complexity. The evaluation of evidence must distinguish between the fidelity of the evaluation process in detecting the success or failure of an intervention, and the relative success or failure of the intervention itself. Moreover, if an intervention is unsuccessful, the evidence should help to determine whether the intervention was inherently faulty (that is, failure of intervention concept or theory), or badly delivered (failure of implementation).6 Furthermore, proper interpretation of the evidence depends upon the availability of adequate descriptive information on the intervention and its context, so that the transferability of the evidence can be determined.
To fulfil these requirements, we suggest an expansion of the criteria that are used in clinical medicine for appraising research. We draw on evidence-evaluation schema that were developed for epidemiological and qualitative research, health promotion programme evaluations and health economic evaluations.
For the purposes of this paper, an intervention is defined as a set of actions with a coherent objective to bring about change or produce identifiable outcomes. These actions may include policy, regulatory initiatives, single strategy projects or multi-component programmes. Public health interventions are intended to promote or protect health or prevent ill health in communities or populations. They are distinguished from clinical interventions, which are intended to prevent or treat illness in individuals. Context refers to the social, political and/or organisational setting in which an intervention was evaluated, or in which it is to be implemented. The contextual characteristics that are relevant vary with the type of intervention. Important contextual characteristics for a public health intervention might include factors in the political and organisational environment and socioeconomic or demographic features of the population.
Evaluation is a process of determining the value or worth of something by judging it against explicit, predetermined standards.7 Evidence comprises the interpretation of empirical data derived from formal research or systematic investigations, using any type of science or social science methods. This definition of evidence is purposefully circumscribed to articulate the scope of this paper. In our consideration of evidence-based practice, we focus on evidence about likely consequences of interventions, such as effectiveness and cost effectiveness, not evidence about need for services. Thus, we distinguish between data on the cause or scale of a health problem (aetiological studies and needs assessment) and evidence on the implementation and outcomes of interventions. This paper deals with the latter.
Until recently public health epidemiology was chiefly concerned with aetiological hypotheses, rather than evaluative hypotheses. Intervention evaluation has its origins in the social sciences, notably education and psychology.8,9 To strengthen the criteria for appraising evaluative research in public health we have drawn upon a broad-based literature beyond the fields of epidemiology and evidence-based medicine. We acknowledge, however, the limitation of relying on the English language literature.
IS THE RESEARCH GOOD ENOUGH?
If the research is good enough, it will confirm and quantify the causal relation between the intervention and its effects where such a relation exists. Good research will also help us to understand why an intervention appears to be effective or ineffective.
Levels of evidence and causality
The assessment of causality for evidence-based health care has mostly depended upon the level of evidence, which traditionally has been defined by the study design used in evaluative research. Study designs are graded by their potential to eliminate bias. A hierarchy of study designs was first suggested by Campbell and Stanley in 1963,10 and levels of evidence based on study design were proposed by Fletcher and Sackett for the Canadian Taskforce on the Periodic Health Examination in 1979.11 Systematic reviews of randomised controlled trials (RCTs) have become widely accepted as providing the best evidence (level 1) on the effects of preventive, therapeutic, rehabilitative, educational or administrative interventions in medicine.12 The concept of levels of evidence has been widely adopted to determine the grade of recommendations for clinical practice, for example, in the recommendations of the US Preventive Services Task Force and the Canadian Task Force on the Periodic Health Examination.2,3 Levels of evidence have also been applied to other areas of evidence-based decision making in health, including prognosis, diagnosis and economic analysis.13
We have collated examples of existing guides for appraising evidence (table 1). Most of these guides are designed to help the user in assessing the factors that determine the existence and strength of a causal relation.
The guides differ in their scope. We have grouped them in table 1 according to their overall aim (left column), adding a summary of how each works and listing some of the criteria used (right column). The examples include critical appraisal checklists for quantitative studies of intervention effectiveness; guides on evaluating reviews or clinical guidelines and rules of evidence to formulate graded recommendations for action. It is standard practice in these guides to define the level of evidence in terms of the study design and to treat this as the primary determinant of credibility. Also included in table 1 are generic guides for determining causal inference in epidemiological research, which encompass criteria that can be applied to appraise causal relations in evaluation research.
Levels of evidence and public health interventions
The assessment of causality for public health interventions has also mostly depended upon the level of evidence.27–29 However, there is persisting controversy about the reliance on the study design as the main criterion of the credibility of evidence. The debate concentrates on the primacy of the RCT for evaluating public health interventions, with respect to (a) the difficulty of conducting RCTs for complex programmatic interventions, (b) the difficulty of interpreting their results, and (c) the tendency to downgrade the contribution of observational studies.
(a) RCTs and complex interventions
Many public health interventions require multiple, flexible and community driven strategies.30–32 RCTs have been described as unable to accommodate the complexity and flexibility that characterise such programmes. They are perceived as being feasible only for evaluating relatively simple, standardised and unvarying interventions and thus as being too rigid and inappropriate for public health settings.33,34
Such criticisms of the RCT are based on a consideration of “classic” RCTs in which the intervention is standardised and the individual is the unit of randomisation. Cluster trials can accommodate communities, schools or other “clusters” as the unit of analysis, and RCTs can cope with non-standard interventions; points that seem to be lost by some trial critics.34 RCTs have a long history of successful application in evaluating the effectiveness of social interventions.35 Given the strength of this study design, the use of a non-randomised study in settings where RCTs would have been feasible represents a lost opportunity.36 Our concern is that evaluators around the world may move away from favouring RCTs in public health for what we see as the wrong reasons; that is, a mistaken belief that experimental designs are only useful for evaluating standard, simple interventions aimed at individuals.
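To make the notion of a cluster as the unit of randomisation concrete, the following minimal sketch allocates whole schools, rather than individual pupils, to the trial arms. The function name, the school and pupil labels and the seed are hypothetical and serve only as illustration; a real trial would normally use stratified or restricted allocation and concealment procedures that are not shown here.

```python
import random
from collections import Counter


def randomise_clusters(cluster_ids, seed=0):
    """Allocate whole clusters (not individuals) to two arms.

    Illustrative sketch only: simple 1:1 allocation with no stratification.
    """
    rng = random.Random(seed)          # fixed seed so the allocation is reproducible
    clusters = sorted(set(cluster_ids))
    rng.shuffle(clusters)
    half = len(clusters) // 2
    return {
        cluster: ("intervention" if position < half else "control")
        for position, cluster in enumerate(clusters)
    }


# Hypothetical pupils, each belonging to one school (the cluster).
pupils = [
    ("pupil_1", "school_A"), ("pupil_2", "school_A"),
    ("pupil_3", "school_B"), ("pupil_4", "school_B"),
    ("pupil_5", "school_C"), ("pupil_6", "school_C"),
    ("pupil_7", "school_D"), ("pupil_8", "school_D"),
]

allocation = randomise_clusters([school for _, school in pupils], seed=42)

# Every pupil inherits the arm of his or her school.
arm_by_pupil = {pupil: allocation[school] for pupil, school in pupils}
arm_counts = Counter(allocation.values())
```

Because allocation is made at the level of the school, all pupils in one school receive the same arm, which is what allows a community-wide or school-wide programme to be delivered as intended.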
We reaffirm that a well conducted RCT is the best (albeit sometimes impractical) study design for determining a causal relation between an intervention and its putative outcomes. However, study design alone cannot suffice as the main criterion for the credibility of evidence about public health interventions.
(b) Interpretation of study results
Deficient a priori criteria for the adequacy of evidence on public health interventions have led to disagreements about interpretation of results, particularly negative findings.37,38 Some current appraisals of evidence do not assist in making a distinction between failure to demonstrate underlying effectiveness and good evidence of ineffectiveness.
Negative findings warrant careful exploration. Has the research failed to find an effect where one exists (evaluation failure)? Or is there truly no effect (programme failure)?6 In the event of programme failure, is the failure attributable to an inherent inadequacy in the intervention (that is, a failure of intervention theory), or attributable to poor implementation? The authors of some systematic reviews have acknowledged that crucial factors such as the stability of the programme being evaluated, the quality of the implementation, or the adequacy of the outcome measures relative to programme goals, were not taken into account.39 Without this information one cannot conclude that negative results mean that an intervention is ineffective. However, evidence of adequate implementation, and other measures to monitor the evaluation process, are important regardless of whether the findings are negative or positive.
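The questions raised by a negative finding can be set out as a simple decision sequence. The sketch below is a deliberately crude simplification under our own assumptions: the function name, the two yes/no inputs and the wording of the returned labels are hypothetical, and in practice each judgement would rest on process and implementation data rather than a single flag.

```python
from typing import Optional


def interpret_negative_finding(
    evaluation_adequate: Optional[bool],
    implementation_adequate: Optional[bool],
) -> str:
    """Classify a negative (no effect) result.

    None means that the relevant information was not reported.
    """
    if evaluation_adequate is None:
        # Cannot tell whether the study was capable of detecting an effect.
        return "inconclusive: adequacy of evaluation not reported"
    if not evaluation_adequate:
        # The research may have missed an effect that exists.
        return "possible evaluation failure"
    if implementation_adequate is None:
        return "inconclusive: quality of implementation not reported"
    if not implementation_adequate:
        # The intervention was not delivered as intended.
        return "programme failure: implementation"
    # Sound evaluation of a well delivered intervention that showed no effect.
    return "programme failure: intervention theory"


examples = {
    "no process data": interpret_negative_finding(True, None),
    "poor delivery": interpret_negative_finding(True, False),
    "sound study, sound delivery": interpret_negative_finding(True, True),
}
```

Only the last case supports a conclusion that the intervention concept itself is ineffective; the others indicate what further information is needed.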
The disagreements about interpretation of the results of community-based trials have been the basis for recommendations to expand the scope of evaluation methods for community programmes.40–43 These recommendations need to be supported by parallel developments in the criteria used to appraise the quality of evidence on public health interventions.
It has been proposed that evaluation designs should be more prudently and strategically sequenced to a programme's stage of development and to the available evaluation resources.6,44–47 Expensive randomised trial designs should be used only after research using simpler and cheaper designs has yielded satisfactory results regarding the feasibility of the intervention. Thus an RCT design may be best used to test a causal hypothesis, after a satisfactory pre-post single group design has been conducted, and assurance has been obtained that the measuring instruments satisfactorily capture programme implementation processes and outcomes.44
Specification of the theoretical basis of the intervention can also improve the credibility of outcome measures, and accords with a trend towards making the hypotheses and assumptions underpinning public health interventions more explicit.48,49 Intervention theories should be explicit and plausible. Explicit theories allow us to determine whether they are commensurable with the impact and outcome measures that have been adopted to evaluate that intervention, and whether an appropriate method was used to analyse those measures.50 The trend towards identifying the anticipated causal pathway of an intervention (the “mode of action”) is redressing the pragmatic “black box” use of epidemiology that placed more weight on research methods and outcomes than on intervention theory.51,52
Multi-dimensional approaches are available for evaluating outcomes research.53 Table 1 includes a recent guide 20 for assessing evidence on intervention effectiveness on three dimensions: the strength of evidence, which is determined by a combination of the study design (level), methodological quality and statistical precision; the magnitude of the measured effects; and relevance of the measured effects (as observed in the evaluation) to the implementation context. Such approaches are in tune with the epidemiological tradition of using multiple criteria to assess causal associations or causal inference (also listed in table 1). For the purpose of evaluating evidence on public health interventions, such an approach could be expanded to consider issues of intervention theory, intervention implementation, and monitoring in the evaluation process.
(c) The contribution of observational studies
Observational studies may represent the most feasible, acceptable and/or appropriate study designs for evaluating health interventions,54 including public health interventions.55 While RCTs (notably cluster RCTs) can be designed to evaluate even complex public health programmes, often they are not feasible because of practical or resource constraints. Consequently, well conducted RCTs are rare in public health. The implications of our reliance on observational evidence are threefold. We need to (a) better discriminate between different observational designs, (b) improve our understanding of the bias in observational studies, and (c) be pragmatic about the importance of study design relative to other dimensions of quality in evaluation research.
There are many useful observational designs available, including quasi-experimental designs, but guides for appraising evidence about clinical interventions do not discriminate among them. Their relative strengths and weaknesses are well described.8,10 The different study designs provide for alternative methods of assembling comparison groups, and of timing the implementation of an intervention in relation to the timing of various measurements. Thus, for example, replicated findings from interrupted time series designs (with repeated measurements before and after an intervention), by different investigators in different settings, may provide convincing evidence that an intervention is effective.
The relative validity of observational studies compared to RCTs has been the subject of much ongoing debate among experts in evaluation methodology. Some studies have indicated that good observational designs can produce similar findings to those produced by RCTs, although “more empirical evidence is needed”.56 Conversely, many of the comparative studies have themselves been critiqued for being methodologically flawed and highly confounded.57 In an attempt to overcome such problems, a recent study constructed randomised and non-randomised comparisons from a single dataset.58 The authors concluded that non-randomised designs introduced “serious” and “unpredictable biases” that can lead to “both over- and under-estimates” of intervention effectiveness.
The potential for bias in observational studies will mean that their classification to lower levels (compared with RCTs) in the hierarchy of study design may be upheld. We do not seek to overturn such classifications and recognise that study design is highly important in evaluating evidence in public health. We do question, however, the relative weight that is given to study design compared to other aspects of quality when appraising programme evaluations (as outlined in this paper).
Finally, appraisals of evidence quality are important in so far as they influence decisions about public health policy or practice. Care is needed that the use of evidence hierarchies to compare the potential for bias between study designs does not translate into unrealistic or overly expensive demands for level 1 or level 2 evidence, particularly if there is good or adequate level 3 evidence to inform a decision. In connection with this point, Kreuter aptly quoted Voltaire's aphorism in stating that “the best is the enemy of good” (Kreuter, 11th National Health Promotion Annual Conference, Perth, 1999).
WHAT ARE THE INTERVENTION OUTCOMES?
An evaluation of the adequacy of evidence about an intervention should include an examination of the range of outcomes considered. The evaluation criteria should help to determine whether the measured outcomes encompassed (a) the interests of people who might be involved in deciding on or delivering the intervention and (importantly) those affected by it; (b) unanticipated as well as anticipated effects of the intervention, beneficial or otherwise; and (c) the efficiency of the intervention, as well as its effectiveness.
(a) Identification of outcome information needed by important stakeholders
Given the social and political nature of public health, an appraisal of evidence should determine whether the outcome variables cover the interests of all the important stakeholders, not just those who conduct or appraise evaluative research.60 Important stakeholders include those with responsibility for implementation decisions as well as those affected by the intervention. Some of the latter may be in disenfranchised groups, and it is not always clear whose interests have been (or should be) considered in evaluative research.61,62 This recommendation is in keeping with a long tradition in the social sciences known as utilisation focused evaluation.63
Identification of the appropriate range of outcomes that should be included in a piece of evaluative research is one part of a pre-evaluation procedure known as “evaluability assessment”.64,65 This was developed in the programme evaluation field more than two decades ago and has been popularised widely within health promotion.44 Evaluability assessment requires a priori agreement about the successful outcomes of an intervention from important stakeholders' perspectives, including agreement on the types of evidence deemed to be adequate to reach a conclusion on the value of an intervention, and the questions to be asked in evaluating the intervention.66
(b) Anticipated and unanticipated effects
Public health programmes often combine biomedical, educational, social and policy strategies that have many possible outcomes, such as changes in health states and determinants of health, processes, and characteristics of individuals, communities and environments.67 These outcomes may be anticipated or unanticipated, and they may be intended or unintended. Unintended effects may be as desirable as, or more desirable than, the intended effects of the intervention. Conversely, unintended effects may detract from the intended effects to such an extent that assessment of the success of the intervention warrants revision. Evaluative research that records only the intended outcomes of an intervention may fail to detect its other positive or negative consequences. The methods of “goal-free evaluation”68 are available for detecting unintended programme effects.69
(c) Efficiency of interventions
Evidence-based health care is intended to take account of efficiency as well as effectiveness, although to date efficiency questions have not been emphasised in evidence-based medicine.70 The appraisal of evidence on public health interventions must inevitably determine whether efficiency has been assessed, and if so, how well. Examples of evidence-evaluation checklists that have been developed for appraising economic evaluations are listed in table 2. These include a guide to common standards so that evaluations from different settings can be compared; checklists for appraising published articles; regulatory guidelines; and ethical principles of good practice in economic evaluations.
IS THE RESEARCH EVIDENCE TRANSFERABLE?
Evidence-based decisions on the value and applicability of an intervention draw on knowledge of the effectiveness of an identical, similar or analogous intervention, usually carried out and evaluated in a different setting at a different time. To assess the transferability of evidence about an intervention, information is needed on (a) the intervention itself, (b) the evaluation context, and (c) interactions between the intervention and its context. A major limitation of traditional appraisal criteria is their inattention to the adequacy of these aspects of the evidence.
(a) Information on the intervention
Public health interventions are rarely a standard package. To assess transferability, information is needed on the multiple components of an intervention. This should include details about the design, development and delivery of the various intervention strategies. Information is also needed on the characteristics of people for whom the intervention was effective, and the characteristics of those for whom it was less effective or even harmful. For many interventions, knowledge of factors that influence their sustainability and dissemination will also be important.44,67,84–87 These factors may be inherent to the way intervention strategies were delivered, or relate to the context in which they were implemented (see below). The availability of such information is a marker of the quality of evidence on public health interventions.
(b) Information on the context
The social, organisational and political setting (or context) in which a public health intervention is implemented usually influences the intervention's effectiveness.85,88 It is important to distinguish between components of interventions that are highly context dependent (for example, a public education campaign to enhance immunisation uptake) and those that may be less so (for example, the efficacy of the vaccine itself among healthy infants). Contextual factors that influence the generalisability of evidence about interventions include literacy, income, cultural values and access to media and services.89 Yet much published evidence on public health interventions does not include description of contextual variables or assess their impact on measures of effect.90,91 We should note that the lack of contextual information is also a weakness of evidence on medical interventions. For example, reports on surgical procedures often omit information on training, skill and experience of operators, or even proxies such as hospital throughput. Established critical appraisal criteria do not draw attention to this deficiency.
(c) Information on interactions between the intervention and the context
Contextual factors often interact with interventions, even simple interventions such as educational programmes.92 Effect modification may arise from components of an intervention (for example, the skill and experience of the professional public health personnel responsible for the intervention), and/or the context (for example, cultural characteristics of the community in which the intervention was studied). Interactions between interventional and contextual components can have two implications. Firstly, they are likely to affect the transferability of the intervention and they also make an assessment of its transferability more difficult. Secondly, interactions greatly complicate attempts to pool the results of different studies. Criteria for assessing evidence on public health interventions should therefore determine whether interactions have been sought, understood and explained. Where strong interactions are known to exist between an intervention and its context, it can be preferable (and more informative) to explore and explain their effects, rather than pooling the findings.
The information needed to assess the transferability of evidence is often drawn from research that uses a combination of different types of methods, including observational, multilevel and qualitative methods (submitted data). Qualitative research can also enrich the understanding of intervention effects 94,95 and guide systematic reviews.96
Standards for conducting qualitative investigations are widely available.97–99 Recent interest in critical appraisal has stimulated the publication of several schemata for appraising qualitative research as a source of evidence.100–105 Examples of guides for evaluating qualitative evidence, and evidence from health promotion programme evaluations that focus on process and context information, are identified in tables 3 and 4.
EVIDENCE APPRAISAL AND PUBLIC HEALTH DECISIONS
Decisions about public health interventions should be based on a broad assessment of the strengths, weaknesses and gaps in the evidence. Reliance on levels of evidence alone to grade recommendations for action can attenuate public health decisions. For example, decisions that are mainly determined by criteria of evaluation study design will favour interventions with a medical rather than a social focus, those that target individuals rather than communities or populations, and those that focus on the influence of proximal rather than distal determinants of health.55,108,109
It is also important to recognise the relative capacity of competing stakeholders (in a decision process) to generate evidence. Certain types of interventions (for example, pharmaceutical) are more likely to be supported by high quality evidence, simply because more resources are available to conduct the evaluation and produce that evidence (rather than because the interventions are better). In addition, “best” evidence is often gathered on simple interventions and from groups that are easy to reach in a population.108 Thus conversely, little level 1 evidence exists on interventions for disadvantaged groups. This suggests that considerations of equity should temper the rigid application of rules of evidence in formulating recommendations for the use of public health resources.108
Critical appraisal guides that identify and appraise multiple dimensions of evidence 20 permit greater scope for issues of relevance and transferability to be taken into account when formulating recommendations for practice. Yet it is still important to distinguish between a systematic and rigorous appraisal of available evidence, and the complex, socio-political process that determines policy and practice decisions. These distinct judgements will often be made by separate groups and are guided by different criteria and values.
Decisions about practice require a weighing of multiple factors such as the perceived magnitude and importance of the problem, the potential effectiveness and harms of the intervention, the feasibility of its implementation, its political acceptability, and the public demand for action. Different interest groups may advocate for competing recommendations,110 and recommendations based on the same evidence may change over time or change between contexts.111,112 In policy debate, a lack of good quality information about a problem can be interpreted as meaning that the problem is unimportant.113 As the notion of evidence-based policy gains substantial political currency,114,115 there is an analogous risk that a lack of high level evidence about the effectiveness of an intervention will exclude potentially valuable interventions from consideration. A clear distinction between criteria for evaluating evidence to determine what we know (and what we don't know) about public health interventions, and the context dependent and often variable factors that determine local priorities, may allay such concerns.
The evaluation of evidence about public health interventions should examine not only the credibility of the evidence, but also its completeness and its transferability.
The criteria used for critically appraising evidence need to reflect contemporary standards for planning and evaluating community-based programmes.
The term “best quality” evidence should refer to evaluative research that was matched to the stage of development of the intervention; was able to detect important intervention effects; provided adequate process measures and contextual information, which are required for interpreting the findings; and addressed the needs of important stakeholders.
The appraisal of evidence about public health interventions should encompass not only the credibility of evidence, but also its completeness and its transferability. The evaluation of an intervention's effectiveness should be matched to the stage of development of that intervention. The evaluation should also be designed to detect all the important effects of the intervention, and to encapsulate the interests of all the important stakeholders.
These elements of evaluation have not yet been accepted as criteria for appraising evidence on public health interventions, although they are widely accepted in standards for planning and evaluating community-based programmes. We advocate their incorporation into criteria for appraising evidence on public health interventions. This can strengthen the value of evidence summaries and their potential contribution to the processes of public health advocacy and social development. Best quality evidence in public health is vital, but we should refrain from using the phrase “level 1” or “best” evidence synonymously with what is only one aspect of evidence quality, that is, study design.