The epidemiological concept of confounding has had a convoluted history. It was first expressed as an issue of group non-comparability, later as an uncontrolled fallacy, then as a controllable fallacy named confounding, and, more recently, as an issue of group non-comparability in the distribution of potential outcome types. This latest development synthesised the apparent disconnect between phases of the history of confounding. Group non-comparability is the essence of confounding, and the statistical fallacy its consequence. This essay discusses how confounding was perceived in the 18th and 19th centuries, reviews how the concept evolved across the 20th century and finally describes the modern definition of confounding.
To an unprepared mind, the terms ‘confounding’ or ‘confounder’ do not immediately evoke the consequences of comparing groups of people who differ on determinants of the studied outcome. Expressions suggesting the imbalanced distribution of multiple independent causes across groups would have conveyed the meaning more directly, but epidemiology retained the verb confound. The reason is that the theoretical work on the concept of confounding started with a description of a statistical fallacy: under some conditions, the effect of an exposure could be similar in each stratum of a third variable, but when these strata were pooled, it was as if the effect of the exposure of interest got ‘mixed’ with that of the third variable.1 The fallacy was thus aptly named confounding, from an old usage of the Medieval Latin verb ‘confundere’, which meant mixing.2 Earlier attempts to trace the history of confounding essentially focused on this conceptualisation of confounding as a fallacy.2 3
Compared to earlier reports, the present essay expands the history of the concept known today as epidemiologic confounding in the phases preceding and following the time when it was primarily viewed as a fallacy. After discussing how confounding was perceived in the 18th and 19th centuries, the essay reviews how the concept evolved across the 20th century and finally describes the modern definition of confounding.
The methodological approach driving this history of confounding is inspired by Piaget's genetic epistemology.3 4 The leading idea is that scientific disciplines are in continual construction, formalisation and organisation. Their methods and concepts are commonsensical when the discipline first appears, but become increasingly theoretical and abstract as the discipline acquires experience and addresses questions of increasingly complex nature.4
The genetic epistemology approach assumes that the concept that was eventually named confounding: A. had a history, which started with commonsensical observations, B. evolved into an increasingly abstract, formal and overarching concept, and C. is still evolving today.
To trace the history of confounding, this essay uses the four phases (preformal, early, classic and modern) previously identified in the history of epidemiological methods and concepts.3 It also focuses on the theory of confounding, and does not cover the statistical approaches to confounding-related issues (eg, collapsibility5), methods of adjusting for confounding, or the history of causal inference, which, even though closely related and overlapping at times with that of confounding, has a broader scope.5 6
Non-comparability of groups is the most primitive epidemiological concern to which the modern concept of confounding can be traced. When, in 1747, Lind7 compared the efficacy of candidate treatments of scurvy, he made sure his six experimental pairs of seamen were comparable, a priori, in terms of determinants of scurvy lethality such as disease stage, food and air quality.8
In the 19th century, group non-comparability was a formidable criticism of epidemiological studies.9 Hence the emphasis put by John Snow10 on the comparability of the 1854 London clients of the Southwark and Vauxhall water company, who drank polluted Thames water and experienced high mortality from cholera, with those of the Lambeth Company, who received relatively sewage-free water and experienced low mortality from cholera. Both groups, Snow insisted, were similar in social standing, housing space and occupations. He specifically investigated neighbourhoods with mixed water supply, in which adjacent houses could be supplied by different water companies.11 But for Snow's contemporaries, like Farr12, who believed that cholera was due to air pollution, that is, miasma,13 the two companies served clients who differed substantially in ways thought to be relevant to the occurrence of cholera, such as elevation above sea level, family income and quality of housing. How could Snow confound his critics? He could only speculate that the clients of the two companies must have been comparable as a large number of people (‘no fewer than 300 000’), ‘were divided into two groups without their choice, and, in most cases, without their knowledge’.11 Retrospectively, we understand that the contagionist Snow was arguing against the miasmatic idea that the two client populations were not comparable on some miasma-related confounding characteristics. Snow claimed that comparability was plausible, but he lacked the techniques developed subsequently to achieve comparability analytically or by design.
Early theory of confounding
Indeed, epidemiologists of the first half of the 20th century began to formally address the criticism of non-comparability.9 They implemented new techniques such as random allocation of treatment14, restriction of the study sample15, standardisation of risks and rates15 16 and exposure propensity scores.17 18 They improved the epidemiological study designs, such as retrospective cohort studies16 19 and case-control studies.17 20 All of these efforts aimed to design studies and/or analyse the data in ways that purposefully optimised comparability on alternative causes of the studied outcome. Surprisingly, the first definition of the concept we refer to today as ‘confounding’ did not follow from this line of efforts to achieve balanced comparisons.
As shown in table 1, Yule,25 in 1903,1 Greenwood,26 in 1935,21 and Hill,27 in 1939,22 apparently independently described a fallacy resulting from pooling data when a third variable was not equally distributed in the compared groups. Yule used the imaginary example of an attribute, not transmitted by fathers to sons or by mothers to daughters, but that showed ‘considerable apparent inheritance’ when the data of fathers, sons, mothers and daughters were analysed together. Greenwood21 imagined an immunisation experiment, in which the risk of death was similar among the inoculated and the non-inoculated in each of two groups of patients, but pooling the two groups together resulted in a spurious protective effect of the inoculation. In Hill's example, a treatment did not work for men or women, but reduced mortality when the male and female data were combined.22
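The pooling fallacy in Hill's example can be reproduced with a few lines of arithmetic. The sketch below uses hypothetical counts chosen for illustration (they are not the figures from table 1): mortality is identical in treated and untreated patients within each sex, but most treated patients are women, who have the lower baseline mortality.

```python
from fractions import Fraction

# Hypothetical (deaths, patients) counts; assumed for illustration only,
# not taken from Yule, Greenwood, Hill or table 1.
strata = {
    "women": {"treated": (8, 80), "untreated": (2, 20)},
    "men": {"treated": (8, 20), "untreated": (32, 80)},
}


def risk(deaths, patients):
    """Risk of death as an exact fraction."""
    return Fraction(deaths, patients)


def pooled_risk(arm):
    """Risk of death in one arm after adding the sex strata together."""
    deaths = sum(s[arm][0] for s in strata.values())
    patients = sum(s[arm][1] for s in strata.values())
    return Fraction(deaths, patients)


# Treated minus untreated risk, within each sex and after pooling.
stratum_differences = {
    name: risk(*s["treated"]) - risk(*s["untreated"])
    for name, s in strata.items()
}
pooled_difference = pooled_risk("treated") - pooled_risk("untreated")

for name, diff in stratum_differences.items():
    print(f"{name}: risk difference = {float(diff):+.2f}")
print(f"pooled: risk difference = {float(pooled_difference):+.2f}")
```

With these assumed counts the risk difference is zero in women and in men, yet the pooled comparison shows lower mortality among the treated, purely because sex is unequally distributed between the arms.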
Yet, Yule, Greenwood and Hill do not seem to have viewed the fallacy as a common issue in population studies, and did not suggest computing a weighted average of the stratum-specific effects to bypass it. Apparently, their examples went into oblivion. The subsequent phase of the history of the epidemiological concept of ‘confounding’ appears to be an offshoot of discussions related to the modelling of interactions.
Classic theory of confounding
Fisher28 used the verb ‘confound’ in 1926, to describe the implication of discarding some high-order interactions in the analysis of data from studies with factorial designs.2 Precision could be improved, but the sacrifice of interactions would amalgamate strata, eliminating and therefore ‘confounding’ the manifestation of some of the underlying heterogeneity of effects.29
In my view, Fisher was using the term ‘confounding’ in the same way it had been used earlier by the English philosopher Mill, that is, as the consequence of ignoring causal interactions. For Mill, confounding meant ‘intermixture of causes,’ which he defined as two or more causes, ‘modifying the effects of one another’.30 Mill was referring to a mixing of effects that were heterogeneous across strata of one of the causes. This was different from Yule's fallacy, in which the exposure had a single effect, which was similar in all strata of the extraneous factor, except for being confounded in the pooled effect.
It is at that point that Simpson,31 building on Fisher's work, made the contribution now known as Simpson's paradox. Simpson showed that discarding the interaction terms could impact the estimation of the pooled effect even when the stratum-specific effects were homogeneous. This could actually leave ‘considerable scope for paradox and error’. He gave the example of an imaginary trial (see table 1) in which the treatment homogeneously increased the survival odds both for males and females as separate groups, but had no effect when genders were pooled.23
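A pattern of this kind can be checked numerically. The sketch below uses illustrative counts that are assumed for the purpose of the example and should not be read as a quotation of table 1. It computes the survival OR within each sex and after pooling.

```python
from fractions import Fraction

# Illustrative (alive, dead) counts per arm; assumed for this sketch.
males = {"treated": (8, 5), "untreated": (4, 3)}
females = {"treated": (12, 15), "untreated": (2, 3)}


def odds_ratio(table):
    """Survival odds in the treated divided by survival odds in the untreated."""
    alive_t, dead_t = table["treated"]
    alive_u, dead_u = table["untreated"]
    return Fraction(alive_t, dead_t) / Fraction(alive_u, dead_u)


def pool(*tables):
    """Add the cell counts of several strata, arm by arm."""
    return {
        arm: tuple(sum(t[arm][k] for t in tables) for k in (0, 1))
        for arm in ("treated", "untreated")
    }


or_males = odds_ratio(males)
or_females = odds_ratio(females)
or_pooled = odds_ratio(pool(males, females))

print("OR males:", or_males)
print("OR females:", or_females)
print("OR pooled:", or_pooled)
```

In these counts the OR is 6/5 in both sexes but exactly 1 after pooling. Sex is related both to treatment (27 of the 40 treated are female, against 5 of the 12 untreated) and to survival, which is the situation in which Simpson argued the strata should not be collapsed.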
Simpson posited that, for second-order interactions to be ignored, the third variable had to be independent of the treatment among the non-outcomed and independent of the outcome variable among the unexposed. Otherwise, stratification had to be preserved. This became the core of the classic epidemiological definition of confounding.
From 1959 on, expressions appear in the epidemiology literature, which evoke Yule's fallacy or Simpson's paradox without explicitly referring to them. Papers and textbooks mention ‘indirect associations’,32 and ‘misleading associations,’ produced by ‘extraneous factors’,33 or, ‘indirect associations generated by factors related to both outcome and exposure’.34 The term ‘confounding’ itself began to appear in epidemiological articles and textbooks in the 1970s.35–37 Its usage may have reflected the influence of the sociologist Kish who had defined the term in 1959.2 38
Around 1980, it was specified, in addition to the two conditions formulated by Simpson, that the third variable should not mediate the relation of exposure to outcome.39 40 This third condition highlighted the need for a priori, non-statistical knowledge about the relationship of the potential confounder with the other studied variables.41
Overall, table 1 shows the similarity of the quantitative examples used to illustrate confounding as a mixing of effects, from Yule1 to Rothman,24 that is, across most of the 20th century. Yule's expression of confounding as, ‘a fallacy caused by the mixing of records (ie, strata)’1, is analogous to Rothman's, ‘on the simplest level, confounding may be considered as a mixing of effects’.24
Modern theory of confounding
The classic definition of confounding had weaknesses. It was derived from the relation of additional variables to exposure and outcome, and not from the characteristics of the studied association, such as non-comparability. A variable could meet the classic definition and not be a confounder.42 Matching for a confounder had different implications in cohort and case-control studies.39 43 Screening for confounding by comparing the stratum-specific and the pooled effects could lead to different conclusions based on whether one used risk ratios, risk differences or ORs.39
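The dependence of this screening procedure on the effect measure can be illustrated with a small calculation. The sketch below assumes hypothetical risks in two equally sized strata of a third variable that is unrelated to exposure, so that it cannot confound the comparison. The function name `measures` and all numbers are illustrative.

```python
from fractions import Fraction

# Hypothetical outcome risks by stratum of a third variable. Each stratum is
# assumed to hold half of the exposed and half of the unexposed, so the third
# variable is unrelated to exposure.
risks = {
    "stratum 1": {"exposed": Fraction(8, 10), "unexposed": Fraction(5, 10)},
    "stratum 2": {"exposed": Fraction(5, 10), "unexposed": Fraction(2, 10)},
}


def measures(r1, r0):
    """Risk difference, risk ratio and OR for exposed risk r1 vs unexposed risk r0."""
    return {
        "risk difference": r1 - r0,
        "risk ratio": r1 / r0,
        "odds ratio": (r1 / (1 - r1)) / (r0 / (1 - r0)),
    }


by_stratum = {
    name: measures(r["exposed"], r["unexposed"]) for name, r in risks.items()
}

# With equal stratum sizes, the pooled risk is the simple average of the
# stratum-specific risks.
pooled_exposed = sum(r["exposed"] for r in risks.values()) / len(risks)
pooled_unexposed = sum(r["unexposed"] for r in risks.values()) / len(risks)
pooled = measures(pooled_exposed, pooled_unexposed)

for name, m in {**by_stratum, "pooled": pooled}.items():
    print(name, {k: round(float(v), 3) for k, v in m.items()})
```

Under these assumptions the risk difference is 3/10 in both strata and in the pooled data, suggesting no confounding, whereas the OR is 4 in each stratum and 169/49 (about 3.45) after pooling, which the same screening rule would read as confounding.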
The modern definition of confounding was inspired by work in the analysis of randomised controlled trials. In 1923, Neyman44 defined a causal effect as the impossible contrast between the outcome of a single unit, say an individual, if assigned the experimental treatment, and the outcome of that same individual if concurrently assigned the reference treatment.45 In 1974, Rubin stated the fundamental problem of effect identification in terms similar to those of Neyman.46 If ‘y(E)−y(C)’ is the effect of treatment E versus control C on outcome Y, and assuming y(E) and y(C) need to be measured at time 2 on the same person: ‘The problem in measuring y(E)−y(C) is that we can never observe both y(E) and y(C) since we cannot return to time t1 to give the other treatment’.46
Each individual can be observed in only one treatment state at any point in time. Of the two potential outcomes (ie, under the experimental or under the reference treatments), one is observed, and the other needs to remain hypothetical. Thus, as described by Copas in 1973,47 there could be four individual types of potential outcome pairs for a dichotomous treatment (A and B) followed by a dichotomous outcome (success or failure) according to whether a subject would respond to A and B, A but not B, B but not A, or neither A nor B.
There is evidence in the literature of ongoing epidemiological reflection about potential outcomes in the 1980s,48 49 but it was not until a 1986 paper by Greenland and Robins that the potential outcome approach to confounding was made widely accessible to epidemiologists.42 In Greenland and Robins' paper, the potential outcome model was confined to deterministic risks (ie, risks that can equal either 0 or 1), but it differed from previous discussions46 50 because, as shown in table 2, which imitates a table in Greenland and Robins' 1986 paper, it used the four ‘causal’ types47 dubbed ‘doomed’, ‘exposure causative’, ‘exposure preventive’ and ‘immune’.
The example in table 2 shows that if a centenarian lady has been vaccinated and does not get the flu, she has no way of knowing whether she was susceptible and the vaccine was ‘preventive’, or whether she is naturally ‘immune’. Similarly, if a non-vaccinated person does not get the flu, she cannot know whether she would have avoided the flu had she been vaccinated. She could be ‘doomed’ or she could lack the protection of the ‘preventive’ vaccine. The effect of the vaccine cannot be identified, or its parameter estimated, without knowing both potential outcomes, under vaccination, as well as under no vaccination. This is the logical impasse mentioned by Rubin46: both potential outcomes cannot be observed simultaneously in the same person.
Consider now two large randomised groups of N subjects each, and suppose that in each group the N subjects are d doomed, c causative, p preventive and i immune to flu, where d+c+p+i=N. One group gets the vaccine and the other does not. The risk difference of getting the flu is identifiable as groups are large and comparable with respect to their potential outcome types, assuming there were no gross violations of the assignment protocol,51 misclassification, or losses in the follow-up. They are, in Greenland and Robins' terminology, ‘exchangeable’. The risk of flu is RV=(d+c)/N in those vaccinated, and RNV=(d+p)/N in those not vaccinated. The risk difference between the non-vaccinated and the vaccinated is RD=RNV−RV=(p−c)/N. The risk difference only ‘partially identifies’ the vaccine effect, because a zero effect could be due to the vaccine causing as many flu cases (c) as it prevents (p). ‘Full identification’ is possible if, for example, the vaccine does not contain killed or weakened influenza virus, but only split particles of the flu virus, which cannot cause flu. Under this scenario, there are no ‘c’ subjects (c=0) and the risk difference is simply p/N.
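The arithmetic of the four types can be sketched in a few lines. The function name `flu_risks` and the group compositions below are hypothetical, and the risk difference is taken as non-vaccinated minus vaccinated so that it equals (p−c)/N.

```python
from fractions import Fraction


def flu_risks(d, c, p, i):
    """Return (RV, RNV, RNV - RV) for two exchangeable groups of size N = d + c + p + i.

    d: doomed (flu whether vaccinated or not)
    c: causative (flu only if vaccinated)
    p: preventive (flu only if not vaccinated)
    i: immune (no flu either way)
    """
    n = d + c + p + i
    risk_vaccinated = Fraction(d + c, n)      # doomed and causative get the flu
    risk_not_vaccinated = Fraction(d + p, n)  # doomed and preventive get the flu
    return risk_vaccinated, risk_not_vaccinated, risk_not_vaccinated - risk_vaccinated


# Hypothetical compositions, for illustration only.
balanced = flu_risks(d=100, c=50, p=50, i=800)   # vaccine causes as many cases as it prevents
no_harm = flu_risks(d=100, c=0, p=100, i=800)    # vaccine cannot cause flu (c = 0)

print("c == p:", balanced)
print("c == 0:", no_harm)
```

In the first composition the risk difference is zero even though the vaccine affects 100 individuals in each group. In the second it equals p/N, which illustrates the distinction between partial and full identification.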
However, if the groups were not at least ‘partially’ exchangeable (if, for example, there were more ‘doomed’, eg, centenarians with lethargic immune response, in the vaccinated group than in the non-vaccinated group), the ds would not cancel out, and the risk difference would be confounded.
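The role of the ‘doomed’ can be written out explicitly. The group-specific subscripts below (V for the vaccinated group, NV for the non-vaccinated group) are notation assumed for this sketch and do not appear in Greenland and Robins' paper as cited here.

```latex
\begin{align*}
R_{NV} - R_{V}
  &= \frac{d_{NV} + p_{NV}}{N} - \frac{d_{V} + c_{V}}{N} \\
  &= \frac{p_{NV} - c_{V}}{N} + \frac{d_{NV} - d_{V}}{N}
\end{align*}
```

When the groups are exchangeable, d_NV = d_V and the second term vanishes (and, with p and c also balanced, the first term reduces to (p−c)/N). With more ‘doomed’ among the vaccinated, d_V > d_NV, the second term is negative and the observed risk difference understates the net preventive effect of the vaccine.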
This theory of confounding derived from potential outcome contrasts has been generalised from randomised to observational studies;52 it has helped to formally distinguish confounding from selection bias53 and has recently been revisited by its authors.54
From a broad historical perspective, the modern definition of confounding based on potential outcome contrasts has reinstated group non-comparability as the essence of confounding, establishing the statistical fallacy as one of its consequences.
What is already known on this subject
Earlier attempts to trace the history of confounding focused on the period when confounding was conceptualised as a fallacy resulting from mixing the effect of the studied variables with that of a third variable.
What this study adds
The present essay expands the history of the concept known today as epidemiologic confounding to the 18th and 19th centuries, when it began to be viewed as an issue of non-comparability between groups.
It also explains how the modern definition of confounding based on potential outcome contrasts has reinstated group non-comparability as the essence of confounding and established that the statistical fallacy, from which confounding draws its name, is a consequence of group non-comparability.
This essay was presented as an invited lecture at XVIII IEA World Congress of Epidemiology and the VII Brazilian Congress of Epidemiology, in Porto Alegre, 21–24 September, 2008. I am indebted to Sander Greenland, Sharon Schwartz, Olli Miettinen, Jan Vandenbroucke, Jamey Robins, Timothy Lash, Paolo Vineis and Raj Bhopal for discussions and comments on the many earlier versions of this manuscript.
Competing interests None.
Patient consent Obtained.
Provenance and peer review Not commissioned; externally peer reviewed.