Original article
Exploratory causal modeling in epidemiology: Are all factors created equal?
Introduction
To a large degree, epidemiology is about identifying causes of disease and health. Naturally, this has to be done as comprehensively as the current state of the art allows. This means that all possible causes must be considered as thoroughly as possible, including nonphysical and nonchemical ones. In fact, nonphysical and nonbiological exposures were already being investigated even when epidemiology was almost exclusively concerned with infectious diseases. Since then, social class has repeatedly been identified as being associated with a variety of health outcomes, such as coronary heart disease, cancers, and others [1, 2, 3]. This has led to a special branch of epidemiology called social epidemiology. However, less progress has been made so far in further exploring the mechanisms linking social variables and health outcomes. Also, it has become clear that within certain social strata, different etiological mechanisms might be in operation [4]. One of the most promising links between social class and disease is lifestyle. Whereas 100 years ago lifestyles were largely determined by social class, this dimension has become less important with secular, political, and economic changes and the individualization of lifestyles [5]. In 1995, the World Health Organization claimed that “lifestyle-related diseases and conditions are responsible for 70–80% of deaths in developed countries” [6], citing as examples “cardiovascular diseases, cancer, diabetes, chronic bronchitis, obesity, malnutrition, mental and behavioral disorders, accidents and violence, alcohol and drug dependency, HIV/AIDS and other sexually transmitted diseases, vaccine-preventable infectious diseases, vectorborne and foodborne diseases, and low birth weight.” This underscores the enormous role that lifestyle plays in the maintenance of health and the genesis of disease.
One problem with the exploration of lifestyle factors is the way lifestyle variables are often considered in epidemiologic research. If lifestyles are adequately described by patterns of behavior rather than by isolated risk behaviors, like smoking or consuming alcohol [7], then many studies looking only at single risk behaviors would fall short in meaningfully operationalizing their lifestyle exposures. This will be even more so if, in addition to overt behaviors, lifestyles as complex psychosocial concepts are made up of or are associated with cognitive styles, attitudes, or social and other resources and are therefore influenced by a variety of variables (Fig. 1). If this were so, then many studies on lifestyle and health would have examined only superficial proxy variables (i.e., single indicators of risk behavior) instead of lifestyle itself. Therefore, especially in lifestyle-related research, a good deal of conceptual consideration seems necessary to prevent results that are, overall, not really related to lifestyle and lifestyle-related factors. Some progress has been made in this respect [8, 9]. Here, another point will be further developed: the type of causation involved in the link between lifestyle and health.
Whereas lifestyle variables are considered in epidemiological studies, the majority of reports on the above diseases do not seem to consider behavioral risks as thoroughly or as adequately as they consider “traditional” biomedical ones like hypertension or hypercholesterolemia. In particular, psychosocial variables do not seem to be given the same degree of attention as biomedical variables. This is reflected both in the low frequency with which they are explored and in the way in which they are treated. The bias against psychosocial variables might be one of the big pitfalls of modern risk factor epidemiology [10]. In the present work it will be demonstrated that this situation seems to be aggravated by improper statistical modeling of multiple risk factors in epidemiology. It will be shown that if all potential risk factors are treated as if they had the same proximity to the outcome, those that are more distant in nature tend to be deleted from exploratory models.
The reason for this is that there is a fundamental difference between lifestyle-related variables (e.g., certain personality dispositions) and risk factors (e.g., cholesterol level or blood pressure). Variables of the latter type, being either of a physiological nature or close to physiological processes, are naturally close to health outcomes (i.e., to effects in causal pathways). On the other hand, psychosocial variables are, by their very nature, rather distant from physiological endpoints. They therefore tend to be neglected or at least underestimated in “risk factorology” [11]. If, however, chained causations are the rule rather than the exception whenever psychosocial variables are involved, then treating them as just another subset of factors in multiple proximate risk factor modeling must be assessed as a systematic way of underestimating their role in etiology. Unfortunately, this seems to have been the case for about the last two decades [4, 12].
Usually, a parallel model of causation is assumed (Fig. 2) when multiple causation is explored in epidemiological analysis. The general model is that the dependent variable E (effect) is linked by a function to one or more independent variables (causes). For example, the role of hypercholesterolemia and hypertension (C1 and C2, respectively) in the genesis of coronary heart disease (CHD) is explored by the following model, in which, for simplicity, these and all other variables discussed here are assumed to be dichotomous (no/0 or yes/1):
E = β0 + β1C1 + β2C2
The β0 coefficient describes the baseline prevalence of CHD, neither related to C1 nor to C2. Usually, modeling is done by using multiple linear logistic regression. The proportion p of individuals with a positive outcome E (CHD in this example) is transformed by logit transformation prior to maximum likelihood estimation of the regression coefficients. Odds ratios (ORs) for either of the two risk factors are then computed by exponentiation of the corresponding coefficients. The approach also allows for investigating possible interactions between the independent variables by adding the term β3C1 C2 to the model.
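Written out on the logit scale, the parallel model described above can be sketched as follows. The coefficient names β1 and β2 are assumed here by analogy with the β0 and β3 named in the text; this is an illustration of the standard logistic form, not a reproduction of the authors' own notation.

```latex
\begin{align}
\operatorname{logit}(p) &= \ln\frac{p}{1-p} = \beta_0 + \beta_1 C_1 + \beta_2 C_2, \\
\mathrm{OR}_{C_i} &= e^{\beta_i}, \qquad i = 1, 2, \\
\operatorname{logit}(p) &= \beta_0 + \beta_1 C_1 + \beta_2 C_2 + \beta_3 C_1 C_2
  \quad \text{(with interaction)}.
\end{align}
```

Here p is the proportion of individuals with E = 1, and each odds ratio compares C_i = 1 with C_i = 0 while the other factor is held fixed.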
Whereas this procedure might be adequate in this particular example (although the real etiologic processes are certainly simplified), it might not be adequate in instances where lifestyle-related causations are explored. If, for example, C1 were not hypercholesterolemia but some behavior (e.g., smoking), some sort of psychological disposition strongly related to a certain lifestyle (e.g., type A behavior pattern), or a living condition (e.g., job strain [13]), then the sequential causation shown in Figure 3 might be assumed [14].
In this model (whether or not it is true for this particular example) a distinction has to be made between the distal cause (C1) and the proximal cause (C2). If a sequential causation underlies a given data set, most authors recommend that C2 be deleted from the model [15, 16, 17]. The problem, however, is that it has to be known a priori that C2 is an intermediate variable in a causal chain. With empirical data that are still to be explored, this is usually not known, and it can be suspected that even the possibility itself is often overlooked [18].
The aim of this paper is to demonstrate the consequences of analyzing a sequentially caused relationship with a model assuming parallel and equally proximate causation. Possible alternatives are proposed.
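The consequence described above can be illustrated with a small sketch. It generates synthetic data in which C1 affects E only through C2, then fits a logistic model with C1 alone and a parallel model with C1 and C2. All probabilities, the seed, and the helper `fit_logistic` are assumptions chosen for illustration; they are not the settings or software used in the paper.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum likelihood logistic regression via Newton-Raphson.

    X must already contain a column of ones for the intercept.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        w = p * (1.0 - p)                        # variance weights
        grad = X.T @ (y - p)                     # score vector
        hess = X.T @ (X * w[:, None])            # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Hypothetical sequential causation: C1 -> C2 -> E (assumed probabilities).
rng = np.random.default_rng(2002)                # fixed seed for reproducibility
n = 100_000
c1 = rng.random(n) < 0.30                        # prevalence of the distal cause
c2 = rng.random(n) < np.where(c1, 0.50, 0.10)    # C2 depends only on C1
e = rng.random(n) < np.where(c2, 0.25, 0.05)     # E depends only on C2

c1, c2, e = c1.astype(float), c2.astype(float), e.astype(float)
ones = np.ones(n)

# Model with the distal cause alone.
beta_crude = fit_logistic(np.column_stack([ones, c1]), e)
# Parallel model treating C1 and C2 as equally proximate causes.
beta_parallel = fit_logistic(np.column_stack([ones, c1, c2]), e)

or_c1_crude = np.exp(beta_crude[1])
or_c1_adjusted = np.exp(beta_parallel[1])
or_c2_adjusted = np.exp(beta_parallel[2])

print(f"OR for C1, C1-only model:  {or_c1_crude:.2f}")
print(f"OR for C1, parallel model: {or_c1_adjusted:.2f}")
print(f"OR for C2, parallel model: {or_c2_adjusted:.2f}")
```

Under these assumed probabilities the population odds ratio for C1 alone is about 2.3, whereas in the parallel model it is 1 because C2 carries the entire effect. A selection procedure working on the parallel model would therefore tend to drop the distal cause.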
Methods
To demonstrate the role of different models applied to different types of causation in data, Monte Carlo simulations of data with well-defined causations were performed. Data sets with 100,000 “cases” (i.e., records) each were generated with the SAS (version 6.12 for Windows 95) RANUNI routine. With the computer clock as the source for the seed values, the routine generated random numbers from the uniform distribution on the interval (0,1) using a prime modulus multiplicative generator [19].
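The generation step can be sketched in Python as a rough analogue of the procedure described above: uniform random numbers on (0,1) are compared with preset probabilities to produce dichotomous variables. The function names, the probabilities, and the fixed seed are hypothetical (the paper used SAS RANUNI seeded from the computer clock, and its parameter settings are not reproduced here).

```python
import numpy as np

def simulate_chain(n=100_000, p_c1=0.30, p_c2=(0.10, 0.50),
                   p_e=(0.05, 0.25), seed=12345):
    """Simulate dichotomous C1 -> C2 -> E from uniform(0,1) draws.

    p_c2 = (P(C2=1 | C1=0), P(C2=1 | C1=1))
    p_e  = (P(E=1  | C2=0), P(E=1  | C2=1))
    All values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    c1 = (rng.random(n) < p_c1).astype(int)
    c2 = (rng.random(n) < np.where(c1 == 1, p_c2[1], p_c2[0])).astype(int)
    e = (rng.random(n) < np.where(c2 == 1, p_e[1], p_e[0])).astype(int)
    return c1, c2, e

def relative_risk(outcome, exposure):
    """Risk of outcome among the exposed divided by risk among the unexposed."""
    return outcome[exposure == 1].mean() / outcome[exposure == 0].mean()

c1, c2, e = simulate_chain()

# Checks of the generated data: prevalences and bivariate relative risks.
prevalences = {"C1": c1.mean(), "C2": c2.mean(), "E": e.mean()}
rr_c2_given_c1 = relative_risk(c2, c1)
rr_e_given_c1 = relative_risk(e, c1)
rr_e_given_c2 = relative_risk(e, c2)

# Within strata of C2, the distal cause C1 should carry no further risk.
rr_e_given_c1_in_c2_0 = relative_risk(e[c2 == 0], c1[c2 == 0])
rr_e_given_c1_in_c2_1 = relative_risk(e[c2 == 1], c1[c2 == 1])

print(prevalences)
print(rr_c2_given_c1, rr_e_given_c1, rr_e_given_c2)
print(rr_e_given_c1_in_c2_0, rr_e_given_c1_in_c2_1)
```

A fixed seed is used here only so that the sketch is reproducible; replacing it with `None` draws fresh entropy on every run, which is closer in spirit to clock-based seeding.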
In two
Results
After generation, the two data sets were analyzed to check for correspondence with the above methods and parameter settings. To do so, the prevalences of C1, C2, and E were computed for both data sets. In addition, three bivariate relative risks were computed: that of C2 under C1 = 1 compared with C1 = 0 (RR C2|C1), that of E under C1 = 1 compared with C1 = 0 (RR E|C1), and that of E under C2 = 1 compared with C2 = 0 (RR E|C2). The observed frequencies underlying these computations are contained in Tables 3 and 4. The results of
Discussion
Obviously, not all etiological factors are the same in a causal sense. Only if adequate causal models are used do distal variables have a chance to survive exploration of causal chains. As the examples with synthetic data sets with known causal structure lucidly show, however, the data easily reveal their structure to the correct model. The straightforward and prima vista rather trivial conclusion must therefore be drawn from the present results that adequate models (i.e., models that reflect
References (35)
- et al. Health lifestyle patterns of US adults. Preventive Med (1994)
- Health inequalities in European countries (1989)
- The social and economic environment and human health
- et al. The increasing disparity in mortality between socioeconomic groups in the United States, 1960 and 1986. New Engl J Med (1993)
- Prisoners of the proximate: loosening the constraints on epidemiology in an age of change. Am J Epidemiol (1999)
- Modernity and self-identity: self and society in the late modern age (1991)
- The world health report (1995)
- Health and lifestyle (1990)
- Life-style and health behavior
- Traditional epidemiology, modern epidemiology, and public health. Am J Public Health (1996)
- The emptiness of the black box. Epidemiol
- Creating a new knowledge base for the new public health. J Epidemiol Comm Health
- Job strain and cardiovascular disease. Ann Rev Public Health
- Antihypertensive treatment, compliance, and quality of life: review of a little understood relation. J Clin Psychol Med Settings
- Statistical methods in cancer research, vol 1: the analysis of case-control studies
- Epidemiologic research: principles and quantitative methods
- Modern epidemiology