Epidemiological studies evaluate multiple exposures, but the extent of multiplicity often remains non-transparent when results are reported. There is extensive debate in the literature on whether multiplicity should be adjusted for in the design, analysis, and reporting of most epidemiological studies, and, if so, how this should be done. The challenges become more acute in an era where the number of exposures that can be studied (the exposome) can be very large. Here, we argue that it can be very insightful to visualize and describe the extent of multiplicity by reporting the number of effective exposures for each category of exposures being assessed, and to describe the distribution of correlation between exposures and/or between exposures and outcomes in epidemiological datasets. The results of new proposed associations can be placed in the context of this background information. An association can be assigned to a percentile of magnitude of effect based on the distribution of effects seen in the field. We offer an example of how such information can be routinely presented in an epidemiological study/dataset using data on 530 exposure and demographic variables classified in 32 categories in the National Health and Nutrition Examination Survey (NHANES). Effects that survive multiplicity considerations and that are large may be prioritized for further scrutiny.
- Environmental epidemiology
- Epidemiological methods
- GENETIC EPIDEM
Observational epidemiological studies almost always measure multiple correlated variables in their populations, for example, individual nutrient intake measurements collected through food frequency questionnaires or panels of biomarkers. Furthermore, newer high-throughput measurement of hundreds (to thousands) of non-genetic, environmental variables in individuals can result in an expansion of exposure variables available for epidemiological studies (table 1).1–10 The promise of assessing personal exposomes (the totality of exposure load occurring from birth to death11) includes scaling up the number of environmental exposures measured in individuals to enable a data-driven search for putatively novel exposures associated with disease or other exposures and outcomes of interest through large-scale analyses such as the environment-wide association studies.12–19 One promising technology to scale up the ascertainment of ‘endogenous’ exposures is metabolomics (table 1), in which hundreds to thousands of small-molecule metabolites are ascertained in human tissue.2,4,20 Existing epidemiological cohorts have begun to ascertain hundreds of variables of these endogenous exposures such as metabolites and lipids,5,21,22 while smaller studies have ascertained on the order of thousands of chemical analytes.1 Of course, the raw number of exposure-related variables that can be measured is still far lower than what is seen in current day genome-wide association studies and genome sequencing, which assess millions of genetic variants simultaneously. However, the multiplicity burden on the exposome side is already impressive and is likely to get even greater with the advent of new platforms, including sensors that allow continuous streaming of personalised exposure signals.
Testing multiple variables for associations with other exposures and outcomes multiplies the prospects of making interesting discoveries. At the same time, this multiplicity can lead to more false positives (due to type 1 error).23 There is a need to take this multiplicity into consideration in designing, analysing, interpreting and communicating epidemiological results in a transparent manner.
Transparency about multiplicity goes beyond the issue of whether statistical inferences should account/adjust for multiplicity or not. The debate on whether p values or other statistical inferences should be corrected for multiple comparisons has been a long one. Arguments have been expressed in the past that there is no need to adjust for multiple comparisons in observational epidemiology.24 This has been the dominant view to date for most applications of traditional epidemiology. These arguments are stronger when explicit, unique hypotheses are tested that have been prespecified to be of primary interest. However, such prespecification is often non-transparent. In the absence of public preregistration25 of protocols and hypotheses, claims of prespecification can even be dubious and may meet with some healthy scepticism. Therefore, other investigators have highlighted the need for more stringency, for example, by routinely using more stringent thresholds and avoiding a focus on statistical significance.26 At the other extreme, for agnostic genomic epidemiological studies, adjusting for the entire multiplicity of genomic comparisons has become standard practice.27 Increasingly, epidemiological studies of exposures will be dealing with numerous variables. Being explicit and transparent about the extent of multiplicity may be important for other scientists to understand the background/context against which the result is reported.
Environmental exposure variables are often densely correlated.14,28,29 We argue that the density of the correlations in the variables of interest is also useful to convey transparently in any epidemiological study. New proposed correlations (or other measures of effect size) may need to be interpreted differently depending on what correlations (or other effect sizes) prevail, on average, in the field. For example, suppose that a new correlation is identified between two variables and its absolute magnitude is r=0.2 (and it is highly statistically significant at p<0.0001). This new correlation of r=0.2 may need to be interpreted differently if correlations seen for this type of exposure variable are usually very strong (eg, r>0.5 on average) or null or very small (eg, r<0.05) on average.
We offer here an example of how an epidemiological study could convey this essential background information: how many exposures it has measured and what are the typical correlation patterns between exposures against which new results may be placed to gauge relative importance. We use data from the National Health and Nutrition Examination Survey (NHANES) from 2003 to 2004.30 NHANES has captured different domains/categories of exposures, such as urine/serum biomarkers of nutrients, pesticides, hydrocarbons, infectious agents, as well as self-reported indicators of behaviour, such as smoking, physical activity and nutrient intake, and also socioeconomic and demographic variables (table 2). We defined categories of exposures/variables based on documented chemical relatedness (eg, polychlorinated biphenyls compounds, phenols or hydrocarbons), dietary relevance (eg, serum nutrients or self-reported diet questionnaire) or behaviour relatedness (eg, physical activity, smoking or pharmaceutical drug intake), and assay type ascertained from the NHANES (eg, serum or urine-based mass spectrometry).
Table 2 shows the number of exposures assessed for each of the 32 categories. Given the substantial correlation between some variables, the number of independent comparisons is somewhat smaller than the number of variables assessed. To correct for this, we have adopted here a method used in genetic association studies to estimate the number of effective variables that are present after taking the between-variable correlation into account.31 For example, 38 polychlorinated biphenyl analytes were assessed, but after taking into account their correlation, this is equivalent to 24 ‘independent’ variables (table 2). This adjustment matters most where the correlations are largest (eg, polychlorinated biphenyls, dioxins, furans) (figure 1). We also show indicatively what p value thresholds would be used if one wanted to correct for the number of independent variables in each category. If we assume all 530 variables were independent, correcting for 530 comparisons would result in a p value threshold of 0.05/530≈0.0001. We do not intend to settle here the debate of whether p values should be avoided, left uncorrected, corrected for the effective number of variables in the category of interest or whether a different method (eg, false discovery rate using the ‘step-down’32 or permutation-based approach33) should be adopted. However, presenting the number of variables and effective variables (M and Meff, respectively) will be useful in helping other scientists and readers to understand the extent of multiplicity behind an epidemiological dataset that has been analysed to obtain some reported results. Further, reporting of such information can be mandated by journal editors just as authorship and participant roles are documented prior to submission of a manuscript.
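As a rough illustration of how M, Meff and the corresponding p value thresholds relate, the sketch below uses one eigenvalue-based estimator from the genetics literature, Meff = 1 + (M - 1)(1 - Var(λ)/M), where λ are the eigenvalues of the correlation matrix. This choice of estimator is an assumption made for illustration and may differ in detail from the method cited above; the function names are hypothetical.

```python
def effective_number_of_variables(corr):
    """Estimate Meff from a square correlation matrix (list of lists).

    Assumed estimator: Meff = 1 + (M - 1) * (1 - Var(lambda) / M).
    For a correlation matrix the eigenvalues have mean 1 and their squares
    sum to the sum of squared matrix entries, so the sample variance of the
    eigenvalues can be obtained without an explicit eigendecomposition.
    """
    m = len(corr)
    if m < 2:
        return float(m)
    sum_sq = sum(r * r for row in corr for r in row)  # equals sum of lambda^2
    var_lambda = (sum_sq - m) / (m - 1)               # sample variance of eigenvalues
    return 1 + (m - 1) * (1 - var_lambda / m)


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p value threshold when correcting for n_tests comparisons."""
    return alpha / n_tests
```

Under this sketch, fully independent variables give Meff = M and perfectly correlated variables give Meff = 1. For the figures quoted in the text, `bonferroni_threshold(530)` is about 0.000094 and `bonferroni_threshold(24)` is about 0.0021.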
Figure 1 shows the magnitude of the absolute correlation coefficients across each category of variables for 29 of the 32 categories that had two or more variables. These correlations are Spearman rank for continuous variables and biserial correlations for binary variables. As documented, most categories had correlated exposures; however, most correlations were not very large (r<0.5 in absolute value). Serum measures of persistent pollutants such as polychlorinated biphenyls, acrylamide, dioxins and organochlorine pesticides had the highest average correlations in their respective categories. Further, dietary measures, such as serum levels of nutrients and self-reported dietary factors, were also densely but modestly correlated with one another, perhaps reflecting the phenomenon of clusters of nutrients that are consumed together. Epidemiological associations may be better understood in the context of other associations in the same field where similar variables are involved. For example, an absolute correlation coefficient of 0.15 between two new nutritional variables is probably not that noteworthy, since the average absolute correlation between two nutritional variables exceeds that value.
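The within-category summary described above can be sketched as follows. This is a minimal illustration that assumes continuous variables with complete, non-constant data and uses Spearman rank correlations only; the biserial correlations used for binary variables are not covered, and all function names are hypothetical.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def absolute_correlations(category):
    """Map each pair of variable names to its absolute Spearman correlation.

    category: dict of variable name -> list of values for the same individuals.
    """
    return {
        (a, b): abs(spearman(category[a], category[b]))
        for a, b in combinations(category, 2)
    }
```

The values returned for a category can then be summarised (eg, by their median or quartiles) to give the background against which a newly proposed correlation in that category is read.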
The same principle can also be extended to the evaluation of exposures with specific outcomes of interest. For example, in NHANES the median absolute correlation coefficient between the serum nutrients and low-density lipoprotein (LDL)-cholesterol is 0.17. A new proposed correlation between a nutrient and LDL-cholesterol with a correlation coefficient less than 0.17 may not be particularly noteworthy, regardless of its level of statistical significance, since 0.17 is around the typical absolute correlation seen for a nutrient. Conversely, the median absolute correlation between the nutritional variables and diabetes mellitus (defined as serum blood glucose greater than or equal to 126 mg/dL34) is 0.07. Thus, a correlation of 0.17 between a nutrient and diabetes mellitus may be noteworthy since it is more than twice the size of the typical correlation seen for this field.
One way to place a new proposed association in context is to state what the percentile of its estimated effect is as compared with other associations in the same field using exposure variables and/or outcomes of the same family/type of variables (its ‘relative importance’). In the example given above using correlation coefficients, a correlation of 0.17 with LDL-cholesterol would be at the 50th percentile, while a correlation of 0.17 with diabetes would be at the 97th percentile. While we use correlation coefficients for computational convenience, other metrics of effect may be used (eg, standardised effects or regression slopes per unit of exposure) and these metrics are typically easily interchangeable. For example, correlation coefficients may be transformed to an equivalent standardised effect (=0.5 ln((1 + r)/(1 − r))). A correlation of 0.8 would be equivalent to a standardised effect of 1.1 (a change in 1 SD of one exposure is associated with a 1.1 unit change of the other exposure or outcome of interest). Correlations of 0.17 and 0.07, respectively, are equivalent to 0.17 and 0.07 units of standardised effects. Standardised effects may also be converted to equivalents of ORs.35 Understanding the relative effect sizes may be important for planning follow-up investigations and eventually (for those correlations that have additional causal support) considering what the health policy may be, if any.
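The two calculations in this paragraph can be sketched as below. The percentile is taken here as the share of background absolute effects at or below the new absolute effect, which is one of several possible conventions and is an assumption of this sketch; the function names are hypothetical.

```python
import math


def percentile_of_effect(new_effect, background_effects):
    """Percentage of background absolute effects <= the new absolute effect."""
    new_abs = abs(new_effect)
    n_at_or_below = sum(1 for e in background_effects if abs(e) <= new_abs)
    return 100.0 * n_at_or_below / len(background_effects)


def standardised_effect(r):
    """Transform a correlation r using 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))
```

With this transformation, `standardised_effect(0.8)` is approximately 1.10, while `standardised_effect(0.17)` and `standardised_effect(0.07)` round to 0.17 and 0.07, matching the values quoted in the text.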
Of course, one cannot stress enough that causality cannot be discerned based solely on the level of statistical significance (no matter whether and how it is corrected or not) or magnitude of correlation (or other effect metric) in observational studies. However, we argue that the extra transparency of presenting the multiplicity inherent in an epidemiological dataset and of placing the effect size against the distribution of typical effects seen in the same field might be insightful and mitigate chances for false positive reporting. Many epidemiological datasets are used in reporting dozens or even hundreds of separate papers on diverse associations. A report of the data multiplicity and correlation profile of each dataset can be used to inform all of these papers. This profile can be added to a dataset registration record that describes in public view the overall dataset and study design.
Multiplicity can also be induced by the procedures through which analysts produce findings, such as the choice among different study designs and even the choice of adjustment variables used to correct for confounding. Confounding is a major hindrance to assessing causality. In this regard, efforts for causal modelling (eg, via directed acyclic graphs36) exist, but a stumbling block is how the structure of the directed acyclic graph is to be chosen. Often, this choice is achieved through a combination of some a priori knowledge and biological plausibility (or biological speculation) and through examining statistical associations (eg, correlations) between arbitrary variables.37 By considering the correlations between all variables (and the effective number of tests), analysts may prioritise for further consideration adjustments for the strongest correlates with their outcome and exposures of interest. These correlates might have been missed by traditional epidemiological thinking that relies on arbitrarily picking potential adjustment variables. Different teams of epidemiologists may think of adjusting for very different sets of variables, even when the same exposure and outcome are assessed across their studies. This is routinely demonstrated in systematic reviews and meta-analyses of epidemiological studies where the set of adjusting factors is almost always markedly different across the combined investigations. Part of this divergence may be due to the fact that not all studies measure the very same confounders. However, even for routinely collected variables, there is clearly large subjectivity among investigators in how they use them. This creates a situation where almost any result can be achieved that would conform to the original expectations of the traditional epidemiologist, provided a suitable set of adjustments and a self-justifying causal model is adopted. This flexibility can transform epidemiology into a champion field for subjectivity, allegiance bias and confirmation bias.
Some traditional epidemiologists who have heard us suggest agnostic approaches for assessments of the exposome take a defensive stance, claiming that most epidemiology is (and should be) testing specific focused a priori hypotheses. For example, a reviewer of the original submission of this paper felt that ‘it probably would be extremely complicated and finally useless to produce the same approach with a study such as European Prospective Investigation into Cancer and Nutrition (EPIC). Many environmental studies do have solid a priori hypotheses’. However, a perusal of PubMed and of the EPIC website shows that EPIC (a classic paradigm of a cohort that claims to have prior hypotheses) has already published over a thousand papers and the same applies to other leading cohorts such as the Nurses’ Health Study. While each paper usually does test only a few hypotheses of associations, cumulatively this cohort-specific publication corpus results in investigating several thousands of hypotheses in the very same single cohort dataset. We think it is better to adopt a systematic all-inclusive approach to the data rather than try to separate arbitrarily and one-at-a-time the thousands of ‘solid a priori’ hypotheses from the thousands that no one has happened to pick, until now.
Finally, a caveat is also needed about the meaning and implications of large effects. Large effects are considered to have increased credibility in evidence appraisal schemes such as GRADE.38 Nevertheless, even a very large effect at the top percentile of a field may sometimes reflect an error (‘too good to be true’) or an inflated effect (due to the winner's curse).39 Thus, it would require careful examination and replication in additional epidemiological studies. Effects that remain at the top percentiles across multiple studies may be worth prioritising for further corroboration with other types of biological or experimental evidence. One would need to consider also whether this variable that has the top effect is not strongly correlated with other variables or has strong correlations with many other variables.14 Finally, this approach also does not negate the possibility that many genuine effects, including those that reflect causal relationships, may be small or even tiny.40 Regardless, deciding what is large, small or tiny, and what is significant or not may benefit from knowing the profile of the relevant correlation matrix of studied exposures, its inherent multiplicity and the size of average effects that circulate in the field.
Contributors CJP and JPAI wrote the manuscript.
Funding Chirag J Patel is supported by an NIH National Institute of Environmental Health Sciences grant (K99 ES023504) and a PhRMA Foundation informatics fellowship. The Meta-Research Innovation Center at Stanford is supported by a grant from the Laura and John Arnold Foundation.
Provenance and peer review Commissioned; externally peer reviewed.