## Abstract

A statistical analysis combines data with assumptions to yield a quantitative result that is a function of both. One goal of an epidemiological analysis, then, should be to combine data with good assumptions. Unfortunately, a typical quantitative epidemiological analysis combines data with an assumption for which there is neither theoretical nor empirical justification. The assumption is that study imperfections (eg residual confounding, subject losses, non-random subject sampling, subject non-response, exclusions because of missing data, measurement error, incorrect statistical assumptions) have no important impact on study results. The author explains how a typical epidemiological analysis implicitly makes this assumption. It is then shown how in a quantitative analysis the assumption can be replaced with a better one. The paper begins with a simple, everyday example that illustrates the fundamental concepts. The relationship between an observed relative risk, the true causal relative risk and error terms that describe the impact of study imperfections on study results is then described mathematically. This mathematical description can be used to quantitatively adjust a relative-risk estimate for the combined effect of study imperfections.

A statistical analysis combines data with assumptions1 2 to yield a quantitative result that is a function of both. One goal of an epidemiological analysis, then, should be to combine data with good assumptions. Unfortunately, a typical quantitative epidemiological analysis combines data with an assumption for which there is neither theoretical nor empirical justification. The assumption: study imperfections (eg residual confounding, subject losses, non-random subject sampling, subject non-response, exclusions because of missing data, measurement error, incorrect statistical assumptions) have no important impact on study results.

The fundamental premise of this paper is that we can do better. Although this premise has been espoused by others,3–27 only one publication other than this one has presented a mathematical foundation that can be used to account quantitatively for the combined impact of major types of study imperfections.10 Our purpose here is twofold: (1) to explain how a typical epidemiological analysis implicitly makes this assumption, and (2) to show how in a quantitative analysis the typical assumption can be replaced with a better one. We begin with a simple, everyday example to illustrate the fundamental concepts. We then describe mathematically the relationship between an observed relative risk, the true causal relative risk and error terms that describe the impact of study imperfections on study results. This mathematical description can be used to quantitatively adjust a relative-risk estimate for the combined effect of study imperfections.

## A SIMPLE EXAMPLE

Consider the following situation. You want to know how much money you have in your checking account. You look at your checking account register, and you see that the balance at the bottom of the register is US$500. You then see, however, that the last cheque entered into the register is no. 3651, but the next blank cheque is no. 3653. Cheque no. 3652 is unaccounted for. So, how much money do you have to spend in your checking account?

There are several approaches one could take to answer this question:

### Approach 1

Do not worry about cheque no. 3652. Act as if you have approximately US$500 to spend in your checking account. This is equivalent to implicitly assuming that the amount of cheque no. 3652 is approximately zero.

Is this a good approach? It is a good bet that we can do better by carefully thinking about what the amount of cheque no. 3652 might be. This leads us to approach 2.

### Approach 2

Do worry about cheque no. 3652. First, describe mathematically what you want to know (amount in checking account) as a function of what you observe (US$500) and an error term (amount of cheque no. 3652):

(amount in checking account) = US$500 − (amount of cheque no. 3652). (1)

Second, if you are not certain about the amount of cheque no. 3652, make a thoughtful assumption about its amount, and use that assumption in equation (1) to solve for the amount in your checking account. For example, perhaps upon careful reflection you know that the amount of cheque no. 3652 must be less than US$100 because you are careful with your money and surely would remember writing a cheque for that amount or more (or surely would not have forgotten to enter it into the register). Given this thoughtful assumption (that the amount of cheque no. 3652 lies somewhere between US$0 (ie a voided cheque) and US$99.99) and equation (1), you estimate that you have somewhere between US$400.01 ( = US$500 − US$99.99) and US$500 ( = US$500 − US$0) in your checking account.
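The bound-and-propagate arithmetic of approach 2 can be sketched in a few lines of code (a minimal illustration; the US$0–US$99.99 range is the assumption made in the text):

```python
# Approach 2: bound the unknown amount of cheque no. 3652,
# then propagate the bounds through equation (1).
register_balance = 500.00               # observed register balance (US$)
cheque_low, cheque_high = 0.00, 99.99   # assumed plausible range for cheque no. 3652

# equation (1): amount in account = observed balance - amount of missing cheque
account_low = register_balance - cheque_high
account_high = register_balance - cheque_low

print(f"Amount in account: US${account_low:.2f} to US${account_high:.2f}")
# prints: Amount in account: US$400.01 to US$500.00
```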

Critics of approach 2 say that it is subjective. And it is: we make an assumption about the amount of the missing cheque. Can we be objective instead? The desire for objectivity leads us to consider a third approach.

### Approach 3

Do worry about cheque no. 3652. As in approach 2 above, describe mathematically what you want to know as a function of what you observe and an error term. If you are certain about the amount of cheque no. 3652, enter it into equation (1), and solve directly for the amount in your checking account. For example, if you have on-line banking, and if cheque no. 3652 has been cashed and has cleared the bank, you can see its exact amount on-line.

If, however, you are not certain about the amount of the missing cheque, and if you wish to be objective, then what value do you assign in equation (1) to “amount of cheque no. 3652”? Your only option is to assign it the value of “?” to indicate that you do not know and will not take a guess – even an educated one. Solving, then, for the amount in your checking account, you subtract “?” from US$500, which yields “?” as the answer to your question about the amount in the checking account:

? = US$500 − ?. (2)

In statistical jargon, this is called an identification problem. That is, without knowing or making assumptions about the amount of cheque no. 3652, in our example one cannot identify the amount of money in the checking account.

### Pros and cons

Approaches 1 and 2 are both subjective, and they both require the same number of assumptions. They are on equal footing in this regard.

Approach 1 has the advantage of requiring less work. It requires no thought about assumptions. Approach 2, in contrast, requires a mathematical description of the relationship between the true value, the observed result and error terms. (In real-life application, unlike our simple example above, there usually is more than one error term.) And it requires thought about the value to be assigned to each error term. In other words, approach 2 forces you to confront your errors head-on and to deal with your knowledge of, and uncertainty about, them explicitly and quantitatively.

It is certainly true that approach 1 will sometimes yield a result as good as or better than that of approach 2. After all, “even a random number generator will give the correct answer sometimes” (as my colleague, professor Timothy Church, likes to say). We believe, however, that “careful thought” trumps “no thought” most of the time. We therefore always prefer approach 2 to approach 1.

Approach 3 has the advantage of requiring no assumptions whenever we know the values of the error terms with absolute certainty. This approach, however, loses its advantage when we are uncertain about one or more of the error terms. Then the desire for objectivity does not allow us to use information about the error terms that, although uncertain, may still be very useful.

When adjusting study results for study imperfections, in practice we never know the values of all the error terms with absolute certainty. Our best bet, then, for epidemiological situations is approach 2, which we employ below.

## THE MATHEMATICAL FOUNDATION FOR ADJUSTING AN OBSERVED RELATIVE RISK FOR STUDY IMPERFECTIONS

We can do for an epidemiological study situation what we did above in the checking account example: we can mathematically describe the relationship between what we want to know, what we observe and error terms. First, what we want to know: in equation (1) replace “amount in checking account” with *E*(RR_{causal}), the expected value of a causal–contrast ratio.17 Second, what we observe: replace “US$500” with RR_{observed}, an observed relative risk from a typical epidemiological analysis. Third, error terms: replace “amount of cheque no. 3652” with error terms (*ϵ*_{i}, *i* = 1, ..., *n*) that describe the impact of study imperfections on RR_{observed}, where Π denotes that the *n* *ϵ*_{i} terms multiply. Finally, replace the subtraction sign with a division sign. These changes result in the following equation, which forms the mathematical foundation for adjusting an observed relative risk for study imperfections:

*E*(RR_{causal}) = RR_{observed}/Π*ϵ*_{i}. (3)

With some simple rearrangement, equation (3) becomes equation (4), which describes what we observe as a function of what we want times a series of error terms:

RR_{observed} = *E*(RR_{causal})×Π*ϵ*_{i}. (4)
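As a numerical sketch of equation (3), with all values invented for illustration: suppose RR_{observed} = 2.0 and we assume two error terms, *ϵ*_{1} = 1.25 (say, confounding inflating the observed RR) and *ϵ*_{2} = 0.80 (say, non-response deflating it):

```python
# Equation (3): E(RR_causal) = RR_observed / (product of the error terms).
# All values below are hypothetical, for illustration only.
rr_observed = 2.0
eps_1 = 1.25   # assumed multiplicative error, eg from confounding
eps_2 = 0.80   # assumed multiplicative error, eg from non-response

rr_causal_estimate = rr_observed / (eps_1 * eps_2)
print(rr_causal_estimate)  # the two errors happen to cancel (1.25 * 0.80 = 1), so 2.0
```

Here the errors cancel exactly, which is precisely what a standard analysis assumes implicitly; had we assumed *ϵ*_{2} = 1.0 instead, the adjusted estimate would be 2.0/1.25 = 1.6.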

Below we show that, when the *ϵ*_{i} are properly parameterized, equations (3) and (4) correctly describe the mathematical relationship between RR_{observed}, RR_{causal} and study imperfections.

### What we want to know: RR_{causal}

As shown in table 1, let *A*_{i} (*i* = 1, 0) be the number of new cases of disease that would occur in the target (target population during the target (aetiological) time period) under exposure pattern *i*.17 Let *B*_{i} be the denominator of a disease-frequency measure under exposure pattern *i*.17 Then

*R*_{i} = *A*_{i}/*B*_{i}

is a measure of disease frequency under exposure pattern *i*, with the specific form of *R*_{i} depending on the specific form of the denominator, *B*_{i}.i

RR_{causal} = *R*_{1}/*R*_{0} = (*A*_{1}/*B*_{1})/(*A*_{0}/*B*_{0})

is the causal relative risk (ratio causal contrast17) for the effect of exposure pattern 1 versus exposure pattern 0ii in the target, with the specific form of RR_{causal} depending, again, on the specific form of *B*_{i}.

### Error terms: *ϵ*_{i}

Of the two disease-frequency measures *R*_{1} and *R*_{0}, at least one (and perhaps both) will be counterfactual and therefore unobservable.17 In order to estimate RR_{causal} we therefore must use a substitute disease-frequency measure in place of a counterfactual one.17

Let *C*_{1} and *D*_{1} denote the numerator and denominator, respectively, of a substitute disease-frequency measure under exposure pattern 1. Let *E*_{0} and *F*_{0} denote the numerator and denominator, respectively, of a substitute disease-frequency measure under exposure pattern 0.

#### Scenario 1

If the target experiences exposure pattern 1 (scenario 1), then *R*_{1} = *A*_{1}/*B*_{1} occurs and therefore is observable (at least in theory), but we must substitute *E*_{0}/*F*_{0} for the counterfactual disease-frequency measure *R*_{0} = *A*_{0}/*B*_{0}. Hence, we must substitute the association measure

RR_{association} = (*A*_{1}/*B*_{1})/(*E*_{0}/*F*_{0}) (5)

for the causal–contrast measure RR_{causal}.17

Assume for the moment that the study is perfectly executed, except possibly for imperfect substitution of *E*_{0}/*F*_{0} for the counterfactual *A*_{0}/*B*_{0}. Then,

RR_{observed} = (*A*_{1}/*B*_{1})/(*E*_{0}/*F*_{0}) = [(*A*_{1}/*B*_{1})/(*A*_{0}/*B*_{0})]×[(*A*_{0}/*B*_{0})/(*E*_{0}/*F*_{0})] = RR_{causal}×*ϵ*_{confounding}. (6)

In words, in equation (6) we have partitioned an observed relative risk into what we want (RR_{causal}) and a multiplicative error term. Since the only imperfection at this point in the study is imperfect substitution for a counterfactual (ie confounding17), the multiplicative error term measures the magnitude of confounding (before analytic adjustment for confounding – see *ϵ*_{statistical assumptions} below). Inspection of *ϵ*_{confounding} = (*A*_{0}/*B*_{0})/(*E*_{0}/*F*_{0}) shows that, indeed, it is measuring the magnitude of confounding, as it is the ratio of the counterfactual *A*_{0}/*B*_{0} to the substitute *E*_{0}/*F*_{0}.

#### Scenario 2

If instead the target experiences exposure pattern 0 (scenario 2), then *R*_{0} = *A*_{0}/*B*_{0} occurs and therefore is observable (at least in theory), but we must substitute *C*_{1}/*D*_{1} for the counterfactual disease-frequency measure *R*_{1} = *A*_{1}/*B*_{1}. Hence, we must substitute the association measure

RR_{association} = (*C*_{1}/*D*_{1})/(*A*_{0}/*B*_{0}) (7)

for the causal–contrast measure RR_{causal}.17

Assume for the moment that the study is perfectly executed, except possibly for imperfect substitution of *C*_{1}/*D*_{1} for the counterfactual *A*_{1}/*B*_{1}. Then,

RR_{observed} = (*C*_{1}/*D*_{1})/(*A*_{0}/*B*_{0}) = [(*A*_{1}/*B*_{1})/(*A*_{0}/*B*_{0})]×[(*C*_{1}/*D*_{1})/(*A*_{1}/*B*_{1})] = RR_{causal}×*ϵ*_{confounding}. (8)

In equation (8) we have partitioned an observed relative risk into what we want (RR_{causal}) and a multiplicative error term for confounding, *ϵ*_{confounding} = (*C*_{1}/*D*_{1})/(*A*_{1}/*B*_{1}). Note that *ϵ*_{confounding} has different parameters under scenario 2 than it did under scenario 1. Although it still is the ratio of a substitute and a counterfactual, under scenario 2 the counterfactual and the substitute are different than under scenario 1.

#### Scenario 3

If instead the target experiences an exposure pattern other than 1 or 0 (scenario 3), both *R*_{1} = *A*_{1}/*B*_{1} and *R*_{0} = *A*_{0}/*B*_{0} are counterfactual. We therefore must substitute *C*_{1}/*D*_{1} for the counterfactual disease-frequency measure *A*_{1}/*B*_{1} and *E*_{0}/*F*_{0} for the counterfactual *A*_{0}/*B*_{0}. Hence, we must substitute the association measure

RR_{association} = (*C*_{1}/*D*_{1})/(*E*_{0}/*F*_{0}) (9)

for the causal–contrast measure RR_{causal}.17

Assume for the moment that the study is perfectly executed, except possibly for imperfect substitution for counterfactuals. Then,

RR_{observed} = (*C*_{1}/*D*_{1})/(*E*_{0}/*F*_{0}) = RR_{causal}×[(*C*_{1}/*D*_{1})/(*A*_{1}/*B*_{1})]×[(*A*_{0}/*B*_{0})/(*E*_{0}/*F*_{0})]. (10)

In equation (10) we have partitioned an observed relative risk into what we want (RR_{causal}) and two multiplicative error terms for confounding – one for the substitution of *C*_{1}/*D*_{1} for the counterfactual *A*_{1}/*B*_{1}, and the other for the substitution of *E*_{0}/*F*_{0} for the counterfactual *A*_{0}/*B*_{0}.

Throughout the rest of the paper we will use the notation for scenario 1. One can easily change the equations below to conform to scenario 2 or 3 by making the appropriate substitutions for counterfactuals.

#### *ϵ*_{losses}, *ϵ*_{sampling}, *ϵ*_{nonresponse}, *ϵ*_{missing data}

In addition to confounding, studies typically have selection imperfections (eg losses to follow-up, non-random subject sampling, subject non-response, missing data).

Let “Followed” be the subset of the target and substitute populations that an investigator could invite to participate in a study (fig 1). For example, contact information might be available for only a subset of people in the target and substitutes; the rest are “lost to follow-up”. Let “Sampled into study” be those people whom the investigator actually invites to participate in the study. Observing all people in the target and substitute could increase study precision, whereas observing a fraction could reduce study cost or time. In general, one selects subjects to study from target and substitutes to balance trade-offs among bias, precision, study costs and study time.28 29 For example, in a case–control study, only a small fraction of the non-cases or population at risk are invited to participate in the study. Let “Participants” be those individuals who actually participate in the study; subject non-response can occur at this step. Finally, let “Data analysed” be the data analysed, after some subjects are excluded from analysis for various reasons. For example, subjects missing data on key variables are sometimes excluded from analysis.

Figures 1 and 2 show a series of nested two-by-two tables. The outer two-by-two table represents the entire target and substitute. Successively smaller tables represent the subset of the target and substitute at different points in the subject-selection process. The innermost table represents the subset of the target and substitute that is available for data analysis, after losses to follow-up, subject sampling, non-response, and exclusions from analysis because of missing data.

Figure 1 shows the notation we use to denote the number of people (or amount of person-time) in each cell of the nested two-by-two tables. For example:

- *A*_{1} = The number of exposed cases in the target population.
- *a*_{11} = The number of exposed cases in the target population who are followed.
- *a*_{12} = The number of exposed cases who are invited to participate in the study.
- *a*_{13} = The number of exposed cases who actually participate in the study.
- *a*_{14} = The number of exposed cases who are included in the analysis.

In fig 2, let *α*, *β*, *γ*, and *δ* be the proportion of subjects that move from a larger group into the next smaller group. For example:

- *α*_{11} = The proportion of the exposed cases in the target population who are followed = a_{11}/A_{1}.
- *α*_{12} = Of the exposed cases who are followed, the proportion that are invited to participate in the study = a_{12}/a_{11}.
- *α*_{13} = Of the exposed cases who are invited to participate, the proportion that actually do participate in the study = a_{13}/a_{12}.
- *α*_{14} = Of the exposed cases who participate in the study, the proportion that are included in the analysis = a_{14}/a_{13}.

Assume for the moment that the only study imperfections are confounding and selection imperfections. Then, extending Kleinbaum *et al.*30 and Greenland and Criqui,31 at the end of the selection process,

RR_{observed} = (*a*_{14}/*b*_{14})/(*e*_{04}/*f*_{04}) = RR_{causal}×*ϵ*_{confounding}×*ϵ*_{losses}×*ϵ*_{sampling}×*ϵ*_{nonresponse}×*ϵ*_{missing data}, (11)

where *ϵ*_{losses} = (*α*_{11}/*β*_{11})/(*γ*_{01}/*δ*_{01}), *ϵ*_{sampling} = (*α*_{12}/*β*_{12})/(*γ*_{02}/*δ*_{02}), *ϵ*_{nonresponse} = (*α*_{13}/*β*_{13})/(*γ*_{03}/*δ*_{03}) and *ϵ*_{missing data} = (*α*_{14}/*β*_{14})/(*γ*_{04}/*δ*_{04}).

In equation (11) we began by noting that at the end of the selection process (and assuming for the moment that the only other study imperfection was confounding), the observed relative risk would be calculated using the innermost two-by-two table in fig 1. We then made four substitutions (see fig 2): *a*_{14} = *A*_{1}*α*_{11}*α*_{12}*α*_{13}*α*_{14}, *b*_{14} = *B*_{1}*β*_{11}*β*_{12}*β*_{13}*β*_{14}, *e*_{04} = *E*_{0}*γ*_{01}*γ*_{02}*γ*_{03}*γ*_{04}, and *f*_{04} = *F*_{0}*δ*_{01}*δ*_{02}*δ*_{03}*δ*_{04}. We then partitioned the equation into five parts: RR_{association} and an error term for each of the four steps in the selection process as we have described it. Finally, we substituted RR_{causal}×*ϵ*_{confounding} for RR_{association} (see equation (6)).
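The selection decomposition in equation (11) can be checked numerically. In the sketch below, every count and selection proportion is invented for illustration; the assertion verifies that the relative risk from the innermost table equals RR_{association} times the product of the four per-step error terms:

```python
from math import prod

# Hypothetical outer-table counts (target and substitute)
A1, B1 = 100, 1000    # exposed cases and exposed denominator
E0, F0 = 60, 1200     # substitute unexposed cases and denominator

# Hypothetical selection proportions for the four steps
# (followed, sampled, responded, analysed), by cell:
alpha = [0.90, 1.0, 0.80, 0.95]   # exposed cases
beta  = [0.85, 1.0, 0.75, 0.90]   # exposed denominator
gamma = [0.90, 0.5, 0.80, 0.95]   # unexposed cases
delta = [0.80, 0.5, 0.70, 0.90]   # unexposed denominator

# Inner-table counts after selection (the fig 2 substitutions)
a14 = A1 * prod(alpha)
b14 = B1 * prod(beta)
e04 = E0 * prod(gamma)
f04 = F0 * prod(delta)

rr_inner = (a14 / b14) / (e04 / f04)   # RR from the innermost table
rr_assoc = (A1 / B1) / (E0 / F0)       # RR_association from the outer table

# Per-step selection error terms: eps_k = (alpha_k / beta_k) / (gamma_k / delta_k)
eps = [(a / b) / (g / d) for a, b, g, d in zip(alpha, beta, gamma, delta)]

# The decomposition in equation (11): inner RR = RR_association * product of eps
assert abs(rr_inner - rr_assoc * prod(eps)) < 1e-9
```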

In addition to confounding and selection imperfections, when exposure and outcome data are collected there may be measurement error. Let an asterisk superscript denote a variable whose value is observed with measurement error.

Assume for the moment that the only study imperfections are confounding, selection imperfections, and measurement error. Then,

RR_{observed} = (*a*^{*}_{14}/*b*^{*}_{14})/(*e*^{*}_{04}/*f*^{*}_{04}) = [(*a*_{14}/*b*_{14})/(*e*_{04}/*f*_{04})]×*ϵ*_{measurement} = RR_{causal}×*ϵ*_{confounding}×*ϵ*_{losses}×*ϵ*_{sampling}×*ϵ*_{nonresponse}×*ϵ*_{missing data}×*ϵ*_{measurement}, (12)

where *ϵ*_{measurement} = [(*a*^{*}_{14}/*b*^{*}_{14})/(*e*^{*}_{04}/*f*^{*}_{04})]/[(*a*_{14}/*b*_{14})/(*e*_{04}/*f*_{04})].

In equation (12) we began by noting that with confounding, selection imperfections and measurement error only, the observed relative risk would be calculated using the measured-with-error versions of the variables in the innermost two-by-two table of fig 1: that is, RR_{observed} = (*a*^{*}_{14}/*b*^{*}_{14})/(*e*^{*}_{04}/*f*^{*}_{04}). We then multiplied this observed relative risk by the factor [(*a*_{14}/*b*_{14})/(*e*_{04}/*f*_{04})]/[(*a*_{14}/*b*_{14})/(*e*_{04}/*f*_{04})] ( = 1). We algebraically rearranged the equation, and in the process we partitioned RR_{observed} into two terms: the relative risk that would have been observed in the absence of measurement error, (*a*_{14}/*b*_{14})/(*e*_{04}/*f*_{04}), and a multiplicative error factor for the effect of measurement error on the observed relative risk, *ϵ*_{measurement}. We then noted that above we had previously partitioned the term (*a*_{14}/*b*_{14})/(*e*_{04}/*f*_{04}) into RR_{causal} and error terms for confounding and selection imperfections (see equation (11)); we substituted these into equation (12).

The term *ϵ*_{measurement} is the ratio of two relative risks: that calculated with measurement error (as well as confounding and selection errors) and that which would have been calculated in the absence of measurement error (but, again, with confounding and selection errors). This latter relative risk is a function of the cell counts observed with measurement error (*a*^{*}_{14}, *b*^{*}_{14}, *e*^{*}_{04}, *f*^{*}_{04}) and parameters that describe the magnitude of the measurement error. For example, when there is no error in measuring the study outcome, and the exposure variable has only two levels, the cell counts that would have been observed in the absence of exposure measurement error can be expressed as a function of the observed (measured-with-error) values and exposure classification sensitivities and specificities:

*a*_{14} = [*a*^{*}_{14}−(1−Sp_{cases})(*a*^{*}_{14}+*e*^{*}_{04})]/(Se_{cases}+Sp_{cases}−1), (13)

*e*_{04} = (*a*^{*}_{14}+*e*^{*}_{04})−*a*_{14}, (14)

*b*_{14} = [*b*^{*}_{14}−(1−Sp_{denominators})(*b*^{*}_{14}+*f*^{*}_{04})]/(Se_{denominators}+Sp_{denominators}−1), (15)

*f*_{04} = (*b*^{*}_{14}+*f*^{*}_{04})−*b*_{14}, (16)

where Se and Sp are the classification sensitivities and specificities, respectively. In more complicated situations, matrix algebra can be used instead of equations (13–16) to calculate the cell counts that would have been observed in the absence of measurement error.32
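A minimal sketch of this back-calculation for a binary exposure measured with error, assuming no outcome misclassification. All counts and Se/Sp values below are invented for illustration; the back-calculation uses the standard identity that the observed exposed count equals Se times the true exposed count plus (1 − Sp) times the true unexposed count, with the row total preserved:

```python
def correct_exposure_counts(obs_exposed, obs_unexposed, se, sp):
    """Back-calculate the counts that would have been observed without
    exposure misclassification, given classification sensitivity (se)
    and specificity (sp). Solves:
        obs_exposed = se * true_exposed + (1 - sp) * true_unexposed,
    with the row total preserved."""
    total = obs_exposed + obs_unexposed
    true_exposed = (obs_exposed - (1 - sp) * total) / (se + sp - 1)
    return true_exposed, total - true_exposed

# Hypothetical observed (measured-with-error) inner-table counts
a_star, e_star = 120, 80     # cases: observed exposed / unexposed
b_star, f_star = 500, 500    # denominators: observed exposed / unexposed
se_cases, sp_cases = 0.90, 0.95
se_denom, sp_denom = 0.85, 0.90

a14, e04 = correct_exposure_counts(a_star, e_star, se_cases, sp_cases)
b14, f04 = correct_exposure_counts(b_star, f_star, se_denom, sp_denom)

rr_star = (a_star / b_star) / (e_star / f_star)   # RR with measurement error
rr_corrected = (a14 / b14) / (e04 / f04)          # RR without measurement error
eps_measurement = rr_star / rr_corrected          # the ratio defined in the text
```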

When data are analysed, data are combined with statistical assumptions.1 2 33 Errors can be introduced into the study results if statistical assumptions are incorrect about the relationship between variables in a statistical model (eg multiplicative vs additive), the shape of the dose–response relationship between a variable and the study outcome (eg linear vs exponential), which variables must be included in a model to control confounding, and the impact of random error on the outcome.1 2 34–39

After statistical analysis,

RR_{adjusted} = RR_{unadjusted}×(RR_{adjusted}/RR_{unadjusted}) = RR_{causal}×*ϵ*_{confounding}×*ϵ*_{losses}×*ϵ*_{sampling}×*ϵ*_{nonresponse}×*ϵ*_{missing data}×*ϵ*_{measurement}×*ϵ*_{statistical assumptions}. (17)

In equation (17), we began by noting that statistical analysis yields RR_{adjusted} (a statistically adjusted relative risk). We then multiplied the equation by the factor RR_{unadjusted}/RR_{unadjusted} ( = 1). We algebraically rearranged the equation, and in the process we partitioned RR_{adjusted} into RR_{unadjusted} (the unadjusted, or crude, relative risk) and the impact of statistical adjustment on the unadjusted relative risk, *ϵ*_{statistical assumptions} = RR_{adjusted}/RR_{unadjusted} (expressed as the ratio of the adjusted to the unadjusted relative risk). We then noted that RR_{unadjusted} is simply the crude relative risk one would calculate using the observed cell counts, which is equal to (*a*^{*}_{14}/*b*^{*}_{14})/(*e*^{*}_{04}/*f*^{*}_{04}). We then noted that we previously had partitioned (*a*^{*}_{14}/*b*^{*}_{14})/(*e*^{*}_{04}/*f*^{*}_{04}) into RR_{causal} and error terms for the effect of other study imperfections (see equation (12)); we substituted that partitioned expression for RR_{unadjusted} into equation (17).

If statistical analysis controls confounding, then *ϵ*_{statistical assumptions}×*ϵ*_{confounding} = 1. Unfortunately, if statistical assumptions are incorrect or the data contain errors, this may not happen.

We finish our equation by including a term for the effect of random error on an observed relative risk, *ϵ*_{random}:

RR_{observed} = RR_{causal}×*ϵ*_{confounding}×*ϵ*_{losses}×*ϵ*_{sampling}×*ϵ*_{nonresponse}×*ϵ*_{missing data}×*ϵ*_{measurement}×*ϵ*_{statistical assumptions}×*ϵ*_{random}. (18)

There are several possible sources of random error in an epidemiological study result: random allocation in a randomized trial, random sampling of subjects into a study, random measurement error, and random variation due to the stochastic nature of the study outcome. The term *ϵ*_{random} captures only the last one: random error due to a study outcome that is believed to follow a stochastic disease-occurrence model. If the study outcome is believed to follow a deterministic disease-occurrence model, then *ϵ*_{random} = 1.

In a randomized trial, random allocation of subjects to treatment levels causes random error by causing random variation in confounding;33 therefore this error is best captured by the term *ϵ*_{confounding}. In a study with random sampling of subjects into the study, random error is the result of random sampling; therefore this error is best captured by the term *ϵ*_{sampling}. When there is believed to be random error in the study result due to random measurement error, this error is best captured by *ϵ*_{measurement}.

## AN EQUATION FOR ADJUSTING AN OBSERVED RELATIVE RISK FOR STUDY IMPERFECTIONS

We can rearrange equation (18) into the same form as equation (3), to form an equation that can be used to adjust an observed relative risk for study imperfections:

*E*(RR_{causal}) = RR_{observed}/(*ϵ*_{confounding}×*ϵ*_{losses}×*ϵ*_{sampling}×*ϵ*_{nonresponse}×*ϵ*_{missing data}×*ϵ*_{measurement}×*ϵ*_{statistical assumptions}×*ϵ*_{random}). (19)

Equation (19) is a mathematical description of the relationship between a causal relative risk, an observed relative risk and error terms for study imperfections.

## A TYPICAL EPIDEMIOLOGICAL STUDY ASSUMES STUDY IMPERFECTIONS CANCEL

In a typical epidemiological study, we do not quantitatively account for most of the error terms in equation (19). We attempt to account for *ϵ*_{confounding} by random allocation to treatment, restriction, stratification or modelling. Rarely, if ever, does a quantitative epidemiological analysis account for any of the other error terms in equation (19). Hence, causal interpretation of an epidemiological study result requires the implicit assumption that the unaccounted-for error terms cancel, because only when

*ϵ*_{confounding}×*ϵ*_{losses}×*ϵ*_{sampling}×*ϵ*_{nonresponse}×*ϵ*_{missing data}×*ϵ*_{measurement}×*ϵ*_{statistical assumptions}×*ϵ*_{random} = 1

does *E*(RR_{causal}) = RR_{observed}. Unfortunately, there is neither theoretical nor empirical evidence to support this implicit assumption.

We believe we can do better by carefully thinking about the values of the error terms and by using equation (19) to adjust an observed relative risk for study imperfections.

## ADJUSTING AN OBSERVED RELATIVE RISK FOR STUDY IMPERFECTIONS

### Specifying values for error terms

To adjust an observed relative risk for study imperfections, one specifies plausible ranges of values for each of the error-term parameters in equation (19), perhaps specified as probability distributions.

The term *ϵ*_{confounding} is a function of counterfactual and substitute disease frequencies. Because a counterfactual disease frequency cannot ever be directly observed, *ϵ*_{confounding} is often difficult to specify with any degree of certainty, the one exception being a large randomized trial, where *ϵ*_{confounding} is increasingly likely to be close to 1 as the number of randomly allocated subjects increases.33

When we use statistical analysis in an attempt to control confounding, if we believe that the analysis model is correctly specified and that errors in the data do not interfere with the model’s ability to control confounding, then we may believe that *ϵ*_{confounding}×*ϵ*_{statistical assumptions} = 1. These conditions are unlikely to occur in practice, however, and we therefore should usually believe that *ϵ*_{confounding}×*ϵ*_{statistical assumptions}≠1, with this product measuring the combined impact of uncontrolled confounding and error caused by incorrect statistical assumptions. If we do not use statistical analysis (eg when we calculate a crude relative-risk estimate from an observed two-by-two table), *ϵ*_{statistical assumptions} = 1.

The selection error terms (*ϵ*_{losses}, *ϵ*_{sampling}, *ϵ*_{nonresponse}, *ϵ*_{missing data}) are functions of the proportions of people (or person-time) that move from one stage of the selection process into the next stage of the selection process as we have described it in figs 1 and 2. For example,

*ϵ*_{nonresponse} = (*α*_{13}/*β*_{13})/(*γ*_{03}/*δ*_{03}),

where *α*_{13} is the proportion of cases who agree to participate in the study from among those cases who experience exposure pattern 1 and were sampled into the study, *γ*_{03} is the proportion of cases who agree to participate in the study from among those cases who experience exposure pattern 0 and were sampled into the study, *β*_{13} is the proportion of the denominator calculated from subjects who agree to participate in the study from among those who experience exposure pattern 1 and were sampled into the study, and *δ*_{03} is the proportion of the denominator calculated from subjects who agree to participate in the study from among those who experience exposure pattern 0 and were sampled into the study. Note that these selection proportions describe the selection process in the absence of exposure or disease measurement error.

Often we will know the total number of losses, the overall sampling fractions for cases and denominators, the total number of refusals, and the total number of subjects excluded from analysis for reasons of missing data. Most of the time, however, we will not know this information by exposure status. Even though this results in uncertainty in the values of the selection-error parameters, the available information can be used to put bounds on the values of the selection-error parameters.

The measurement error term *ϵ*_{measurement} is a function of observed cell counts and parameters that describe the magnitude of the measurement error (see, for example, equations (13–16)) for the study subjects that are included in the data analysis. The ideal situation is a study that has included a substudy designed to estimate the measurement error parameters.

The error term *ϵ*_{random} describes random variation in the occurrence of the study outcome due to its stochastic nature. Specification of this error term requires knowledge about the biological or social process that causes the study outcome.

In some cases, one would expect values of error-term parameters to be correlated. For example, one might expect values of *β*_{13} and *δ*_{03}, the non-response proportions for exposed and unexposed denominators, respectively, to not be wildly different. In these situations, one should specify correlations between these parameter values.

### Calculating adjusted relative risks

If one knew the value of each of the error-term parameters with certainty, one could simply use equation (19) directly to adjust an observed relative risk for study imperfections. We do not, of course, have the luxury of such certainty in practice. We have, instead, plausible ranges of values for each of the error-term parameters, perhaps specified as probability distributions. In this situation, we have several options for calculating an adjusted relative risk.

### Non-probabilistic sensitivity analysis

One option is to perform a non-probabilistic sensitivity analysis. One chooses a manageable number of combinations of error-term parameter values and uses equation (19) to repeat the adjustment calculation for each combination. For an example of this approach, see Maldonado *et al.*14 Note that for our method the order of correction is not important; one need not adjust for each error term one-at-a-time in reverse order of how the errors occurred. In our method the impact of each study imperfection on study results has been mathematically separated from the others.
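The grid approach can be sketched as follows. All error-term values below are invented for illustration, and only three of the terms in equation (19) are varied; the adjustment reduces to dividing the observed relative risk by the product of the chosen *ϵ* values for each combination:

```python
from itertools import product as cartesian
from math import prod

rr_observed = 2.0  # hypothetical observed relative risk

# A manageable set of candidate values for each error term (all invented)
scenarios = {
    "confounding": [0.9, 1.0, 1.1],
    "nonresponse": [0.8, 1.0, 1.2],
    "measurement": [0.95, 1.05],
}

# Repeat the equation-(19) adjustment for every combination of values
results = []
for combo in cartesian(*scenarios.values()):
    rr_adjusted = rr_observed / prod(combo)
    results.append((dict(zip(scenarios, combo)), rr_adjusted))

lo = min(rr for _, rr in results)
hi = max(rr for _, rr in results)
print(f"Adjusted RR ranges from {lo:.2f} to {hi:.2f} over {len(results)} scenarios")
# prints: Adjusted RR ranges from 1.44 to 2.92 over 18 scenarios
```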

### Probabilistic sensitivity analysis

Another option is to perform a probabilistic sensitivity analysis (uncertainty analysis). One uses Monte Carlo simulation21 22 (Bayesian or non-Bayesian) to randomly sample from the set of all possible combinations of error-term parameter values while accounting for the investigator’s beliefs about the relative probabilities of the parameter values. This type of analysis yields a probability distribution for the relative risk adjusted for study imperfections. One can present the entire probability distribution, or one can present a summary measure of the distribution (eg 95% interval, geometric mean, probability that the adjusted relative risk falls between two specific values).
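A minimal (non-Bayesian) Monte Carlo sketch of this idea, with invented probability distributions for three error terms; a real analysis would specify distributions for all the terms in equation (19), along with any correlations between parameter values:

```python
import random
import statistics

random.seed(1)
rr_observed = 2.0   # hypothetical observed relative risk
n_draws = 100_000

adjusted = []
for _ in range(n_draws):
    # Draw each error term from an assumed distribution (all invented):
    eps_confounding = random.lognormvariate(0.0, 0.10)   # centred near 1
    eps_nonresponse = random.uniform(0.8, 1.2)
    eps_measurement = random.lognormvariate(0.05, 0.05)  # slight upward bias
    adjusted.append(rr_observed / (eps_confounding * eps_nonresponse * eps_measurement))

# Summarise the resulting distribution of the adjusted relative risk
adjusted.sort()
lo, hi = adjusted[int(0.025 * n_draws)], adjusted[int(0.975 * n_draws)]
print(f"Median adjusted RR: {statistics.median(adjusted):.2f}")
print(f"95% simulation interval: {lo:.2f} to {hi:.2f}")
```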

## DISCUSSION

Because a standard epidemiological analysis does not quantitatively account for study imperfections, it implicitly assumes that study imperfections have no important impact on study results. This assumption is equivalent to the assumption that the product of the error terms for study imperfections equals 1. We know of no theoretical or empirical justification for this implicit assumption.

In this paper we have argued that this implicit assumption should be replaced with better assumptions about the impact of study imperfections on study results, and we have developed and presented a mathematical equation that an investigator can use to do so for relative-risk estimates.

Investigators often evaluate informally and qualitatively the impact of study imperfections on study results (although discussion of this evaluation actually shows up in published manuscripts much less frequently than is usually thought40). Previously published examples have shown that this informal approach can be surprisingly misleading.12 The mathematical description presented in this paper shows that the relationship between study imperfections and error in study results is a complicated one – in our opinion too complicated to be adequately handled informally and qualitatively.

There is a growing body of literature on the topic of accounting for non-random error in epidemiological study results.3–27 Only one previous publication,10 however, presents a method for accounting quantitatively for the combined effect of all major study imperfections. Greenland's10 approach uses bias models with hierarchical components to take account of bias interdependencies. The current paper, in contrast, begins with counterfactual reasoning and the concept of a causal contrast (causal relative risk)17 and then describes the mathematical relationship between a causal contrast, an observed relative risk and error terms for study imperfections.

Our adjustment equation is applicable to any form of relative risk (eg incidence–proportion ratio, odds ratio, person–time rate ratio). What differentiates relative risks is the disease-incidence measure used, and what differentiates disease-incidence measures is the denominator. We have developed an adjustment equation that can be used with any desired type of disease-incidence denominator. Our adjustment equation can easily be adapted for prevalence measures (eg prevalence ratio, prevalence odds ratio); simply change the numerator of the disease-frequency measures from “new cases” to “prevalent cases”.

In practice it can be difficult to specify the values of error-term parameters with much certainty. We usually do not design our studies to estimate these parameters. Often in publications we do not give enough information to allow a reader to specify these parameters. This uncertainty has important practical implications: uncertainty in parameter values leads to uncertainty in the value of the relative risk adjusted for study imperfections. We call for change in this practice. Methods such as the one presented in this paper make clear that better information about error-term parameters is essential for proper interpretation of epidemiological study results.

#### What this study adds

This paper describes a method that enables a quantitative analysis of epidemiological data to account for study imperfections such as residual confounding, losses to follow-up, non-random subject sampling, subject non-response, missing data, measurement error and incorrect statistical assumptions. Currently, a standard quantitative analysis ignores the effect of these study imperfections on study results.

#### Policy implications

The method described in this paper enables an investigator to replace the assumption implicit in a standard quantitative epidemiological analysis that study imperfections have no important effect on study results with an explicit assumption about the likely effect of study imperfections on study results. In other words, this method enables an investigator to improve the assumptions made during quantitative analysis. This improvement in assumptions will result in improved study results, which will result in better policy decisions.

## Acknowledgments

We are grateful to Ulka Campbell, Timothy Church, Alice Cummings, Caroline Drews-Botsch, Dana Flanders, Matthew Fox, Nicolle Gatto, Michael Höfler, Pamela Johnson, Anne Jurek, Timothy Lash, Ronir Raggio Luiz, Jack Mandel, Nancy Nachreiner, Carl Phillips, Logan Spector, Andrew Ward, and several anonymous reviewers for helpful comments on earlier drafts of this manuscript.

## REFERENCES

## Footnotes

**Funding:** Early work on this topic was supported by grant number NIH/1R29-ES07986 from the National Institute of Environmental Health Sciences (NIEHS), NIH. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIEHS, NIH.

**Competing interests:** None.

↵i For example, if *B*_{i} is the number of disease-free people at the beginning of the target time period, then *R*_{i} is an incidence proportion. If *B*_{i} is the number of disease-free people at the end of the target time period, then *R*_{i} is an incidence odds. If *B*_{i} is the person-time during the target time period, then *R*_{i} is a person-time incidence rate.

↵ii We use “exposure pattern 1” and “exposure pattern 0” to denote generally any two exposure patterns that a causal contrast compares. “Exposure pattern 1” is not meant to imply that everyone in the target or substitute is exposed and experiences the same exposure level. “Exposure pattern 0” is not meant to imply that everyone in the target or substitute is unexposed.
