

A simple model for potential use with a misclassified binary outcome in epidemiology
S W Duffy (1), J Warwick (1), A R W Williams (2), H Keshavarz (3), F Kaffashian (4), T E Rohan (5), F Nili (6), A Sadeghi-Hassanabadi (6)

  1. Cancer Research UK Department of Epidemiology, Mathematics and Statistics, Wolfson Institute of Preventive Medicine, Queen Mary University of London, London, UK
  2. Department of Pathology, University of Edinburgh, UK
  3. Centers for Disease Control, Atlanta, USA
  4. Cancer Intelligence Unit, Strangeways Research Laboratory, Cambridge, UK
  5. Department of Epidemiology and Social Medicine, Albert Einstein College of Medicine, New York, USA
  6. Shiraz University of Medical Sciences, Shiraz, Iran

Correspondence to: Professor S W Duffy, Cancer Research UK Department of Epidemiology, Mathematics and Statistics, Wolfson Institute of Preventive Medicine, Queen Mary University of London, Charterhouse Square, London EC1M 6BQ, UK


Study objective: Error in determination of disease outcome occurs in epidemiology, but such error is not usually corrected for in statistical analysis. A method of correction of risk estimates for misclassification of a binary disease outcome is developed here.

Methods: The method is a simple, closed form correction to the logistic regression estimate. A closed form variance estimate is also developed.

Setting: The method is illustrated in two studies, a cross sectional survey of cervicitis in Iran in 1996–97, as determined by inflammation on cervical smear specimens, and a case-cohort study of benign proliferative epithelial disease of the breast, in Canada 1980–88.

Main results: The method provides corrected odds ratio estimates and corrects the spurious precision conferred by misclassification.

Conclusions: The method is easy to apply and potentially useful, although potential failures of the assumptions involved should be borne in mind. It is necessary to give careful consideration to the plausibility or otherwise of the assumptions in the context of the individual study. Correction for misclassification of disease outcome may become more common with the development of readily applicable methods.


Misclassification of risk factors in epidemiology has received considerable attention in the biostatistical literature over the past 30 years.1–4 Misclassification of clinical outcome, for example disease status, has received less attention, although examples of such misclassification do occur.5–7 Some work has been done on its effects in the case of the two by two contingency table, but this work is not specifically aimed at correcting risk estimates for the biasing effects.8–10 A reasonable formal approach to such problems, particularly when repeated determinations of disease status are performed, is the use of over-dispersion or latent variable models.11,12 A powerful and practical tool for taking into account additional error, such as error in the determination of disease status, is Monte Carlo simulation.13,14

It would be of some use in epidemiology to derive a method that directly relates to measures of relative rate of disease, such as odds ratios. In this paper, therefore, we develop a simple correction to the odds ratio to take account of misdetermination of disease status. The correction relates to a simple logistic regression model and has a closed form estimate. The correction is illustrated by two examples. The first considers risk factors for cervicitis, as determined by cytological observation of inflammation on cervical smears. The determination of inflammation is subject to misclassification. The second considers the association between dietary fat intake and risk of benign proliferative epithelial disorders of the breast. These disorders are identified histologically and are also subject to misclassification.

Our method is not a definitive solution to the problem of misclassification of disease state. It shows the effect of such misclassification on the odds ratio under certain assumptions, with an implied correction to the odds ratio, also dependent on those assumptions. Whether the assumptions, and therefore the correction, are reasonable will vary depending on the disease outcome and the method of its determination.


Methods

We develop the algebra for both symmetric misclassification and the more realistic situation of asymmetric misclassification. For simplicity of layout, we take the case of symmetric misclassification first.

Suppose we have a disease state (0 =  no disease, 1 =  disease) that is subject to random misclassification, with a probability of correct classification of α (0⩽α⩽1) and a probability of misclassification of 1−α.

We have data on disease state and on a binary risk factor in N individuals. We assume for simplicity that the risk factor is measured without error, although this is not absolutely necessary. Under certain assumptions, misclassification of the risk factor can be corrected for after correction for the misclassification of the disease outcome (see Discussion). Denote the risk factor by x (x = 0 for exposure negative, x = 1 for exposure positive). Let the overall true probability of disease be p. Under the logistic model

$$ y_1 = \beta_0 + \beta_1 $$

where y1 is the log odds of case status with exposure positive, β0 is the intercept, and β1 is the log odds ratio. We also have

$$ y_0 = \beta_0 $$

where y0 is the log odds of case status with exposure negative.

Now note that the odds ratio is invariant to the direction of dependence in the logistic regression model. Thus if z1 is the log odds of exposure positive status given disease present and z0 the log odds given disease absent, we have

$$ z_1 = \gamma_0 + \beta_1 $$

$$ z_0 = \gamma_0 $$

where γ0 is the intercept. That is, the intercept in the logistic regression equation differs, but the regression coefficient (the log odds ratio) is the same for dependence of disease state on risk factor as for risk factor on disease state. Then

$$ \beta_1 = z_1 - z_0 $$

However, if the observed probability of positive disease is p*, then

$$ p^* = p\alpha + (1 - p)(1 - \alpha) \qquad (1) $$

where p is the true probability of positive disease status. Thus of those with observed positive disease status, the proportion who truly are disease cases is pα/p*. Similarly among those observed to be disease negative, the proportion truly negative is (1−p)α/(1−p*). Therefore, among those ostensibly positive, instead of observing z1 we observe w1 that has expectation

$$ E(w_1) = \frac{p\alpha}{p^*} E(z_1) + \left(1 - \frac{p\alpha}{p^*}\right) E(z_0) $$

Among those ostensibly negative, instead of observing z0, we observe w0 which has expectation

$$ E(w_0) = \frac{(1 - p)\alpha}{1 - p^*} E(z_0) + \left(1 - \frac{(1 - p)\alpha}{1 - p^*}\right) E(z_1) $$

Thus, instead of observing β̂1, which has expectation E(z1)−E(z0) we observe β̂1* that has expectation

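$$ E(\hat\beta_1^*) = E(w_1) - E(w_0) = \left\{\frac{p\alpha}{p^*} + \frac{(1 - p)\alpha}{1 - p^*} - 1\right\}\{E(z_1) - E(z_0)\} $$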

In the case of asymmetric misclassification, with α1 representing the correct classification probability for a true positive and α2 that for a true negative, we would observe

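$$ E(\hat\beta_1^*) = \left\{\frac{p\alpha_1}{p^*} + \frac{(1 - p)\alpha_2}{1 - p^*} - 1\right\} E(\hat\beta_1), \qquad p^* = p\alpha_1 + (1 - p)(1 - \alpha_2) $$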

If PPV denotes positive predictive value of the observed outcome and NPV its negative predictive value, both formulas can be re-expressed as

$$ E(\hat\beta_1^*) = (\mathrm{PPV} + \mathrm{NPV} - 1)\, E(\hat\beta_1) $$
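As a minimal numerical sketch of this relation, the factor PPV + NPV − 1 can be computed from an assumed true prevalence and a pair of correct classification probabilities. The function and argument names are illustrative rather than taken from the paper, and the inputs are hypothetical.

```python
def attenuation_factor(p, alpha_pos, alpha_neg):
    """Return (p_star, ppv, npv, factor) for true prevalence p.

    alpha_pos: P(observed positive | truly positive)
    alpha_neg: P(observed negative | truly negative)
    factor = PPV + NPV - 1, the multiplier applied to the true log odds ratio.
    """
    # Observed prevalence: true positives retained plus true negatives flipped.
    p_star = p * alpha_pos + (1 - p) * (1 - alpha_neg)
    ppv = p * alpha_pos / p_star
    npv = (1 - p) * alpha_neg / (1 - p_star)
    return p_star, ppv, npv, ppv + npv - 1


# Hypothetical inputs: 30% true prevalence, 90% correct classification.
p_star, ppv, npv, factor = attenuation_factor(0.3, 0.9, 0.9)
print(round(p_star, 3), round(ppv, 3), round(npv, 3), round(factor, 3))
```

Setting both classification probabilities equal gives the symmetric case; different values give the asymmetric case.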

So if we estimate the naive log odds ratio β̂1* (even by logistic regression of disease state on risk factor), we can estimate the corrected log odds ratio β̂1 by

$$ \hat\beta_1 = \frac{\hat\beta_1^*}{\mathrm{PPV} + \mathrm{NPV} - 1} $$

The correction factor can be expressed in terms of p* and α. In the case of symmetric misclassification, from equation (1) we have

$$ p = \frac{p^* + \alpha - 1}{2\alpha - 1} \qquad (3) $$

And so, after some routine algebra

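$$ \mathrm{PPV} + \mathrm{NPV} - 1 = \frac{p^*(1 - p^*) - \alpha(1 - \alpha)}{p^*(1 - p^*)(2\alpha - 1)} $$

$$ \hat\beta_1 = \hat\beta_1^* \, \frac{p^*(1 - p^*)(2\alpha - 1)}{p^*(1 - p^*) - \alpha(1 - \alpha)} \qquad (4) $$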

where p* is the observed disease prevalence in the study. The corresponding formula for asymmetric misclassification is

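$$ \hat\beta_1 = \hat\beta_1^* \, \frac{p^*(1 - p^*)(\alpha_1 + \alpha_2 - 1)}{p^*(1 - p^*) - \alpha_1(1 - \alpha_2)(1 - p^*) - \alpha_2(1 - \alpha_1)p^*} $$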
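The symmetric correction can be sketched in a few lines. The sketch assumes the correction factor takes the form [p*(1−p*) − α(1−α)] / [p*(1−p*)(2α−1)], which is what results from substituting p = (p* + α − 1)/(2α − 1) into PPV + NPV − 1. The function name and the input values are hypothetical.

```python
import math


def correct_log_or_symmetric(beta_star, p_star, alpha):
    """Corrected log odds ratio under symmetric misclassification.

    beta_star: naive log odds ratio; p_star: observed prevalence;
    alpha: probability of correct classification (assumed above 0.5).
    """
    if not 0.5 < alpha <= 1:
        raise ValueError("alpha must lie in (0.5, 1]")
    var_obs = p_star * (1 - p_star)
    numerator = var_obs - alpha * (1 - alpha)
    if numerator <= 0:
        # Happens when the implied true prevalence falls outside (0, 1).
        raise ValueError("p_star is incompatible with alpha")
    factor = numerator / (var_obs * (2 * alpha - 1))
    return beta_star / factor


# Hypothetical inputs: naive OR of 2.0, observed prevalence 0.34, alpha 0.9.
beta = correct_log_or_symmetric(math.log(2.0), 0.34, 0.9)
corrected_or = math.exp(beta)
print(round(corrected_or, 2))
```

The guard on the numerator matters in practice: as the observed prevalence approaches α or 1 − α the factor tends to zero and the corrected estimate becomes very unstable.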

If we have repeated data a, b, c, d on disease status as in table 1, we can estimate α as

Table 1

 Notation for repeated data on disease outcome

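$$ \hat\alpha = \frac{1}{2}\left\{1 + \sqrt{1 - \frac{2(b + c)}{a + b + c + d}}\right\} $$

taking a and d to be the numbers of concordant pairs and b and c the numbers of discordant pairs.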

This is shown in the appendix, which also shows that the variance of α can be estimated as

Embedded Image

The variance of our corrected estimate in the case of symmetric misclassification is therefore

Embedded Image

Assuming that p* is fixed by design or is from a study sufficiently large that it can be assumed to have negligible variance, we have

Embedded Image

For asymmetric misclassification, assuming that α1 and α2 are estimated independently, the variance is estimated as

Embedded Image

Thus we have closed form estimates both of the multiplicative correction for misclassification and of the variance of the corrected estimator.

This method applies only to logistic regression as it depends on the directional invariance of the odds ratio. However, it can still be used in a variety of study designs, provided logistic regression gives a reasonable approximation to the method of choice, for example log-linear or Cox regression. This essentially depends on the rare disease assumption.


Example 1 Cross sectional study of cervicitis

Surveys of cervical cytology in women in Southern Iran found a large proportion (70%–90%) of women with cervicitis.15,16 It is of some interest to identify aetiological factors responsible for this. Table 2 shows cervicitis status by bacterial vaginosis (BV) in a cross sectional survey of women in Southern Iran.17 Overall the observed prevalence of cervicitis is 85%.

Table 2

 Observed cervicitis and bacterial vaginosis

Logistic regression gives an unconditional prevalence log odds ratio of 0.48 for the association of BV with cervicitis. This implies an odds ratio of 1.62 (95% confidence interval (CI): 1.15 to 2.26). Table 3 shows repeat determination of cervicitis by two independent cytologists, in a sample of 72 women from Southern Iran. Assuming symmetric misclassification, this yields an estimate of α of 0.91 with variance 6.90×10⁻⁴. From equation (3) our corrected estimate of disease prevalence is 0.93. From equation (4), the corrected estimate of the log odds ratio is

Table 3

 Repeat determination of cervicitis in 72 women

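$$ \hat\beta_1 = 0.48 \times \frac{0.85 \times 0.15 \times (2 \times 0.91 - 1)}{0.85 \times 0.15 - 0.91 \times 0.09} = 1.10 $$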

with variance 1.16. Thus the corrected odds ratio estimate is 3.00 (95% CI: 0.35 to 25.27).
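The point estimates in this example can be checked with a short calculation. This is a sketch using the rounded inputs quoted in the text (α = 0.91, p* = 0.85, naive log odds ratio 0.48), so it agrees with the reported figures only to within rounding. It does not attempt the variance.

```python
import math

alpha = 0.91        # estimated correct classification probability
p_star = 0.85       # observed prevalence of cervicitis
beta_star = 0.48    # naive log odds ratio for bacterial vaginosis

# Corrected prevalence, solving p* = p*alpha + (1 - p)(1 - alpha) for p.
p = (p_star + alpha - 1) / (2 * alpha - 1)

# Predictive values of the observed outcome and the factor PPV + NPV - 1.
ppv = p * alpha / p_star
npv = (1 - p) * alpha / (1 - p_star)
factor = ppv + npv - 1

beta = beta_star / factor
print(f"prevalence {p:.2f}, factor {factor:.3f}, log OR {beta:.2f}")
```

The negative predictive value here is below 0.5, which is why the factor is small and the corrected estimate so much larger than the naive one.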

In view of the very high prevalence, one might wish to avoid the assumption of symmetric misclassification. In the absence of further data such as a gold standard validation study, the repeated data in table 3 would not be sufficient to estimate α1 and α2. We can, however, do so if we treat the second reader as if he/she were an expert panel or a definitive test method, and assume that the second reader invariably gives the correct response. In this case, we would estimate α1 as 50/56 = 0.89, with binomial variance (0.89×0.11)/56 = 0.0017 and α2 as 10/16 = 0.63 with variance 0.0146. This gives a corrected OR of 5.55 (95% CI: 0.21 to 149.02). Note that it is not uncommon to encounter the situation in epidemiology where because of expense or difficulty, the gold standard is only available on a small minority of subjects. This is replicated in the current example, where the overall study size is 1121 subjects, but determination by the second cytologist is only available for 72 of them (6%).

Example 2 Diet and benign proliferative breast disease

Here we consider a case-cohort study of diet and benign proliferative breast disease (BPBD). The study population comprised 545 cases and 4921 non-cases (so p* = 0.10), a sub-cohort of 56 537 women whose dietary intakes were assessed during the Canadian national breast screening study.18 Table 4 shows the results with respect to total fat intake. Approximating the Poisson (log-linear) regression by logistic regression gives an odds ratio of 0.87 (95% CI: 0.69 to 1.10). That is

Table 4

 Cases of benign proliferative breast disease and person years by total fat intake

Embedded Image

The Poisson analysis gave the same estimate of β but with a slightly lower variance of 0.0126. Cases of BPBD were determined by local pathologists, although some of the biopsied cases and non-cases underwent review by a reference pathologist. Table 5 shows the cross tabulation of local and reference pathologists’ findings. Assuming the reference pathologist to be correct, we would estimate the correct classification probability for those with true status positive to be α1 = 267/280 = 0.95, and the correct classification probability for those with true status negative to be α2 = 101/328 = 0.31, an apparent violation of the assumption of symmetric misclassification.

Table 5

 Cross tabulation of findings of case status as determined by local pathologist and findings of reference pathologist

These figures, however, relate to true positives and negatives who actually underwent biopsy because of a radiological or palpable abnormality. Biopsied subjects contain a large proportion of positive cases and those referred to the reference pathologist may not be a representative sample even of the biopsied subjects. The overall true correct classification probabilities are related to α1 and α2 by

Embedded Image


Embedded Image

As the subjects received some form of breast screening or instruction in breast self examination, it is reasonable to approximate P(biopsy|+) by 1.0. That is, we assume that those who have the disease outcome undergo biopsy. We therefore take α1 as our estimate of α, with variance 0.95×0.05/280 = 0.00017. Estimation of the correct classification of truly negative subjects is rendered difficult by a lack of knowledge about P(biopsy|−), which can only be estimated after considerably more algebra and further assumptions. However, if it is approximately equal to the proportion in the sub-cohort who were biopsied with an observed negative result, it is approximately 0.1, which would yield a correct classification probability estimate of just above 0.93. The assumption is reasonable because with repeated breast screening around 10% of subjects might be expected to have at least one suspicious finding requiring further diagnostic examination. Thus the assumption of a common α equal to 0.95 seems reasonable. This gives a corrected log risk ratio of

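$$ \hat\beta_1 = \ln(0.87) \times \frac{0.10 \times 0.90 \times (2 \times 0.95 - 1)}{0.10 \times 0.90 - 0.95 \times 0.05} = -0.27 $$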

with variance calculated from equation (7) as

Embedded Image

Thus the corrected odds ratio estimate is 0.76 with 95% CI (0.48 to 1.21).
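A rough check of this point estimate, with an illustrative look at how it moves if the common α were 0.93 (the approximate figure discussed above for truly negative subjects) rather than 0.95, might look as follows. The sketch uses the rounded inputs p* = 0.10 and a naive odds ratio of 0.87, so it matches the reported value only approximately. Comparing the two α values is an addition for illustration, not part of the original analysis.

```python
import math

p_star = 0.10                   # observed proportion of cases
beta_star = math.log(0.87)      # naive log odds ratio for total fat intake

results = {}
for alpha in (0.93, 0.95):
    var_obs = p_star * (1 - p_star)
    # Symmetric correction factor, equal to PPV + NPV - 1.
    factor = (var_obs - alpha * (1 - alpha)) / (var_obs * (2 * alpha - 1))
    results[alpha] = math.exp(beta_star / factor)
    print(f"alpha {alpha:.2f}: factor {factor:.3f}, corrected OR {results[alpha]:.2f}")
```

With a low observed prevalence, a small change in α shifts the corrected estimate noticeably, which is one reason to examine the plausibility of the assumed α carefully.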


Discussion

We have developed a method for adjusting risk estimates for misclassification of the disease end point. The method has the advantage of simplicity, with a closed form estimate. Although the method can adjust for both symmetric and asymmetric misclassification, sometimes the data required for the latter are not available. In our example of cervicitis, we obtained asymmetric misclassification probabilities by using the assumption that the second reader’s observation was invariably correct. In practice, one might be as reluctant to make this assumption as to adopt the symmetric misclassification model. To obtain reliable asymmetric misclassification estimates, one would require either a validation study against a gold standard measure, a third determination in the repeated data, or the assumption that true prevalence in the repeated data study was identical to that in the main study. Otherwise, one has to fall back on the assumption of symmetric misclassification.

Table 6 shows the effect of violation of the equality assumption. When the assumption is satisfied, the corrected estimate is consistent. Where it is not, with an absolute difference of more than 0.1 the corrected estimate can be substantially in error. Thus the symmetry assumption is not advisable if it is strongly suspected that there is an absolute difference of more than 10% between the two misclassification probabilities.

Table 6

 Effect of violation of equality assumption when p = 0.3, p(exp|case) = 0.6, p(exp|control) = 0.3 and OR = 3.5
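The kind of comparison summarised in this table can be sketched from exact cell probabilities. The prevalence and exposure probabilities below come from the caption. The two misclassification scenarios, and the rule that an analyst wrongly assuming symmetry would use the average of the two correct classification probabilities, are assumptions made for illustration and are not a reconstruction of the table itself.

```python
import math


def observed_log_or(p, e_case, e_ctrl, a_pos, a_neg):
    """Log OR between exposure and observed (misclassified) disease status.

    p: true prevalence; e_case, e_ctrl: P(exposed | true case / non-case);
    a_pos, a_neg: correct classification probabilities.
    """
    # Joint probabilities of (observed status, exposure).
    pos_exp = p * a_pos * e_case + (1 - p) * (1 - a_neg) * e_ctrl
    pos_unexp = p * a_pos * (1 - e_case) + (1 - p) * (1 - a_neg) * (1 - e_ctrl)
    neg_exp = p * (1 - a_pos) * e_case + (1 - p) * a_neg * e_ctrl
    neg_unexp = p * (1 - a_pos) * (1 - e_case) + (1 - p) * a_neg * (1 - e_ctrl)
    return math.log(pos_exp * neg_unexp / (pos_unexp * neg_exp))


def symmetric_correction(beta_star, p_star, alpha):
    """Apply the symmetric correction to a naive log odds ratio."""
    var_obs = p_star * (1 - p_star)
    return beta_star * var_obs * (2 * alpha - 1) / (var_obs - alpha * (1 - alpha))


p, e_case, e_ctrl = 0.3, 0.6, 0.3      # values from the table caption
true_or = (e_case / (1 - e_case)) / (e_ctrl / (1 - e_ctrl))   # 3.5

rows = []
for a_pos, a_neg in [(0.90, 0.90), (0.95, 0.80)]:   # hypothetical scenarios
    p_star = p * a_pos + (1 - p) * (1 - a_neg)
    beta_star = observed_log_or(p, e_case, e_ctrl, a_pos, a_neg)
    assumed_alpha = (a_pos + a_neg) / 2   # assumption: symmetry imposed anyway
    corrected = math.exp(symmetric_correction(beta_star, p_star, assumed_alpha))
    rows.append((a_pos, a_neg, math.exp(beta_star), corrected))
    print(a_pos, a_neg, round(math.exp(beta_star), 2), round(corrected, 2))
```

In the symmetric scenario the corrected odds ratio lands close to the true 3.5. In the asymmetric scenario, with a difference of 0.15 between the two probabilities, the symmetric correction leaves a clearly larger error.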

Key points

This paper demonstrates a simple method for correction for bias attributable to misclassification of disease outcome in epidemiological studies.

The correction for misclassification in the cervicitis example yielded both substantial changes to the odds ratio and a considerable widening of the confidence intervals. This illustrates several points. Firstly, in the presence of misclassification, unadjusted for, there may be a significant element of spurious precision. The results from the adjusted analysis in the cervicitis study indicate that there is much more uncertainty about the relation between cervicitis and BV than an uncorrected analysis would suggest. Secondly, if the observed proportion of negative cases is close to the false negative rate, there is a considerable loss of information and consequently a wide range of uncertainty in estimation.

It is likely that, as in many epidemiological situations, most of the assumptions are incorrect. None of symmetric misclassification, equality of misclassification probability between repeat measures, or the existence of a perfect method of disease determination is likely to hold absolutely, although one or other of them may be a reasonable approximation. A good starting point is to consider the possibility of serious misdetermination of disease state. Where such misdetermination is likely on the basis of knowledge of the process of diagnosis and classification, it might then be wise to derive the estimates above, in the first instance to assess the likely magnitude of the effect of such errors on the odds ratio estimates and to gain a more realistic estimate of the uncertainty around these. The credibility of the corrected odds ratios as estimates of the true effects on risk will depend on careful consideration of the plausibility of the assumptions as approximations (because they will very rarely hold exactly true) in the context of the particular disease and method of determination. The corrected odds ratios should not be interpreted as reliable estimates of the true effects unless such consideration supports the assumptions, and all qualifications on the assumptions should be reported.

In the presence of misclassification of the risk factor in addition to that of the disease outcome, the same relation between the observed and true log odds ratios would apply, that is

$$ E(\hat\beta_1^*) = (\mathrm{PPV} + \mathrm{NPV} - 1)\, E(\hat\beta_1) $$

In this case, however, the PPV and NPV refer to positive and negative predictive values of observed for true risk factor rather than disease state. If the two misclassification processes can be assumed independent, the corrections could then be made serially. If however the probability of misclassification of disease status is related to whether or not the risk factor is misclassified, simple closed form estimates of the type proposed here are not applicable.

Policy implications

With the existence of simple methods for correction for the phenomenon, it may attract more attention in the future.

To date, misclassification of disease status has received little attention in epidemiology, compared with misclassification of risk factors. With the existence of simple methods for correction for the phenomenon, it may attract more attention in the future. The phenomenon of misclassification of disease status certainly exists: there are potential applications in psychiatric epidemiology, classification of dyskaryosis on cervical smear cytology, stroke type, and numerous other areas.



Appendix

With true prevalence p and, under symmetric misclassification, probability of correct classification α, the probability of agreement when the endpoint is measured twice is

$$ P(\text{agreement}) = \alpha^2 + (1 - \alpha)^2 $$

The probability of disagreement is

$$ P(\text{disagreement}) = 2\alpha(1 - \alpha) $$

Working in terms of the probability of agreement or disagreement enables us to dispense with the nuisance parameter for the prevalence. This is desirable as we may not wish to assume the prevalence in the validation sample to be equal to that in the main study. With repeated data as in table 1, the log-likelihood is therefore

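$$ \ell(\alpha) = (a + d)\ln\{\alpha^2 + (1 - \alpha)^2\} + (b + c)\ln\{2\alpha(1 - \alpha)\} $$

where a and d are taken to be the numbers of concordant pairs and b and c the numbers of discordant pairs.

Equating the first derivative to zero, the maximum likelihood estimate is a solution of

$$ \frac{2(a + d)(2\alpha - 1)}{\alpha^2 + (1 - \alpha)^2} - \frac{(b + c)(2\alpha - 1)}{\alpha(1 - \alpha)} = 0 $$

$$ \alpha^2 - \alpha + \frac{b + c}{2(a + b + c + d)} = 0 $$

Assuming that correct classification is more likely than incorrect, this quadratic equation solves to give

$$ \hat\alpha = \frac{1}{2}\left\{1 + \sqrt{1 - \frac{2(b + c)}{a + b + c + d}}\right\} $$

The second derivative is

$$ \frac{\partial^2 \ell}{\partial \alpha^2} = \frac{8(a + d)\alpha(1 - \alpha)}{\{\alpha^2 + (1 - \alpha)^2\}^2} - \frac{(b + c)\{\alpha^2 + (1 - \alpha)^2\}}{\alpha^2(1 - \alpha)^2} $$

We estimate the variance of the maximum likelihood estimate of α as

$$ \widehat{\mathrm{var}}(\hat\alpha) = -\left[\frac{\partial^2 \ell}{\partial \alpha^2}\right]^{-1}_{\alpha = \hat\alpha} $$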
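The estimation of α from repeated determinations can be sketched as below. The sketch assumes the two determinations err independently, so that the probability of agreement is α² + (1 − α)² and the log-likelihood depends only on the counts of concordant and discordant pairs. The variance is taken as minus the inverse of the second derivative at the estimate. The function name is hypothetical, and the counts are those implied by the fractions 50/56 and 10/16 quoted in Example 1 for 72 women.

```python
import math


def alpha_from_repeats(n_agree, n_disagree):
    """Estimate alpha from counts of concordant and discordant pairs.

    Returns (alpha_hat, variance), taking the root above 0.5.
    """
    n = n_agree + n_disagree
    disc = 1 - 2 * n_disagree / n
    if n_disagree == 0 or disc <= 0:
        raise ValueError("need 0 < discordant pairs < half of all pairs")
    alpha = (1 + math.sqrt(disc)) / 2
    u = alpha * (1 - alpha)        # half the probability of disagreement
    agree_prob = 1 - 2 * u         # equals alpha^2 + (1 - alpha)^2
    # Second derivative of the log-likelihood, evaluated at alpha_hat.
    second = 8 * n_agree * u / agree_prob**2 - n_disagree * agree_prob / u**2
    return alpha, -1 / second


# 50 + 10 concordant pairs and 6 + 6 discordant pairs.
alpha_hat, var_alpha = alpha_from_repeats(60, 12)
print(round(alpha_hat, 3), f"{var_alpha:.2e}")
```

With these counts the estimate rounds to the 0.91 quoted in Example 1. The variance comes out near 7.2×10⁻⁴, the same order as the value quoted there; the small difference may reflect rounding of α before evaluation.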




  • Funding: We thank the estate of the late Ali Reza Soudavar for financial support for Homa Keshavarz.

  • Conflicts of interest: none declared.
