Abstract
OBJECTIVE The attributable risk (AR), which represents the proportion of cases that could be prevented if a risk factor were completely eliminated from a population, is the most commonly used epidemiological index for assessing the impact of controlling a selected risk factor on community health. The goal of this paper is to develop and search for good interval estimators of the AR for case-control studies with matched pairs.
METHODS This paper considers five asymptotic interval estimators of the AR, including the interval estimator using Wald's statistic suggested elsewhere, the two interval estimators using the logarithmic transformations: log(x) and log(1–x), the interval estimator using the logit transformation log(x/(1–x)), and the interval estimator derived from a simple quadratic equation developed in this paper. This paper compares the finite sample performance of these five interval estimators by calculation of their coverage probability and average length in a variety of situations.
RESULTS This paper demonstrates that the interval estimator derived from the quadratic equation proposed here can not only consistently perform well with respect to the coverage probability, but also be more efficient than the interval estimator using Wald's statistic in almost all the situations considered here. This paper notes that although the interval estimator using the logarithmic transformation log(1–x) may also perform well with respect to the coverage probability, using this estimator is likely to be less efficient than the interval estimator using Wald's statistic. Finally, this paper notes that when both the underlying odds ratio (OR) and the prevalence of exposure (PE) in the case group are not large (OR ⩽2 and PE ⩽0.10), the application of the two interval estimators using the transformations log(x) and log(x/(1–x)) can be misleading. However, when both the underlying OR and PE in the case group are large (OR ⩾4 and PE ⩾0.50), the interval estimator using the logit transformation can actually outperform all the other estimators considered here in terms of efficiency.
CONCLUSIONS When there is no prior knowledge of the possible range for the underlying OR and PE, the interval estimator derived from the quadratic equation developed here for general use is recommended. When it is known that both the OR and PE in the case group are large (OR ⩾4 and PE ⩾0.50), it is recommended that the interval estimator using the logit transformation is used.
 case-control studies
 attributable risk
 interval estimation
To assess the public health importance of controlling a selected risk factor, the attributable risk (AR), which represents the proportion of cases that could be prevented if this risk factor were completely eliminated from a population, is probably one of the most commonly used epidemiological indices.1 When studying a rare disease in the presence of nuisance confounders, we may often use a matched pair case-control study design to increase efficiency. In fact, estimation of the AR using retrospective data has recently received intensive discussion.2–19 There are, however, only a few papers that discuss estimation of the AR in matched case-control studies. Whittemore18 included a brief discussion on estimation of the AR for frequency matching, but noted that her approach would not be appropriate for the matched pair study, in which each stratum consists of only one case and one control. Using Wald's statistic, Kuritz and Landis12 derived an asymptotic interval estimator of the AR. Kuritz and Landis13 further extended their result to the case of more than one matched control per case, but found that the coverage probability of their interval estimator might be less than the desired confidence level by >2% even when the number of matched pairs was as large as 100.
The purpose of this paper is to search for better alternatives to the interval estimator of the AR using Wald's statistic for the matched pair case-control study. This paper considers five interval estimators of the AR, including the estimator using Wald's statistic,12 the two interval estimators using the logarithmic transformations6 17 log(x) and log(1–x), the interval estimator using the logit transformation15 log(x/(1–x)), and the interval estimator derived from a simple quadratic equation developed here. To compare the finite sample performance of these estimators, this paper calculates the exact coverage probability and the average length in a variety of situations. Finally, this paper includes an example taken from a study of oral conjugated oestrogens and endometrial cancer20 21 to illustrate the use of these interval estimators.
Methods
Consider a case-control study in which we take a random sample of n subjects from the case group and, for each of these randomly selected cases, match a control with respect to some nuisance confounders to form n matched pairs. We then classify each pair according to exposure status into one cell of the following fourfold table:
                   Control exposed   Control unexposed   Total
Case exposed       p_{11}            p_{12}              p_{1.}
Case unexposed     p_{21}            p_{22}              p_{2.}
Total              p_{.1}            p_{.2}              1
where 0 < p_{ij} < 1 denotes the corresponding cell probability, p_{i.} = p_{i1} + p_{i2}, and p_{.j} = p_{1j} + p_{2j} for i and j = 1, 2. By definition, the AR is equal to12 22: P(E|D)(RR–1)/RR, where P(E|D) (= p_{1.}) denotes the prevalence of exposure (PE) in the case group, and the RR denotes the relative risk of possessing the underlying disease of interest between the exposed and the unexposed. When the underlying disease is rare, we can substitute the odds ratio (OR = p_{12}/p_{21}) for the RR and use p_{1.}(p_{12}–p_{21})/p_{12} to approximate the AR. Thus, in the following discussion we assume that the underlying disease is so rare that the difference between the AR and p_{1.}(p_{12}–p_{21})/p_{12} is negligible.
Let n_{ij} denote the observed frequency of pairs falling into the cell with probability p_{ij}, where i and j = 1, 2. The random vector n′ = (n_{11}, n_{12}, n_{21}, n_{22}) then follows the multinomial distribution with parameters n and p′ = (p_{11}, p_{12}, p_{21}, p_{22}). Note that the sample proportion p̂_{ij} = n_{ij}/n is the maximum likelihood estimator (MLE) of p_{ij}, and so are p̂_{i.} = n_{i.}/n and p̂_{.j} = n_{.j}/n of p_{i.} and p_{.j}, respectively, where n_{i.} = n_{i1} + n_{i2} and n_{.j} = n_{1j} + n_{2j}. Therefore, the MLE of the AR is simply ^AR = p̂_{1.}(p̂_{12}–p̂_{21})/p̂_{12}. Define the random vector p̂′ = (p̂_{11}, p̂_{12}, p̂_{21}, p̂_{22}). By the Central Limit Theorem, we know that the vector √n(p̂–p) asymptotically follows the normal distribution with mean vector 0 and covariance matrix D(p)–pp′, where 0′ = (0, 0, . . ., 0) and D(p) is a 4×4 diagonal matrix with diagonal elements equal to p_{11}, p_{12}, p_{21}, and p_{22}. By use of the delta method, we obtain the asymptotic variance of ^AR to be Var(^AR) = {(p_{12}–p_{21})^{2}p_{11} + (p^{2}_{12} + p_{21}p_{11})^{2}/p_{12} + p^{2}_{1.}p_{21} – [p_{1.}(p_{12}–p_{21})]^{2}}/(np^{2}_{12}), which we can estimate by simply substituting the MLE p̂_{ij} for the unknown parameter p_{ij}. We denote this estimated variance by ^Var(^AR). These lead us to obtain the asymptotic 100(1–α)% confidence interval proposed elsewhere12 for the AR to be:
[^AR – Z_{α/2}√(^Var(^AR)), ^AR + Z_{α/2}√(^Var(^AR))], (1)
where Z_{α/2} is the upper 100(α/2)th percentile of the standard normal distribution.
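As a rough illustration of the point estimate, the delta-method variance, and the Wald-type interval, the following Python sketch may help. It assumes that the interval takes the usual form of the estimate plus or minus Z_{α/2} times the estimated standard error; the function names are hypothetical, and the counts are those of the oestrogen example analysed later in the paper.

```python
from math import sqrt
from statistics import NormalDist


def ar_and_variance(n11, n12, n21, n22):
    """Return (AR estimate, estimated asymptotic variance) from pair counts.

    Assumes every count is positive, so no sparse-data adjustment is applied.
    """
    n = n11 + n12 + n21 + n22
    p11, p12, p21 = n11 / n, n12 / n, n21 / n
    p1_dot = p11 + p12  # prevalence of exposure among cases
    ar = p1_dot * (p12 - p21) / p12
    numerator = (
        (p12 - p21) ** 2 * p11
        + (p12 ** 2 + p21 * p11) ** 2 / p12
        + p1_dot ** 2 * p21
        - (p1_dot * (p12 - p21)) ** 2
    )
    return ar, numerator / (n * p12 ** 2)


def wald_interval(n11, n12, n21, n22, alpha=0.05):
    """Estimate plus or minus z times the estimated standard error."""
    ar, var = ar_and_variance(n11, n12, n21, n22)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(var)
    return ar - half_width, ar + half_width


ar_hat, var_hat = ar_and_variance(12, 43, 7, 121)
lower, upper = wald_interval(12, 43, 7, 121)
print(round(ar_hat, 3), round(lower, 3), round(upper, 3))
```

With these counts the sketch should agree, to three decimals, with the estimate 0.252 and the interval [0.172, 0.331] reported for the example.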
Attempting to improve the normal approximation to the statistic ^AR, we follow Katz et al23 and consider the logarithmic transformation. Using the delta method, we obtain the estimated asymptotic variance ^Var(log(^AR)) = (^AR)^{−2}^Var(^AR). Hence, an asymptotic 100(1–α)% confidence interval for the AR is:
[^AR exp(–Z_{α/2}√(^Var(log(^AR)))), ^AR exp(Z_{α/2}√(^Var(log(^AR))))]. (2)
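Following Leung and Kupper,15 we consider the logit transformation log(^AR/(1–^AR)). By the delta method again, we can easily show that the estimated asymptotic variance ^Var(log(^AR/(1–^AR))) = (^AR(1–^AR))^{−2}^Var(^AR). Hence, an asymptotic 100(1–α)% confidence interval for the AR using the logit transformation is:
[exp(a_{l})/(1 + exp(a_{l})), exp(a_{u})/(1 + exp(a_{u}))], (3)
where a_{l} = log(^AR/(1–^AR)) – Z_{α/2}√(^Var(log(^AR/(1–^AR)))) and a_{u} = log(^AR/(1–^AR)) + Z_{α/2}√(^Var(log(^AR/(1–^AR)))).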
Note that the logarithmic function log(x) is defined only for x >0. When the resulting estimate ^AR ⩽0, neither interval estimator (2) nor interval estimator (3) is applicable. Consider φ = 1–AR = (p_{12}p_{2.} + p_{1.}p_{21})/p_{12}, which is always >0. Thus, following Fleiss,6 we consider the logarithmic transformation log(1–^AR) = log(^φ) rather than log(^AR) as used for deriving interval estimator (2). Note that ^Var(1–^AR) = ^Var(^AR). By use of the delta method, we obtain the estimated asymptotic variance ^Var(log(^φ)) to be ^Var(^AR)/^φ^{2}. Therefore, we obtain an asymptotic 100(1–α)% confidence interval of the AR to be:
[1 – ^φ exp(Z_{α/2}√(^Var(log(^φ)))), 1 – ^φ exp(–Z_{α/2}√(^Var(log(^φ))))]. (4)
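The three transformation-based intervals can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each interval is taken to be the usual back-transformation of a symmetric normal interval on the transformed scale, the function name and dictionary keys are hypothetical, and the inputs below are approximate values of ^AR and ^Var(^AR) computed from the example counts n_{11} = 12, n_{12} = 43, n_{21} = 7, n_{22} = 121.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def expit(x):
    """Inverse of the logit transformation."""
    return exp(x) / (1 + exp(x))


def transformed_intervals(ar_hat, var_hat, alpha=0.05):
    """Back-transformed intervals based on log(x), logit(x) and log(1 - x).

    The log and logit intervals are returned as None when ar_hat <= 0,
    because log(ar_hat) is then undefined. Assumes ar_hat < 1.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(var_hat)
    result = {"log": None, "logit": None}
    if 0 < ar_hat < 1:
        w = z * se / ar_hat  # z times the standard error of log(ar_hat)
        result["log"] = (ar_hat * exp(-w), ar_hat * exp(w))
        centre = log(ar_hat / (1 - ar_hat))
        w = z * se / (ar_hat * (1 - ar_hat))  # same, on the logit scale
        result["logit"] = (expit(centre - w), expit(centre + w))
    phi = 1 - ar_hat  # always positive because ar_hat < 1
    w = z * se / phi
    result["log1m"] = (1 - phi * exp(w), 1 - phi * exp(-w))
    return result


intervals = transformed_intervals(0.2516, 0.0016505)
for name, limits in intervals.items():
    print(name, [round(x, 3) for x in limits])
```

Returning None rather than raising an error makes it easy to count how often the log and logit intervals fail to exist when the sketch is used inside an enumeration or simulation.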
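Recall that the asymptotic variance Var(^AR) = (G – AR^{2})/n, where G = {(p_{12}–p_{21})^{2}p_{11} + (p^{2}_{12} + p_{21}p_{11})^{2}/p_{12} + p^{2}_{1.}p_{21}}/p^{2}_{12}.
As n is large, the probability P((^AR – AR)^{2} ⩽ Z^{2}_{α/2}(G – AR^{2})/n) is approximately 1–α.
These lead us to consider the following quadratic inequality in AR: A(AR)^{2} – 2B(AR) + C ⩽ 0, where A = 1 + Z^{2}_{α/2}/n, B = ^AR, and C = ^AR^{2} – Z^{2}_{α/2}G*/n, and where G* denotes G with p̂_{ij} substituted for p_{ij}.
An asymptotic 100(1–α)% confidence interval of the AR is then
[(B – √(B^{2}–AC))/A, (B + √(B^{2}–AC))/A]. (5)
Note that the coefficient A is >0 and hence the above quadratic function is convex. Furthermore, when using the commonly used adjustment procedure for sparse data (which is described in the appendix and in the next section), we can show that the inequality B^{2}–AC >0 holds for all samples (appendix) and thereby the two distinct roots that give confidence limits (5) always exist.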
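A minimal Python sketch of the quadratic-equation interval follows. It assumes the coefficients A = 1 + Z^{2}_{α/2}/n, B = ^AR, and C = ^AR^{2} – Z^{2}_{α/2}G*/n, a form inferred from the expression for B^{2}–AC given in the appendix, and it assumes all four counts are positive so that no sparse-data adjustment is needed. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def quadratic_interval(n11, n12, n21, n22, alpha=0.05):
    """Roots of A*x**2 - 2*B*x + C = 0 under the assumed coefficients."""
    n = n11 + n12 + n21 + n22
    p11, p12, p21 = n11 / n, n12 / n, n21 / n
    p1_dot = p11 + p12
    ar = p1_dot * (p12 - p21) / p12
    # G* is n times the estimated variance plus the squared estimate.
    g_star = (
        (p12 - p21) ** 2 * p11
        + (p12 ** 2 + p21 * p11) ** 2 / p12
        + p1_dot ** 2 * p21
    ) / p12 ** 2
    z_sq = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    a = 1 + z_sq / n
    b = ar
    c = ar ** 2 - z_sq * g_star / n
    root = sqrt(b * b - a * c)  # positive discriminant, see the appendix
    return (b - root) / a, (b + root) / a


lower, upper = quadratic_interval(12, 43, 7, 121)
print(round(lower, 3), round(upper, 3))
```

Because A exceeds 1, the midpoint B/A of this interval is pulled slightly towards 0 relative to ^AR, which is consistent with the example interval for (5) sitting a little to the left of the Wald interval.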
Evaluation of interval estimators
To compare the performance of interval estimators (1–5) of the AR, we calculate the exact coverage probability and the average length of the resulting confidence intervals on the basis of the multinomial probability mass function
f(n | p) = [n!/(n_{11}! n_{12}! n_{21}! n_{22}!)] p_{11}^{n_{11}} p_{12}^{n_{12}} p_{21}^{n_{21}} p_{22}^{n_{22}}.
By definition, we calculate the coverage probability of a given interval estimator [AR_{l}, AR_{u}] as
Σ 1(AR ∈ [AR_{l}, AR_{u}]) f(n | p),
where 1(AR ∈ [AR_{l}, AR_{u}]) is an indicator function that equals 1 if the underlying AR falls into the interval [AR_{l}, AR_{u}] and 0 otherwise, and where the summation is over all possible vectors n such that Σ_{i}Σ_{j} n_{ij} = n. Similarly, we calculate the average length as Σ (AR_{u} – AR_{l}) f(n | p). Note that if n_{ij} were 0, the sample proportion p̂_{ij} would be on the boundary of 0. Thus, as noted in the appendix, whenever any n_{ij} is 0, we apply the commonly used adjustment procedure for sparse data by adding 0.50 to each cell and using (n_{ij} + 0.50)/(n + 2) to estimate p_{ij}. Recall that if the resulting estimate ^AR were ⩽0 (or equivalently, the estimate ^OR ⩽1), interval estimators (2 and 3) would be inapplicable. Thus, for interval estimators (2 and 3), we calculate the conditional coverage probability and average length of the resulting confidence intervals under the truncated multinomial distribution, excluding those random vectors n for which the corresponding interval estimate does not exist. For completeness, we also calculate the probability of failing to produce an interval estimate using (2 and 3).
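The exact calculation described above amounts to enumerating every fourfold table with total n and weighting it by its multinomial probability. The Python sketch below illustrates this for the Wald-type interval only; it is not the SAS program used in the paper. The function names are hypothetical, and keeping n (rather than n + 2) in the variance after the sparse-data adjustment is an assumption.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def wald_interval(counts, z):
    """Wald-type interval with the add-0.5 adjustment when a cell is empty."""
    n = sum(counts)
    if 0 in counts:
        probs = [(c + 0.5) / (n + 2) for c in counts]
    else:
        probs = [c / n for c in counts]
    p11, p12, p21, _ = probs
    p1_dot = p11 + p12
    ar = p1_dot * (p12 - p21) / p12
    num = ((p12 - p21) ** 2 * p11 + (p12 ** 2 + p21 * p11) ** 2 / p12
           + p1_dot ** 2 * p21 - (p1_dot * (p12 - p21)) ** 2)
    half = z * sqrt(max(num, 0.0) / (n * p12 ** 2))  # guard against rounding
    return ar - half, ar + half


def log_pmf(counts, probs):
    """Log of the multinomial probability mass function."""
    value = lgamma(sum(counts) + 1)
    for c, p in zip(counts, probs):
        value += c * log(p) - lgamma(c + 1)
    return value


def exact_performance(probs, n, interval=wald_interval, alpha=0.05):
    """Return (coverage probability, average length, total probability)."""
    p11, p12, p21, _ = probs
    true_ar = (p11 + p12) * (p12 - p21) / p12
    z = NormalDist().inv_cdf(1 - alpha / 2)
    coverage = avg_length = total = 0.0
    for n11 in range(n + 1):
        for n12 in range(n - n11 + 1):
            for n21 in range(n - n11 - n12 + 1):
                counts = (n11, n12, n21, n - n11 - n12 - n21)
                weight = exp(log_pmf(counts, probs))
                lo, hi = interval(counts, z)
                total += weight
                avg_length += weight * (hi - lo)
                if lo <= true_ar <= hi:
                    coverage += weight
    return coverage, avg_length, total


# OR = 4, p1. = 0.50, p12 = 0.25, n = 20
cov, length, total = exact_performance((0.25, 0.25, 0.0625, 0.4375), 20)
print(cov, length, total)
```

The third returned value should be 1 up to rounding error and serves as a check that the enumeration covers the whole sample space. For n = 200 the triple loop visits roughly 1.4 million tables, which is slow but feasible in pure Python.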
Given the values of the underlying OR, p_{1.}, and p_{12}, we can uniquely determine all the other parameters through the following equations: p_{21} = p_{12}/OR; p_{11} = p_{1.}–p_{12}; p_{22} = 1–p_{11}–p_{12}–p_{21}; and AR = p_{1.}(p_{12}–p_{21})/p_{12}. We consider the situations in which OR = 1, 2, 4, 8, 32; the probabilities p_{1.} and p_{12} equal 0.01 and 0.005, 0.10 and 0.05, 0.50 and 0.25, or 0.80 and 0.40, retaining only those combinations that lead to a valid probability vector p for which p_{ij} >0 for all i and j; and n = 20, 50, 100, and 200. These cover the range of the AR from 0.0 to 0.775. We write programs in SAS24 to enumerate the probabilities of the desired multinomial distribution in our calculation.
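The mapping from (OR, p_{1.}, p_{12}) to the full probability vector and the AR can be written out directly. The following Python sketch uses a hypothetical function name and returns None for combinations that do not give a valid probability vector; the small tolerance guarding against floating-point rounding is an assumption of the sketch.

```python
def cell_probabilities(odds_ratio, p1_dot, p12, tol=1e-12):
    """Derive p11, p21, p22 and the AR from OR, p1. and p12.

    Returns None when any cell probability is not strictly positive.
    """
    p21 = p12 / odds_ratio
    p11 = p1_dot - p12
    p22 = 1 - p11 - p12 - p21
    if min(p11, p12, p21, p22) <= tol:
        return None
    ar = p1_dot * (p12 - p21) / p12
    return {"p11": p11, "p12": p12, "p21": p21, "p22": p22, "AR": ar}


for odds_ratio in (1, 2, 4, 8, 32):
    for p1_dot, p12 in ((0.01, 0.005), (0.10, 0.05), (0.50, 0.25), (0.80, 0.40)):
        setting = cell_probabilities(odds_ratio, p1_dot, p12)
        if setting is not None:
            print(odds_ratio, p1_dot, p12, round(setting["AR"], 4))
```

For instance, OR = 32 with p_{1.} = 0.80 and p_{12} = 0.40 gives p_{21} = 0.0125, p_{22} = 0.1875, and AR = 0.775, the upper end of the range quoted above, whereas OR = 1 with the same p_{1.} and p_{12} would require p_{22} = –0.20 and is therefore excluded.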
Results
key points

When the disease is rare, case-control studies with matched pairs are often used. However, research on interval estimation of the AR under this design is limited.

This paper considers and compares the performance of five asymptotic interval estimators of the AR, including the one proposed recently on the basis of Wald's statistic.

This paper demonstrates that the interval estimator derived from a quadratic equation developed here is generally preferable to the one based on Wald's statistic.

This paper provides a general guideline about the selection of better interval estimators with respect to the coverage probability and the average length of the confidence intervals.
Table 1 summarises the coverage probability and the average length of the 95% confidence interval in application of interval estimators (1–5). Firstly, note that when the underlying OR = 1 (that is, AR = 0), the coverage probability of the 95% confidence interval for both (2) and (3) is 0%. Note also that the coverage probability of the asymptotic 95% confidence interval using either (4) or (5) is almost always larger than or approximately equal to the desired confidence level in the situations considered in table 1, whereas the coverage probability of using (1) is occasionally less than this desired confidence level by 2% to 3% when n is not large (⩽100). When comparing the average length of interval estimator (5) with that of (1), as shown in table 1, we find that the former is generally more efficient than the latter. When both the OR and the PE in the case group are moderate or large (OR ⩾4 and p_{1.} ⩾0.50), we find that interval estimator (3) can even be slightly more efficient than (5), while maintaining the coverage probability ⩾95%. Note that when the PE is small (p_{1.} ⩽0.10), the probability of failing to produce an interval estimate using (2) and (3) can be substantial (table 2). When the OR is large (⩾4), however, this probability diminishes to approximately 0 as p_{1.} increases to 0.80.
AN EXAMPLE
To illustrate use of interval estimators (1–5), we consider data consisting of 183 pairs taken from a case-control study of the use of oral conjugated oestrogens and endometrial cancer.12 20 21 We match each case with a control on race, age (within five years), date of admission (within six months), and hospital of admission. We then classify these 183 matched pairs according to their exposure status (ever versus never) with regard to use of the oestrogens. We obtain n_{11} = 12, n_{12} = 43, n_{21} = 7, and n_{22} = 121. Suppose that we are interested in estimation of the AR of endometrial cancer attributable to the use of the oestrogens. As given elsewhere,12 we obtain the estimate ^AR to be 0.252. Furthermore, when using interval estimators (1–5), we obtain the asymptotic 95% confidence intervals to be: [0.172, 0.331], [0.183, 0.345], [0.181, 0.339], [0.168, 0.327], and [0.167, 0.325], respectively. We see that the resulting 95% confidence intervals using (2) and (3) tend to shift slightly to the right as compared with the other three resulting interval estimates (1), (4), and (5), which are all similar to one another.
Discussion
To evaluate whether it is appropriate to apply interval estimators (1–5) in the particular configuration given by the example, we consider the situation in which the parameters are determined by the empirical estimates from the data: ^OR = 6.14, p̂_{1.} = 0.30, p̂_{12} = 0.24, and n = 183. In application of interval estimators (1–5), we obtain the coverage probability and the average length (in parentheses) of the corresponding asymptotic 95% confidence intervals to be: 0.948 (0.158), 0.953 (0.161), 0.956 (0.158), 0.949 (0.159), and 0.950 (0.157). We can see that all interval estimators (1–5) perform reasonably well with respect to the coverage probability and that interval estimator (5) seems to be slightly more efficient than the others in terms of the average length. This is consistent with the finding that interval estimator (5) is generally more efficient than the others unless both the RR and the PE are moderate or large (RR ⩾4, p_{1.} ⩾0.50), as presented in table 1.
Note that the functions exp(x) and exp(x)/[1 + exp(x)] are always positive and so are the lower limits of both interval estimators (2) and (3). Thus, if the underlying AR were 0, the coverage probability of these interval estimators would obviously be 0. This explains why the coverage probability of (2) and (3) is 0 when the underlying OR = 1 regardless of the sample size n (table 1). Furthermore, if the PE were small (p_{1.} ⩽0.10), then both the probabilities p_{12} and p_{21} (= p_{12}/OR, where OR ⩾1) would be close to 0. Thus, the probability that the difference between the estimates p̂_{12} and p̂_{21} is ⩽0 (and hence the resulting estimate ^AR ⩽0) can be substantial. This accounts for the finding that the probability of failing to produce an interval estimate using (2) and (3) can be quite large in this case (table 2).
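The failure probability discussed above can be computed exactly from the trinomial distribution of (n_{12}, n_{21}, remaining pairs). The Python sketch below rests on one observation, stated here as an assumption of the sketch: because the sparse-data adjustment adds the same constant to every cell, ^AR ⩽0 exactly when n_{12} ⩽ n_{21}. The function name is hypothetical.

```python
from math import comb


def failure_probability(p12, p21, n):
    """P(n12 <= n21), the chance that intervals (2) and (3) do not exist."""
    rest = 1 - p12 - p21  # probability of a concordant pair
    total = 0.0
    for n12 in range(n + 1):
        # n21 must be at least n12 and leave a non-negative remainder.
        for n21 in range(n12, n - n12 + 1):
            others = n - n12 - n21
            total += (comb(n, n12) * comb(n - n12, n21)
                      * p12 ** n12 * p21 ** n21 * rest ** others)
    return total


# Small PE (p1. = 0.01, p12 = 0.005) versus large PE (p1. = 0.80, p12 = 0.40),
# both with OR = 4 and n = 20.
print(failure_probability(0.005, 0.00125, 20))
print(failure_probability(0.40, 0.10, 20))
```

When p_{12} and p_{21} are both close to 0, the event n_{12} = n_{21} = 0 alone already carries most of the probability, which is why the failure probability is large for small PE even when the OR is large.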
We find that except for a few cases where the PE in the case group is large (p_{1.} = 0.80), interval estimator (1) using Wald's statistic does perform reasonably well. While applying interval estimator (4) using the transformation log(1–x) can improve the coverage probability of applying (1), using the former is likely to lose efficiency as compared with the latter. By contrast, applying interval estimator (5) can generally not only improve the coverage probability of (1) but also increase the efficiency. Thus, we recommend interval estimator (5) for general use. When we know that both the underlying RR and p_{1.} are not small (RR ⩾4 and p_{1.} ⩾0.50) from our prior studies, however, we may wish to use interval estimator (3) as well, especially when n is not large.
Finally, although the interval estimators considered here are derived on the basis of large sample theory, interval estimators (1, 4, and 5) can generally, as shown here, perform well with respect to the coverage probability even when the number of matched pairs n is as small as 20 (table 1). Furthermore, interval estimators (3 and 4) could also perform well for n = 20 if the underlying OR and PE were not small (OR ⩾4 and p_{1.} ⩾0.10). Because it would be quite rare for public health administrators to estimate the AR based on data with fewer than 20 cases, the situations considered here should cover most cases encountered in practice.
In summary, this paper considers five interval estimators of the AR for matched pair case-control studies. This paper includes a discussion that provides an insight into the characteristics of the performance of these five interval estimators. This paper shows that the interval estimator derived from a quadratic equation suggested here can outperform the interval estimator using Wald's statistic proposed elsewhere. This paper further notes that interval estimators using the two logarithmic transformations log(x) and log(1–x) generally cause a loss of efficiency. Finally, this paper notes that the interval estimator using the logit transformation can be useful when both the underlying RR and the PE in the case group are large. The discussion and the findings presented here should be useful to biostatisticians and epidemiologists when they want to estimate the AR using a matched pair case-control study.
Appendix
Firstly, note that if we obtained the estimate p̂_{ij} to be 0 for some cell (i, j) from a given sample, then p̂_{ij} would be on the boundary. To avoid this concern, if p̂_{ij} should be 0 for some cell (i, j), we would recommend using the commonly used adjustment procedure for sparse data by adding 0.50 to each cell and using (n_{ij} + 0.50)/(n + 2) to estimate p_{ij}. Thus, we may assume that the resulting estimate p̂_{ij} always falls in 0 < p̂_{ij} < 1.
Note that B^{2}–AC = Z^{2}_{α/2}(G*–^AR^{2})/n + Z^{4}_{α/2}G*/n^{2}, where G* = {(p̂_{12}–p̂_{21})^{2}p̂_{11} + (p̂^{2}_{12} + p̂_{21}p̂_{11})^{2}/p̂_{12} + p̂^{2}_{1.}p̂_{21}}/p̂^{2}_{12}. Note that the asymptotic variance Var(^AR), which equals {[(p_{12}–p_{21})^{2}p_{11} + (p^{2}_{12} + p_{21}p_{11})^{2}/p_{12} + p^{2}_{1.}p_{21}]/p^{2}_{12} – AR^{2}}/n, is always ⩾0 for any vector p′ = (p_{11}, p_{12}, p_{21}, p_{22}) that satisfies 0 < p_{ij} < 1 and Σ_{i}Σ_{j} p_{ij} = 1. Because we can obtain G*–^AR^{2} by simply substituting the particular estimate p̂_{ij} (which obviously satisfies 0 < p̂_{ij} < 1 and Σ_{i}Σ_{j} p̂_{ij} = 1) for p_{ij} in nVar(^AR), the inequality G*–^AR^{2} ⩾ 0 is always true. Furthermore, when 0 < p̂_{ij} < 1, we can easily see that G* >0. These results show that B^{2}–AC = Z^{2}_{α/2}(G*–^AR^{2})/n + Z^{4}_{α/2}G*/n^{2} is >0 for all samples.
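For readers who wish to verify the expression for B^{2}–AC, the expansion can be written out as follows. It assumes the coefficients A = 1 + Z^{2}_{α/2}/n, B = ^AR, and C = ^AR^{2} – Z^{2}_{α/2}G*/n; this form is inferred from the stated result rather than quoted, so it should be read as a consistency check under that assumption.

```latex
\begin{align*}
B^{2} - AC
  &= \widehat{AR}^{2}
     - \left(1 + \frac{Z_{\alpha/2}^{2}}{n}\right)
       \left(\widehat{AR}^{2} - \frac{Z_{\alpha/2}^{2} G^{*}}{n}\right) \\
  &= \widehat{AR}^{2} - \widehat{AR}^{2}
     + \frac{Z_{\alpha/2}^{2} G^{*}}{n}
     - \frac{Z_{\alpha/2}^{2}\,\widehat{AR}^{2}}{n}
     + \frac{Z_{\alpha/2}^{4} G^{*}}{n^{2}} \\
  &= \frac{Z_{\alpha/2}^{2}\left(G^{*} - \widehat{AR}^{2}\right)}{n}
     + \frac{Z_{\alpha/2}^{4} G^{*}}{n^{2}} .
\end{align*}
```

The first term is non-negative because G*–^AR^{2} ⩾ 0, and the second is strictly positive because G* >0, so the discriminant is strictly positive.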