Age in epidemiological analysis
S A Reijneveld
TNO Prevention and Health, PO Box 2215, 2301 CE Leiden, Netherlands
Correspondence to: Dr S A Reijneveld

It still merits attention

Analyses by age are among the most widely used tools in the epidemiological toolbox. They are mostly used to adjust for confounding by age or to assess effect modification. Epidemiologists generally handle age in one of two ways: as a continuous variable, or as a categorised variable formed by combining adjacent ages into a single category. For age as a continuous variable, standard epidemiological textbooks such as Rothman and Greenland’s recommend recording age as precisely as possible.1 They also draw attention to the handling of age as a categorised variable, especially to the way in which categories are chosen. Categories should not be too wide, to prevent residual confounding by age (incomplete adjustment for age because the risk still varies too much within a category).1 To facilitate comparisons between studies, the International Journal of Epidemiology even provides guidelines on forming categories: “grouping should be mid-decade to mid-decade or in five-year age groups (e.g. 35–44 or 35–39, 40–44, etc, but not 20–29, 30–39 or other groupings)” (accessed 7 November 2002).
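
The grouping rule quoted above can be sketched in a few lines of Python. This is only an illustration of the quoted guideline: the function name `age_group`, its `width` parameter, and the choice to let the youngest ten-year band run from 0 to 4 are assumptions made here, not part of the guideline.

```python
def age_group(age, width=10):
    """Return an age-band label for an age in years.

    width=10 gives mid-decade to mid-decade bands (25-34, 35-44, ...);
    width=5 gives five-year bands (35-39, 40-44, ...).
    Hypothetical helper, written only to illustrate the quoted guideline.
    """
    if width not in (5, 10):
        raise ValueError("width must be 5 or 10")
    if age < 0:
        raise ValueError("age must be non-negative")
    age = int(age)  # use completed years
    # Ten-year bands start at ages ending in 5, so shift the origin by 5.
    offset = 5 if width == 10 else 0
    lower = (age - offset) // width * width + offset
    upper = lower + width - 1
    # Assumption: ages below 5 fall in a shortened first band starting at 0.
    lower = max(lower, 0)
    return f"{lower}-{upper}"


for a in (3, 35, 44, 45):
    print(a, age_group(a), age_group(a, width=5))
```

Keeping the band width as an explicit argument makes it easy to rerun an analysis with narrower bands and see whether the age-adjusted estimate changes, which is one informal check for residual confounding within categories.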

Surprisingly little attention is paid, however, to two other factors regarding age that are just as relevant for epidemiological analyses. The first is whether age should be included in these analyses as a continuous variable or as a categorised variable and, if it is categorised, how the categories enter the model (even when they are defined as indicated in the previous paragraph). Many authors include age as a single variable in their model.2–4 This may be the best option, as it leads to the most parsimonious model, but only if the implicit assumption about the form of the association between the modelled outcome and age is right. For instance, if a logistic model is used, including age as a single variable implies the assumption that the logit of the outcome has a linear association with age. This is equivalent to the assumption that the odds p/(1−p) (in which p is the proportion of observations with a given outcome) have an exponential association with age. Such an assumption may be right, depending on the outcome, but should be assessed separately. If the assumption holds, age should be included as the original continuous variable; using categories in this case will increase measurement error and the likelihood of residual confounding.1 If the assumption does not hold, however, including age as a single variable in the model may lead to residual confounding or even introduce additional confounding, and thus yield biased results. In that case, a transformation of the measurement scale of age (for instance a logarithmic transformation, a power transformation such as a square root, or a polynomial) may still yield a valid and parsimonious model (see standard statistical textbooks such as the one by Armitage et al).5 Alternatively, dummy variables for the age categories can be included in the model, which avoids any assumption regarding the form of the association of age with the outcome.
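
One informal way to assess the linearity assumption is to compute the empirical logit in each age band and look at how it changes from band to band: if the logit is linear in age, the change per year of age should be roughly constant. The sketch below uses only the Python standard library. All counts are invented for illustration, and the names `empirical_logit`, `logit_slopes`, `linear_rows` and `u_shaped_rows` are hypothetical; it is not a substitute for a formal model comparison.

```python
import math


def empirical_logit(cases, total):
    """Log odds of the outcome, log(p / (1 - p)), with p = cases / total."""
    p = cases / total
    return math.log(p / (1 - p))


def logit_slopes(rows):
    """Change in empirical logit per year of age between adjacent age bands.

    rows: list of (band midpoint in years, number of cases, number of people).
    """
    logits = [(mid, empirical_logit(cases, total)) for mid, cases, total in rows]
    return [
        (l2 - l1) / (m2 - m1)
        for (m1, l1), (m2, l2) in zip(logits, logits[1:])
    ]


# Invented counts in which the odds double from one five-year band to the next
# (1/9, 2/9, 4/9), so the logit is exactly linear in age.
linear_rows = [(37.5, 100, 1000), (42.5, 200, 1100), (47.5, 400, 1300)]

# Invented counts with a U-shaped risk: a single linear age term would fit poorly.
u_shaped_rows = [(37.5, 200, 1000), (42.5, 100, 1000), (47.5, 200, 1000)]

print(logit_slopes(linear_rows))    # both slopes equal log(2) / 5, up to rounding
print(logit_slopes(u_shaped_rows))  # first slope negative, second positive
```

When the slopes differ clearly, as in the second set of counts, a transformation of age or dummy variables for the age bands would be the safer choice.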

A second factor is the impact of age in analyses in which socioeconomic position is associated with health outcomes, both at the individual and at the contextual level.6 Age may be strongly associated with socioeconomic position at both levels. At the individual level, the meaning of educational level may, for instance, depend on the age group concerned, at least in industrialised countries. In most of these countries, educational level is strongly associated with age. Among the elderly population, having only primary education or even less is highly prevalent, whereas it is rare among young adults. Thus, among young adults, having only primary education is a strong indicator of deprivation. Analytically, this may result in logistic models showing a modification by age of the association of educational level with health, whereas it actually reflects the fact that the proportion of people having a lower educational level varies by age.7 At the contextual level, many indicators of socioeconomic position are also associated with age, for instance because they are aggregates of measures of individual socioeconomic position. In this case, the use of crude indicators of socioeconomic position may increase measurement error regarding the association between health outcomes and measures of socioeconomic position, especially if these measures are strongly associated with age (as is the case with educational level),8 and it may even introduce confounding.

In short, even though analyses by age are among the most widely used tools in the epidemiological toolbox, the adequate inclusion of age still merits attention.
