We provide a relatively non-technical glossary of terms and a description of the tools used in spatial or geographical epidemiology and associated geographical information systems. Statistical topics included cover adjustment and standardisation to allow for demographic and other background differences, data structures, data smoothing, spatial autocorrelation and spatial regression. We also discuss the rationale for geographical epidemiology and specific techniques such as disease clustering, disease mapping, ecological analyses, geographical information systems and global positioning systems.
- GIS, geographical information system
- GPS, global positioning system
- SMR, standardised mortality ratio
“Place” can usually be applied as a surrogate for the interaction between genetic factors, lifestyle and environment.1 Although the role of place in human health has been recognised historically,2 the focus in public health research has mostly been on person and time, with little consideration of the implications of place.3 Most public health specialists seem to have forgotten the space dimension of disease processes.4 This is a pity, as comparison between places, together with comparisons between times and between individuals, is a useful means of formulating and testing aetiological hypotheses. In addition, from the perspective of public health practice, knowledge that a health problem is concentrated in identifiable places is essential for the efficient distribution of resources for prevention, treatment or amelioration. There have been several reasons for this apparent lack of interest in place, including a dearth of appropriate databases and insufficient appropriate software.5 However, substantial recent advances in geographical information systems (GISs) now provide researchers and public health practitioners with an excellent environment in which to explore their data.6 In addition, there is an increasing number of public health databases, in which the locations of the cases are recorded. It seems likely, therefore, that once they have understood its utility, scientists and public health practitioners will seek to use this spatial information.
At first glance, spatial analysis and its tools appear dauntingly complicated. This is not so, but there is a need for a glossary to explain common terms in geographical epidemiology, spatial analysis and GISs.
Adjustment and standardisation
Age, sex, socioeconomic and other variables vary from one place to another and may also influence the risk of disease. Observed differences in risk of illness or death are likely to be confounded by these variables and, therefore, comparisons of risk must take this important issue into account. The process of adjustment for potential confounding variables has an important role in the evaluation of the spatial variation in mortality and disease rates. The aim of an adjustment process is to produce a single summary value, such as the standardised incidence rate ratio (see below), which is unaffected by differences in the distributions of potential confounders.7 The two most common approaches to adjustment are direct and indirect weighting of stratum-specific rates. We illustrate the idea using adjustments for age differences, as age is almost always considered to be a confounding variable in epidemiological studies. However, we emphasise that the same procedure could be applied in adjustments to take account of other confounding variables.
In the direct approach, a weighted average of the age-specific rates from a study population is created, based on the age distribution of a reference population8 (for example, the national population).9 This is an estimate of the expected number of deaths in the reference population if the age-specific rates were the same as those that have been observed in the study population. An easily interpreted ratio is then obtained by dividing this expected number of deaths in the reference population by the observed number of deaths in the reference population over the same period of time.10 This ratio is termed either the comparative mortality figure or standardised incidence rate ratio, and was first proposed in 1884.11 In the indirect method, the crude rate in the reference population is multiplied by a ratio known as the standardised mortality ratio (SMR).8 The SMR is calculated by dividing the observed number (O) of cases within the study population by the expected number (E) of cases in the study population, assuming that the age-specific rates in the reference population also applied to the population under study.
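The two calculations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the age bands, case counts and population sizes are invented, and the function names are hypothetical rather than taken from any package.

```python
# Hypothetical age-specific data: age band -> (cases, population).
study = {"0-39": (4, 20_000), "40-64": (30, 15_000), "65+": (90, 5_000)}
reference = {
    "0-39": (600, 3_000_000),
    "40-64": (3_000, 2_000_000),
    "65+": (10_000, 1_000_000),
}

def comparative_mortality_figure(study, reference):
    """Direct method: apply study rates to the reference age structure."""
    expected_in_reference = sum(
        (study[band][0] / study[band][1]) * reference[band][1]
        for band in reference
    )
    observed_in_reference = sum(cases for cases, _ in reference.values())
    return expected_in_reference / observed_in_reference

def standardised_mortality_ratio(study, reference):
    """Indirect method: apply reference rates to the study age structure."""
    observed = sum(cases for cases, _ in study.values())  # O
    expected = sum(  # E
        (reference[band][0] / reference[band][1]) * study[band][1]
        for band in study
    )
    return observed / expected

cmf = comparative_mortality_figure(study, reference)
smr = standardised_mortality_ratio(study, reference)
print(f"CMF (direct): {cmf:.2f}")
print(f"SMR (indirect): {smr:.2f}")
```

With these invented figures both ratios exceed 1 (about 1.66 and 1.62), but they are not equal, because each method weights the age-specific rates by a different age structure.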
It should be noted that if, for instance, the age distributions of two areas differ, the comparison of their SMRs (determined by reference to an external reference population) may be subject to a bias comparable with statistical confounding.10 In other words, when compared with an external reference population, the indirect method of standardisation (the SMR) yields different rate ratios for cohorts with a different demographic structure even though the incidence rates within the demographic strata are identical. Despite the cautions raised in the literature, SMRs have been recommended for mapping by well-known statisticians.12 Directly adjusted rates, however, also have problems. For instance, they may provide less stable estimates because the standard error of the rates depends on variations in the age-specific number of cases rather than the total number of cases.10 In practice, the choice between these methods depends on the type of data available. The indirect method is the only choice, for example, when the age-specific incidence counts are unavailable for a reference population but its age-specific rates are available.13
Data for spatial analysis
There are usually two important types of spatial data: point and area data. Each item of health data (including population, environmental exposure, mortality and morbidity) may be connected with a point (a precise spatial position such as a home or a street address) or with an area, which could be defined as a spatial region by postcode, ward, local authority, province or country.14 A public health specialist may also come across spatial data in the form of a continuous surface, such as the statistical surfaces of pollution interpolated from fixed-point characteristics.15
As data for spatial analysis come from different sources, and have often been collected without taking into account the interests of geographical epidemiologists,16 it is essential to ensure that precise and complete point and/or area health data are used in spatial epidemiology.17–20 In the developed world, most mortality and cancer incidence data are of good quality. Nevertheless, other health data such as rates of suicide, congenital anomalies and hospital admissions may be subject to partial ascertainment (so that rates are underestimated). In addition, the diagnosis, collection, coding and reporting of a given health outcome may differ between geographical regions and over time.20
The danger of ignoring data-quality issues is that, because of missing cases or inaccurate baseline population data, one might arrive at a misleading (invalid) high or low estimated risk.21,22 Confidentiality may also be an important issue. Breaching the confidentiality of spatial data may cause concern, especially when it discloses areas with high rates of morbidity/mortality or high levels of pollutants.14
Disease clustering
Searching for disease clustering is one of the branches of geographical epidemiology that involves an assessment of local or global accumulation of disease.23 There are different types of clustering, including general and specific. General clustering involves the analysis of the overall clustering tendency of the disease incidence in a study region, and is paralleled by the assessment of global spatial autocorrelation, in which the exact location of clusters is not investigated. The second type of investigation of clustering uses specific disease-clustering methods, which are designed to examine the exact location of the clusters.24 As we will discuss the importance of, and the ways of detecting, global and local clustering in areal data in the section below on spatial autocorrelation, here we will focus only on the detection of clusters in point data.
Methods for the detection of clusters in point format data are more numerous than those for areal format data, and are usually divided into the following three groups: global, localised and focused (ie, assessing clustering around a putative source).25 There are a number of tests available that help to assess different kinds of clusters in point format data. However, we will discuss only three of them very briefly, and refer readers to Bailey and Gatrell,6 and Gatrell et al26 for a complete discussion. Cuzick and Edwards’27 method determines global clustering by examining the k nearest neighbours of each case. The geographical analysis machine28 and the spatial scan statistic29 assess localised clustering by drawing circles of different sizes over the area of study and comparing the risk of disease inside and outside each circle. The spatial scan statistic has an advantage over the geographical analysis machine in taking into account the problems of multiple testing.
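To make the nearest-neighbour idea concrete, the following Python sketch computes a statistic of the Cuzick and Edwards type: for every case, it counts how many of its k nearest neighbours (among cases and controls together) are also cases. The coordinates are invented, the function names are hypothetical, and the significance assessment by random relabelling is an assumption made here for simplicity; it is not claimed to reproduce the published test in every detail.

```python
import math
import random

def cuzick_edwards_tk(points, is_case, k):
    """Sum, over all cases, of the number of cases among the k nearest neighbours."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        if not is_case[i]:
            continue
        # Distances from point i to every other point (cases and controls).
        others = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        total += sum(is_case[j] for _, j in others[:k])
    return total

def permutation_p_value(points, is_case, k, n_perm=999, seed=0):
    """Compare the observed statistic with values under random relabelling."""
    rng = random.Random(seed)
    observed = cuzick_edwards_tk(points, is_case, k)
    labels = list(is_case)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if cuzick_edwards_tk(points, labels, k) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)

# Hypothetical locations: four cases close together, six scattered controls.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 20), (20, 10), (20, 20), (5, 15), (15, 5)]
is_case = [True] * 4 + [False] * 6

t_k, p_value = permutation_p_value(points, is_case, k=2)
print(f"T_k = {t_k}, permutation p = {p_value:.3f}")
```

A large value of the statistic relative to its permutation distribution suggests that cases tend to have other cases as near neighbours, which is what global clustering means in this setting.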
Disease mapping
Data visualisation is the first step in disclosing the complex structure in data.30 Data visualisation may not only create interest and attract the attention of the viewer but also provide a way of discovering the unexpected.31 Although plots of data and other graphical displays are among the fundamental tools for analysts in general, for a spatial analyst, visualising spatial data usually means using a map.6 Disease mapping is one of the branches of geographical epidemiology fulfilling the need to create accurate maps of disease morbidity and mortality.23 For instance, dot or dot-density maps are used to display point data, whereas choropleth maps are used for areal data, and contour or isopleth maps are used for continuous surface data.15 The use of mapping in the medical context has developed so rapidly during recent decades32 that the presentation of maps is now established as a basic tool in the analysis of public health data.23
There are two main classes of disease maps for areal data: maps of standardised rates and maps of statistical significance of the difference between disease risk in each area and the overall risk averaged over the whole map.33 There are pros and cons for each of these classes. For instance, mapping rates in small areas tends to create a misleading picture (see the section on Smoothing), whereas mapping statistical significance, particularly in areas with large populations, produces small p values that indicate statistical significance but do not necessarily disclose scientifically interesting differences.34 The mapping of standardised rates is generally preferred to the mapping of p values, with the influence of sampling variation controlled by using a smoothing technique (see the section on Smoothing).35
There are also other important issues that need to be considered while creating a map. These include the selection of an appropriate administrative unit for mapping, the selection of an appropriate method of data classification in the map, and the selection of an appropriate colour scheme or collection of hatching patterns. We will not discuss these issues in detail here. We will cover the optimum choice of mapping regions very briefly in the following section, and for the other issues refer the readers to other sources, detailed in the reference list.18,36,37
Ecological analysis
Ecological analysis is defined as the assessment of the associations between disease incidence (eg, suicide) and variables of interest (eg, social or environmental covariates).23,38 These variables in an ecological analysis are defined on aggregated groups of individuals rather than the individuals themselves.39 The reason for focusing on the comparison of groups rather than individuals is that individual-level data on the joint distribution of two or more variables within each group are usually missing. Therefore, an ecological study may be considered to be based on an incomplete design.40
An ecological analysis can be crucially dependent on scale (ie, the region based on the hypotheses under study).1 The optimum choice of scale is a trade-off between making the groups (regions) large enough to have stable rate estimates and also small enough to make them homogeneous in terms of their socioeconomic and other important characteristics.19 If the regions are large, there is a greater possibility that associations measured at the aggregate level will differ from the same association measured at individual level. This can lead to a problem known as ecological fallacy,41 or cross-level or ecological bias42—a situation in which one mistakenly infers an individual-level association from one that is actually only observed at the regional level. At the same time, if the regions chosen are too small, the results may show spurious spatial patterns due to random variation in small numbers of events.43,44
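A small numerical illustration may help to show how an aggregate-level association can differ from the individual-level one. The sketch below uses invented (exposure, outcome) pairs for individuals in three hypothetical regions; the `pearson` helper is written out by hand so that nothing beyond plain Python is assumed.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical individual-level (exposure, outcome) pairs by region.
regions = {
    "A": [(1, 3), (2, 2), (3, 1)],
    "B": [(4, 6), (5, 5), (6, 4)],
    "C": [(7, 9), (8, 8), (9, 7)],
}

# Individual-level association within each region.
within = {
    name: pearson([x for x, _ in people], [y for _, y in people])
    for name, people in regions.items()
}

# Ecological association: correlate the regional means only.
mean_exposure = [sum(x for x, _ in p) / len(p) for p in regions.values()]
mean_outcome = [sum(y for _, y in p) / len(p) for p in regions.values()]
ecological = pearson(mean_exposure, mean_outcome)

print("within-region correlations:", within)
print("correlation of regional means:", ecological)
```

In this constructed example the regional means are perfectly positively correlated, while within every region the individual-level correlation is perfectly negative, so an inference about individuals drawn from the regional pattern alone would be mistaken.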
The scale dependency of data may, therefore, cause what is known as the modifiable areal unit problem, which arises from the uncertainty induced by the aggregation procedure.15 For this reason, it is important to take the scale of analyses into account and, if possible, to analyse the data at two or more levels of aggregation.45 It is also possible to overcome both scale dependency and ecological bias by adopting a multilevel approach using the individual-level and group-level data together (see the next section).46 For instance, to understand the effect of income on suicide,47 we have to have data on the individual income and/or household income and area-level income, the latter of which may be “compositional” (ie, fully explained by the individual-level data) or “contextual” (irreducible to the individual—ie, an effect that persists even after allowance for the individual-level data).48
Geographical epidemiology
Geographical epidemiology can be defined as the description of spatial patterns of disease morbidity and mortality, part of descriptive epidemiological studies, with the aim of formulating hypotheses about the aetiology of diseases.49 One can identify different branches in geographical epidemiology, which is a reflection of the different needs of public health specialists and epidemiologists in the assessment of ill-health aetiology.23 Predominant among the methods of geographical epidemiology are the following: disease mapping, disease clustering and ecological analysis.23 There is usually a close relationship between these branches.50
However, as almost all geographical epidemiological studies are descriptive in nature and depend on scale, one should bear in mind that a more comprehensive picture of a spatial problem can be achieved when the results of geographical aggregate-level data are combined with those at the individual level.51 Multilevel modelling, hierarchical regression and contextual analysis are phrases describing one of the various statistical methods in which this combination is allowed.52 Multilevel modelling is a powerful, relatively new technique53 that can be used to determine how much of the ecological effect can be explained by variations in the distribution of individual-level risk factors,52 and recently attempts have been made to integrate this kind of analysis into geographical epidemiology.54,55 There are also new developments incorporating time changes along with spatial variation. Such models are able to provide new insights into the aetiology of diseases that are otherwise unavailable.15
Geographical information systems
GISs can be defined as software systems for the automatic capture, storage, retrieval, analysis and display of spatial data.56 The development of GIS technology dates back to the 1960s.57 GISs have dramatically changed the ability of epidemiologists and public health specialists to work with spatial data.4 The advantages of GISs are many, and include an ability to perform repetitive tasks, quickly compare spatial data and handle large volumes of data.4 Other advantages include the ability to ask “what if” questions (conditional questions; eg, what if we locate a given hospital in place “A” rather than “B”), Boolean searches (finding places that fulfil two or more criteria; eg, areas of high mortality and poverty), creation of “buffer zones” (circles drawn around a point or around the centroid of an area; eg, a 5-km circle around a putative source) and using data from remote sensing and global positioning systems (GPSs).4
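Two of these operations, the buffer zone and the Boolean search, are simple enough to sketch without a GIS. The Python example below is illustrative only: the coordinates, area attributes, thresholds and function names are all invented, and distances are computed with the haversine formula on a spherical Earth, which is an approximation that a real GIS would normally refine.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_buffer(source, points, radius_km):
    """Return the points lying inside a circular buffer around the source."""
    return [
        p for p in points
        if haversine_km(source[0], source[1], p["lat"], p["lon"]) <= radius_km
    ]

# Hypothetical putative source and case locations.
source = (53.48, -2.24)
cases = [
    {"id": "c1", "lat": 53.50, "lon": -2.24},
    {"id": "c2", "lat": 53.55, "lon": -2.24},
    {"id": "c3", "lat": 53.48, "lon": -2.20},
]
inside = [p["id"] for p in within_buffer(source, cases, radius_km=5.0)]

# Boolean search: areas that are both high-mortality and deprived
# (the SMR threshold of 1.2 is an arbitrary illustrative cut-off).
areas = [
    {"name": "area 1", "smr": 1.40, "deprived": True},
    {"name": "area 2", "smr": 1.35, "deprived": False},
    {"name": "area 3", "smr": 0.90, "deprived": True},
]
selected = [a["name"] for a in areas if a["smr"] > 1.2 and a["deprived"]]

print("cases within 5 km of the source:", inside)
print("areas meeting both criteria:", selected)
```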
However, historically, GISs have relied on their mapping capabilities rather than on the performance of statistical analyses. This is clear from the limited number and types of statistical analyses that most GISs are able to perform.3 Until the full integration of spatial statistical analysis into a GIS environment is achieved, other solutions should be applied.3 These include developing a “loose coupling” between a statistical package and a GIS, or a “close coupling” that either builds some statistical functions into the GIS or adds some GIS tools to an analytical package.6 For instance, one of the best-known GIS software packages is ArcView,58 for which a number of links with some statistical packages have been developed.59,60 When GISs are combined with spatial analytical methods, the result could provide a helpful tool in the study of public health issues.61 Nevertheless, the users of GISs and readers of the output should not accept the attractive maps produced by the software uncritically, and they should always remember the rules of good data management, analysis, presentation and interpretation.37,62
Global positioning systems
A GPS consists of a system of at least 24 and up to 32 solar-powered satellites orbiting Earth every 12 h and transmitting radio pulses at very precisely timed intervals.63 To determine a position in three dimensions (latitude, longitude and elevation), a receiver needs signals from at least four satellites.64 GPS has become a standard method for data capture in geographical epidemiology and public health studies.25 Moreover, as the different components of a GPS receiver work efficiently under severe conditions such as sandstorms, torrential rain and high temperatures, they could have a key role in combination with GISs, especially in emergency humanitarian activities.65
Smoothing
Mapping disease mortality or morbidity, especially in the smaller geographical areas, or when the given disease is somewhat rare, may give rise to the problem of small numbers, which in turn produces unstable rates. Although greater stability of rates may be achieved by choosing larger areas, simple mapping of the raw data is unattractive in that it still yields sudden changes at geographical boundaries.66 In such circumstances, it is advantageous to “smooth” the local risk estimate on the basis of the overall pattern of rates.6
The basis of this technique is that when the underlying population of a given area is large and as a result the statistical error of the rate estimate is small, the adjusted rate will be close to the observed rate. However, when the underlying population is small and, therefore, the statistical error correspondingly large, the observed rate is shrunk by smoothing towards a value representing the overall mean of the map.67 If spatial autocorrelation tests confirm that there is a spatial dependency (see the next section), the rates can be adjusted towards averages of neighbouring rates rather than the overall mean.67 When we wish to improve the quality of a rate estimate for an area with an unstable rate by “borrowing strength” from its neighbours, a Bayesian analysis may be applied.68
When using smoothing, we are in effect making a prior assumption that a rate estimate for a given area is better if it in some way combines data from the area itself with those from the surrounding areas. A Bayesian analysis is one way of achieving this combination. In a Bayesian analysis, an assumed (prior) probability distribution for the values of a parameter (the area rate) is converted (under the influence of the observations, ie, the observed rates) to a posterior (ie, after using the observed data) distribution for the values of that parameter. This posterior distribution is then used to provide an estimate for the parameter (the estimated rate for a given area) together with a standard error for this estimate. With such an approach, the prior distribution can be based on the results of previous studies or on background knowledge. It is also possible to base this distribution on particular global aspects of the data currently at hand. The latter approach is usually referred to as empirical Bayes estimation.6 There is a non-iterative empirical Bayes method of moments for smoothing the rates towards either a local or a global mean.69 Although this method is useful for estimating the relative risks of a given disease, and in this sense functions in a similar way to the fully Bayesian approach, the latter produces more informative interval estimates.70,71 The technical details of Bayesian methods are well beyond the scope of this article but, in essence, the Bayesian approach uses the observed data to update prior knowledge. If there is a large number of observations, then the prior knowledge has little influence (ie, the observed rates provide good estimates); if not, the prior knowledge is used to reduce (smooth) the sampling fluctuations between the unreliable observed rates.
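The shrinkage behaviour described above can be illustrated with a short Python sketch of a global method-of-moments empirical Bayes smoother. The formulation used here (a population-weighted variance of the raw rates, reduced by the expected sampling variance, then used as a shrinkage weight) is one commonly quoted form; whether it coincides in every detail with the method cited in the text is an assumption, and the counts, populations and function name are invented.

```python
def eb_smooth(cases, populations):
    """Shrink raw area rates towards the global mean rate (method of moments)."""
    total_pop = sum(populations)
    global_rate = sum(cases) / total_pop
    raw = [o / n for o, n in zip(cases, populations)]

    # Population-weighted variance of the raw rates around the global rate.
    weighted_var = sum(
        n * (r - global_rate) ** 2 for r, n in zip(raw, populations)
    ) / total_pop
    mean_pop = total_pop / len(populations)

    # Estimated between-area variance; truncated at zero if negative.
    between_var = max(weighted_var - global_rate / mean_pop, 0.0)

    smoothed = []
    for r, n in zip(raw, populations):
        # Weight near 1 for large populations, near 0 for small ones.
        if between_var > 0:
            weight = between_var / (between_var + global_rate / n)
        else:
            weight = 0.0
        smoothed.append(global_rate + weight * (r - global_rate))
    return raw, smoothed, global_rate

# Hypothetical counts and populations for four areas.
cases = [5, 20, 400, 90]
populations = [500, 10_000, 100_000, 50_000]

raw, smoothed, global_rate = eb_smooth(cases, populations)
for n, r, s in zip(populations, raw, smoothed):
    print(f"population {n:>7}: raw {r:.4f} -> smoothed {s:.4f}")
```

In this example the area with only 500 residents is pulled most of the way towards the overall rate, whereas the area with 100 000 residents is left almost unchanged.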
Spatial autocorrelation
Lack of independence of data from neighbouring areas gives rise to spatial autocorrelation. The correlation or dependency implies that rates for geographically close areas are more highly related than those from areas that are geographically distant.72 For instance, suicide rates in neighbouring areas are likely to be more similar than those in distant ones.15 This is because neighbouring areas may have similar underlying social, economic and cultural characteristics that trigger suicidal behaviour. Detecting spatial dependency, which is accomplished by the use of spatial autocorrelation statistics, would help researchers to justify their selected regression models in an ecological analysis, or their smoothing techniques when mapping a rare disease or when mapping small areas (see the sections on ecological analysis, spatial regression and smoothing).73
Spatial autocorrelation statistics provide very useful summary information about the spatial arrangement of data in a map.73 Some of these statistics compare neighbouring area values to assess the level of large-scale or global clustering. Whenever a large number of neighbouring areas have either reasonably large or small values, large-scale clustering may be detected.4 The two most commonly used spatial autocorrelation statistics for detecting global clustering in continuous areal data (ie, morbidity and mortality rates) are the I statistic, developed by Moran,74 and Geary’s c statistic.75 There are also a number of spatial autocorrelation statistics (eg, Getis and Ord’s G* statistic) that measure the amount of local clustering (ie, hot spots of high or low values) by finding any association between a value at a particular area and values of adjacent or nearby areas.25,76
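As an illustration of a global statistic, Moran's I can be written as I = (N / W) × Σᵢⱼ wᵢⱼ (xᵢ − x̄)(xⱼ − x̄) / Σᵢ (xᵢ − x̄)², where N is the number of areas, wᵢⱼ are the spatial weights and W is their sum. The Python sketch below applies this formula to four hypothetical areas arranged in a row, with binary adjacency weights; the rates and the weight matrix are invented, and no significance test is attempted.

```python
def morans_i(values, weights):
    """Moran's I for area values and a square matrix of spatial weights."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    weight_sum = sum(sum(row) for row in weights)
    cross_products = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    sum_of_squares = sum(d * d for d in dev)
    return (n / weight_sum) * cross_products / sum_of_squares

# Binary adjacency for four areas in a row: 1-2, 2-3 and 3-4 share a border.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = [10, 9, 2, 1]    # similar rates sit next to each other
alternating = [10, 1, 9, 2]  # high and low rates alternate

i_clustered = morans_i(clustered, adjacency)
i_alternating = morans_i(alternating, adjacency)
print(f"clustered pattern:   I = {i_clustered:.3f}")
print(f"alternating pattern: I = {i_alternating:.3f}")
```

Positive values indicate that neighbouring areas tend to have similar rates, and negative values that they tend to differ; in practice the observed value is compared with its distribution under the hypothesis of no spatial dependence.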
Spatial regression
Typically, we use some form of regression model in an ecological analysis to predict, for example, suicide rates in given areas from the areas’ other attribute data such as poverty or social cohesion15 (see the section on ecological analysis). In such situations,77 we usually divide the study region into a set of non-overlapping, administratively defined areas, and model the counts of the number of cases within each area.78 These will be accompanied by information on the population at risk in different relevant age and sex groups in addition to other factors such as socioeconomic status. If the risk within a given area is constant, the distribution of the count for that area is a binomial distribution.34 However, if the risk is small (eg, suicide mortality), one may approximate the binomial distribution by the Poisson distribution.17 If there is evidence that the Poisson model does not fit well, this indicates that there is a component of risk that has not been incorporated into the model. This is an example of “extra-Poisson” variation or overdispersion.79 The overdispersion may arise for several reasons, including large numbers of cells having zero counts, which may happen when a rare disease is being studied.36 To cope with this problem, the Poisson model can be replaced by the negative binomial regression model.80
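One simple way to look for extra-Poisson variation is to compare the Pearson chi-squared statistic of a fitted Poisson model with its degrees of freedom. The Python sketch below does this for the simplest possible model, in which each area's count is assumed to be Poisson with mean equal to a common relative risk multiplied by the area's expected count. The data and the function name are invented, and the single-parameter model is an assumption made only to keep the example short; a real analysis would include covariates.

```python
def pearson_dispersion(observed, expected):
    """Pearson chi-squared divided by degrees of freedom for a one-parameter model."""
    # Maximum-likelihood estimate of the common relative risk.
    relative_risk = sum(observed) / sum(expected)
    fitted = [relative_risk * e for e in expected]
    chi_squared = sum((o - mu) ** 2 / mu for o, mu in zip(observed, fitted))
    degrees_of_freedom = len(observed) - 1  # one parameter estimated
    return chi_squared / degrees_of_freedom

# Hypothetical counts for eight areas, each with three expected cases.
observed = [0, 0, 12, 1, 0, 9, 0, 2]
expected = [3.0] * 8

dispersion = pearson_dispersion(observed, expected)
print(f"dispersion statistic: {dispersion:.2f}")
```

A value close to 1 is consistent with Poisson variation, whereas a value well above 1, as in this constructed example with many zero counts and a few large ones, suggests overdispersion.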
If exploratory techniques reveal spatial dependence (see the section on spatial autocorrelation), a richer model involving spatial autocorrelation may be fitted.34 There are two empirical Bayesian models in which the spatial dependency of the data can be taken into account.81 The first model is a simultaneous autoregressive model,82 and the second is a conditional autoregressive model.83 The conditional autoregressive model provides a more general framework with less complexity, and is therefore preferred to the simultaneous autoregressive model.77 More recently, the WinBUGS software has made it possible to fit the conditional autoregressive model in a fully Bayesian analysis.84
We thank the two anonymous referees who provided helpful comments on an earlier draft of this glossary.
Competing interests: None declared.