Article Text
Abstract
Study objective: To present a conceptual framework for testing differences in mortality for small geographical areas over time using the generalised linear model with generalised estimating equations. This framework can be used to test whether the magnitude of regional inequalities in health status has changed over time.
Design: A Poisson regression model for correlated data is used to investigate the relation of population health status to demographic, geographical, and temporal explanatory variables. Differences between regions at one or more points in time are tested with linear contrasts.
Setting and participants: A case example shows the application of the framework. All cause mortality and cause specific mortality were compared for three rural regions of Manitoba, Canada between 1985 and 1999. The data were obtained from Vital Statistics records and the provincial health registry.
Main results: Tests of linear contrasts on the regression coefficients for time and region show an increase in the magnitude of the difference in the risk of all cause mortality and heart disease mortality between northern and southern regions of the province for the 1985–1989 and 1995–1999 time periods. No significant differences are identified for cancer, injury, or respiratory disease mortality.
Conclusions: The proposed framework enables testing of a variety of hypotheses about differences between regions and time periods and can be applied to other measures of population health status.
 small area analysis
 longitudinal data
 generalised linear model
 GEE, generalised estimating equation
 GLM, generalised linear model
 RHA, regional health authority
 PMR, premature mortality rate
Statistics from Altmetric.com
 GEE, generalised estimating equation
 GLM, generalised linear model
 RHA, regional health authority
 PMR, premature mortality rate
Researchers and public health officials have long been interested in identifying inequalities in health status for small geographical areas.^{1–}^{4} Persistent differences between areas, particularly those that appear to be increasing in magnitude over time, may be the basis for resource allocation decisions and public health surveillance activities at both the local and national levels. They may also be the impetus for further research on the determinants of health, including studies of small area variations in health services utilisation, socioeconomic characteristics, and environmental exposures.^{5}
Descriptive plots of age standardised death rates over time for two or more areas provide only the crudest indication of whether health inequalities exist. In small populations, a single death in a given area and/or year may have a large impact on the observed rates. This inherent variability in rates for small geographical areas may obscure trends in the data. As well, descriptive analyses of the data are of limited value for assessing whether the magnitude of the difference between regions is changing over time, that is, whether there is an interaction between geography and time.
Regression techniques have been used to examine differences in mortality for small areas. However, researchers may be unfamiliar with techniques for correlated data that can be used to model longitudinal data and with the formulation of hypotheses on the interaction regression coefficients that are used to test for differences between small areas. Longitudinal data can be modelled as a function of demographic, geographical, and temporal explanatory variables using a statistical procedure that accounts for the correlation between successive time points. Hypotheses on the time x region regression coefficients are then used to test for differences between regions over time.
This paper presents a framework for applying regression techniques to correlated data to test for regional differences in mortality. It is developed within the context of the generalised linear model (GLM), which has been applied to longitudinal data for a variety of health indicators.^{6} The GLM framework subsumes a broad class of statistical models, including the Poisson model, which is often used to describe the distribution of rare population events such as death.^{7–}^{13} Within this framework, we show how to test hypotheses of differences between regions and time periods. A case example builds on 15 years of mortality data for the rural regional health authorities (RHAs) of Manitoba, Canada.
MODEL BASED FRAMEWORK
We begin by defining the model for the simplest case, where counts of the number of deaths in a region are obtained at a single point in time. Let Y_{i}, n_{i}, and r_{i} = Y_{i}/n_{i} respectively represent the number of deaths, total population size, and observed mortality rate for the ith stratum (i = 1, ..., n). The strata are created by defining discrete categories for the values of each model covariate (that is, age group, region) and then classifying the population into cells by forming all possible combinations of these covariate categories.
It is assumed that the Y_{i}s are independently distributed as Poisson variates. The expected number of deaths in each stratum, E(Y_{i}) = μ_{i}, is modelled as
where X_{i} is the vector of predictor variables that describes each stratum, β is a pvector of population regression parameters, and λ(X_{i}β) is the underlying rate function which is estimated by r_{i}. This multiplicative model can be expressed as a generalised linear model (GLM), which relates the expected value of the outcome value to the predictor variables via a link function. For the Poisson distribution the log link function is used,
When the data are classified into strata on the basis of age group and region in a two way contingency table, the GLM model can be expressed as
where μ_{0} is the model intercept, α_{j} is the “effect” of the jth age group (j = 1, ..., J) and δ_{k} is the effect of the kth region (k = 1, ..., K).
To extend this model to longitudinal data, let Y_{i} = [Y_{i}_{1} ... Y_{it}]^{T}, represent the vector of counts of the number of deaths for the ith stratum for t points in time, where A^{T} is the transpose of A. Furthermore, let μ_{i} = [μ_{i}_{1}μ_{i}_{2}, ..., μ_{it}]^{T} denote E(Y_{i}^{T}). It is assumed that the elements of Y_{i} are correlated, where V_{i} represents the covariance matrix. The model is
where X_{it} is the p vector of covariates for Y_{it}. Generalised estimating equations (GEEs) were developed so that the GLM could be applied to correlated data.^{14} GEEs are a series of equations, which are solved iteratively to estimate β when the structure of V_{i} is specified. GEEs account for the correlation between successive time points, ensuring that the standard errors of the regression coefficients are correctly estimated and tests of these coefficients are unbiased and efficient. A variety of structures that can model the relation between successive time points are described by Carriere et al.^{6}
Using the notation of equation 3, a model for longitudinal mortality data is
where γ_{m} is the effect associated with the mth time period (m = 1, ..., t) and η_{km} is the effect of the kth region and mth time period. The region × time interaction term is the basis for testing hypotheses about changes in regional differences over time. Additional two way and three way interaction terms may be included in the model. The addition of interaction terms will be guided by the hypotheses of interest, theoretical considerations, and their contribution to model fit.
Only rarely will the researcher be interested in testing the omnibus hypothesis for an interaction. Rather, contrasts, which are linear combinations of the regression coefficients, will be of greatest interest. To test for a difference between two regions, k and k′, in a single year while holding age constant*, the null hypothesis is H_{0}: ψ_{kk′} = 0, where
which can be estimated by ψ̂_{kk′} = δ̂_{k} + η̂_{km} − δ̂_{k′} − η̂_{k′m}. A Wald statistic,
can be used to test the null hypothesis. The statistic Z^{2}_{kk′} is referred to the critical value, χ^{2}[1−α; 1], the (1−α) centile of the χ^{2} distribution with one degree of freedom. Exp(ψ̂_{kk′}) represents the ratio of the average mortality risk for region k’ to the average for region k in the mth time period, holding age constant. A positive value for Exp(ψ̂_{kk′}) indicates that the relative risk of mortality for region k’ is higher than that of region k at this single point in time.
To test for a difference between two regions, k and k′, and two years, m and m′, the null hypothesis is H_{0}: ψ_{kk′mm′} = 0, and the contrast is given by
which can be estimated by ψ̂_{kk′mm′} = η̂_{km} − η̂_{k′m} − η̂_{km′} + η̂_{k′m′}. Exp(ψ̂_{kk′mm′}) represents the ratio of the difference in the average mortality risk for time periods m and m′ for region k’ to the difference in the average mortality risk for time period m and m′ for region k, when other covariates in the model are held constant. A positive value indicates that the difference in the relative risk of mortality for region k’ is greater than the difference in relative risk for region k.
Hypotheses of average differences for multiple study years for two regions may also be of interest. For example, one might test the null hypothesis that the average mortality risk for region k is equivalent to that of region k′ for years m_{1} and m_{2,} and years m_{3} and m_{4}. Using the contrast of equation 8 as a guide, specification of the null hypothesis and formation of the contrast should be apparent.
CASE EXAMPLE
The example to illustrate the application of a GLM framework with GEEs to test for differences in mortality between regions over time is based on 15 years of data for regional health authorities (RHAs) in rural Manitoba. Previous research indicates that standardised rates of premature mortality (that is, death before age 75) were higher in the northern region of the province (that is, the region north of the 53rd parallel) than in the southern and central regions during this time period. There is also some suggestion that the rates in the southern and central regions decreased over this period of time while the rate for the north either remained static,^{15} or increased slightly.^{16} The premature mortality rate (PMR) is a well recognised indicator of population health status and a population’s need for health care.^{17–}^{19} RHAs where PMR is higher than the provincial average generally have poorer population health status, whereas RHAs with PMRs lower than the provincial average generally have better population health status. The standardised PMR data provide preliminary descriptive evidence that the magnitude of the difference in mortality for the northern and southern regions of the province is growing over time. A model based analysis was used to test this assertion for both all cause and cause specific mortality. The hypothesis of interest was that the difference between the northern region (that is, the RHAs with the highest PMR) and the southern region (that is, the RHAs with the lowest PMR) increased between two time periods: 1985–1989 and 1995–1999.
Study data were obtained from Vital Statistics records and the Manitoba Health Services Insurance Plan (MHSIP) Registry. The former was used to compile inprovince deaths for all Manitoba residents on an annual basis for the period from 1 January 1985 to 31 December 1999. All cause mortality was examined in addition to specific causes identified using the 9th version of the International Classification of Diseases. These included cancer (ICD9 140–208), heart disease (ICD9 390–459), injury (ICD9 E800–E999), and respiratory disease (ICD9 460–519). The MHSIP Registry was used to obtain population counts for regions. Ethics approval for data access was obtained from the Health Research Ethics Board of the University of Manitoba; the Manitoba Health Information Privacy Committee was kept informed of the research.
The data were classified into strata on the basis of age group, sex, and region. For all cause mortality, five year age groups were chosen to describe the effect of age on mortality: 0–4, 5–9, ... 80–84, 85 and older.^{13,}^{20} For cause specific mortality, because of the small numbers of deaths in some of the five year age strata, the data were instead assigned to four age groups: under 45 years, 45–64, 65–74, and 75 years and older. The 11 rural RHAs of Manitoba were classified into three health status regions based on the work of Roos et al.^{16} North or highest PMR (NorMan, Burntwood, Churchill), Central or average PMR (Marquette, Parkland, North Eastman, Interlake), and South or lowest PMR (Central, South Eastman, South Westman, Brandon). This classification is based on standardised PMR rates at the start of the study period; each region of the province is comprised of a set of geographically contiguous RHAs.
Key points

Regional differences in population health status are increasing in many jurisdictions.

To test hypotheses about the magnitude of these differences over time, researchers can define a generalised linear model for correlated data and one or more linear contrasts on the regression model coefficients.
To model the covariance across time an exchangeable structure was adopted; it assumes equivalent correlations between successive study years. This structure seemed the most appropriate given descriptive analyses of the data that showed a relatively constant degree of correlation across successive study years, and previous research.^{16} An unstructured covariance matrix could not be fitted to the data because of the large number of parameter estimates required, and a first order autoregressive structure did not produce appreciable differences in the standard errors of the regression coefficients. Close agreement between the true covariance structure and V_{i} is desirable from the point of view of efficiency of estimation but is not necessary for consistency of estimation.^{15}
The final model contained the main effects of age group, sex, region, year, and the region x year interaction. All predictor variables were categorical. Additional two way interactions were considered in preliminary models, but were not retained because they did not improve the fit of the model and were not statistically significant predictors of mortality. Our goal was to fit the most parsimonious model given the research objectives. The GENMOD procedure of SAS was used to perform all analyses.^{21}
Table 1 summarises the number of deaths for each year and each of the three regions of the province for the 15 year study period for three age groups: 0 to 44 years, 45 to 74 years, and 75 years and older. The population, and hence the number of deaths, is substantially lower in the northern region of the province than in the southern region. Random variation in the age and sex standardised rates, particularly for the north, makes it difficult to determine whether the difference between these two regions of the province is increasing or decreasing over time (see fig 1).
The results for the Poisson regression analyses are presented in figure 2 and in table 2. Figure 2 graphically displays the difference in the relative risk of all cause mortality between the North and South regions for each year of the study period. These results were obtained by applying the contrast of equation 6 to the data for each of the 15 study years. They suggest a difference between the two regions that may be increasing in magnitude over time.
We tested the null hypothesis of no difference in the relative risk between the two regions in the first and last five years of the study period, that is, H_{0}: ψ_{SN(1985–89)(1995–99)} = 0, where
The results in table 2 indicate a statistically significant result was obtained for all cause mortality and for heart disease mortality, but that no differences were detected for cancer, injury, and respiratory disease mortality. These results suggest that regional differences in heart disease mortality have contributed to the increased disparity in overall mortality between the northern and southern regions of the province over time.
DISCUSSION
The purpose of this research was to present a model based framework for correlated data that can be used to test for differences in mortality between geographical areas over time. This framework provides one analytical tool for examining small area inequalities in health status.
The GLM with GEEs was used to describe the relation between counts of the number of deaths in population strata and the explanatory variables of age, region, and time. By including age as an explanatory variable in the model, the researcher accounts for differences in the demographic structure of small areas that may affect the number of deaths.
Random variation in the number of deaths for small areas means that mortality rates may fluctuate substantially from one year to the next. This may mask the identification of trends in the data or differences in mortality between regions. The adoption of significance tests within the GLM framework permits the researcher to test whether differences between regions are in fact attributable to these chance fluctuations. As this paper has shown, the interaction between geography and time must be probed to determine whether the magnitude of the differences between regions is increasing or decreasing.
While the application of a Poisson model to mortality data is well reported in the literature, there is less information on using GEEs to incorporate information on the correlation structure of the data. The correct standard error for the contrast test statistic is obtained when information on the association between successive points in time is modelled.
There are a number of hypotheses that can be examined within the context of this framework. The mortality rate for one region could be compared with the average rate for two or more different regions. Hypotheses of differences between individual study years, average differences between two or more years, or of linear trends across time periods may be tested when time is treated as a categorical variable in the model. When time is specified as a continuous variable, the regression coefficient represents the average change per unit of time. Differences between regions may be tested, but not differences between time periods. Time may also be treated as a continuous variable, but in a piecewise fashion. A categorical variable is used to distinguish successive time periods (that is, before and after periods). Hypotheses of differences between periods in the average change per unit of time are tested. Such an approach has been adopted in both clinical and epidemiological research to examine critical phases of change.^{2,}^{22} The choice among these approaches will depend on the objectives of the research.
Finally, such a framework based on the GLM can be applied to other measures of population health status, including incidence or prevalence of chronic health conditions such as diabetes or respiratory illness.^{23} To extend the framework, the researcher must be able to identify the distribution of the data, the link function that relates the expected value of the dependent variable to the independent variables using a linear model, and the correlation structure of the longitudinal data. As well, the researcher must be able to articulate one or more hypotheses regarding differences between regions and time periods.
Acknowledgments
The results and conclusions are those of the authors; no official endorsement by Manitoba Health was intended or should be implied. We are indebted to Health Information Services, Manitoba Health, for providing data. Thanks to JoAnne Baribeau for manuscript preparation.
REFERENCES
Footnotes

↵* We have removed the regression coefficients for age from the contrast specification when this explanatory variable is held constant.

Funding: this research was supported by a grant to the first author from the Canadian Institutes of Health Research (CIHR; no FRN62336), a CIHR New Investigator award to the third author, and a grant to the fourth author from the Canadian Population Health Initiative.