Changing effect of the numerator–denominator bias in unlinked data on mortality differentials by education: evidence from Estonia, 2000–2015

Background This study highlights changing disagreement between census and death record information in the reporting of the education of the deceased and shows how these reporting differences influence a range of mortality inequality estimates.
Methods This study uses a census-linked mortality data set for Estonia for the periods 2000–2003 and 2012–2015. The information on the education of the deceased was drawn from both the censuses and death records. Range-type, Gini-type and regression-based measures were applied to measure absolute and relative mortality inequality according to the two types of data on the education of the deceased.
Results The study found a small effect of the numerator–denominator bias on unlinked mortality estimates for the period 2000–2003. The effect of this bias became sizeable in the period 2012–2015: in the high education group, mortality was overestimated by 23–28%, whereas the middle education group showed notable underestimation of mortality. The effect was small for the lowest education group. These biases led to substantial distortions in range-type inequality measures, whereas unlinked and linked Gini-type measures showed somewhat closer agreement.
Conclusions The changing distortions in the unlinked estimates reported in this study warn that this type of evidence cannot be readily used for monitoring changes in mortality inequalities.


INTRODUCTION
Monitoring socioeconomic inequalities in mortality is a crucial component of designing appropriate policies promoting more sustainable health development. 1 2 However, producing reliable evidence about the magnitude of and changes in mortality inequalities requires precise register-based or census-linked data. Such data covering entire populations are still missing for many developed countries. A widely used alternative in these cases is to rely on cross-sectional unlinked data based on separate tabulations of deaths and population exposures by socioeconomic group. The major problem with unlinked data is the disagreement between the information recorded on death records and in the census. [3][4][5][6] The sociodemographic information provided on death certificates is considered to be of lower quality due to a higher probability of misreporting by proxy informants. 7 8 The mismatch between the sources of information used for the numerators and denominators of death rates may lead to distortions in aggregated mortality and inequality estimates.
Matching studies checking the validity of sociodemographic information on death records are scarce. [9][10][11][12][13] To our knowledge, the only evidence on the importance of numerator-denominator bias in Eastern Europe comes from two studies on Lithuania. 10 13 These studies found substantial misreporting of education and ethnicity on death records, which biased group-specific mortality estimates and misrepresented the gradient of inequality.
This study extends prior evidence about the numerator-denominator bias in unlinked data by providing new evidence based on data for Estonia, with a special focus on the change in the size of the bias over time. In addition, the current study broadens the scope of previous analyses by performing systematic sensitivity testing of a wider range of inequality measures.

DATA AND METHODS
This study uses an aggregated census-linked mortality dataset provided by Statistics Estonia. These data were compiled from longitudinal mortality follow-up studies based on the 2000 and 2011 censuses. All permanent residents of Estonia taking part in both censuses were followed from the census dates (31 March 2000 and 31 December 2011) until the date of death or the end of the follow-up period (31 December 2003 and 31 December 2015, respectively). Of all death records, 95-98% were successfully linked to the preceding censuses. For the analyses, the data were organised into two periods (2000-2003 and 2012-2015). The age-specific population exposures by education used to calculate both census-linked and unlinked mortality estimates were obtained by aggregating person-years lived by individuals during the period of observation (also accounting for the change in exact age within each year of observation). Deaths were grouped according to the exact age at death.
For linked estimates, education of the deceased was derived from the census at the beginning of follow-up. For unlinked estimates, education of the deceased was taken from death records. For both linked and unlinked estimates, education-specific person-years of exposure were calculated according to the census information on education and subsequent follow-up information. The original educational coding in these variables was reclassified into three broad categories of the International Standard Classification of Education 2011 (ISCED11): (1) low education combining primary and lower secondary education (ISCED11 categories 0-2); (2) middle education combining upper secondary and postsecondary non-tertiary education (ISCED11 categories 3-5); (3) high education referring to tertiary education (ISCED11 categories 6-8). For a better match with death records, ISCED11 category 5 was combined with middle education. The percentage of records with missing education was very low for both the census and death record information (0-0.8%), with the exception of unlinked deaths for the period 2012-2015 (education was missing for 13% of death records for males and 14% of death records for females). For this period, deaths and person-years of exposure with unknown education were redistributed using a conservative approach assuming a proportional distribution across the three educational categories (online annex table 1). In all the remaining cases, the negligible numbers of deaths and exposures with unknown education were excluded from the analyses.
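The proportional redistribution of records with unknown education can be sketched as follows. This is a minimal illustration only: the function name `redistribute_unknown` and all counts are hypothetical, and applying the rule separately within each age-sex cell is an assumption rather than a documented detail of the study.

```python
def redistribute_unknown(counts, unknown):
    """Spread `unknown` deaths (or person-years) across the known groups
    in proportion to each group's share of the known total."""
    known_total = sum(counts.values())
    if known_total <= 0:
        raise ValueError("need a positive known total to redistribute")
    return {group: value + unknown * value / known_total
            for group, value in counts.items()}


# Hypothetical death counts by education on death records for one
# age-sex cell (not taken from the study).
deaths = {"low": 500.0, "middle": 300.0, "high": 200.0}
adjusted = redistribute_unknown(deaths, unknown=150.0)
print(adjusted)
```

Under this rule the relative sizes of the three groups are unchanged, which is why the approach is neutral with respect to the educational gradient among records with known education.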
Education-specific mortality for males and females was measured by age-standardised death rates (SDRs) using the WHO European Population (1976) as a standard. Relative mortality differences were assessed using age-adjusted Poisson regression mortality rate ratios (MRRs). More advanced numerically calculated inequality measures (average intergroup difference (AID) and Gini coefficient) were applied to account for the total amount of inequality across all educational groups and group-specific weights in the population (online annex table 2). 14 15 Regression-based inequality measures (Slope Index of Inequality (SII) and Relative Index of Inequality (RII)) were calculated using a common algorithm described by Anand et al. 14 The public health impact of inequality was estimated using population-attributable fractions (PAFs). 14

RESULTS
Table 1 provides aggregated education-specific mortality estimates by education as given on census and death records in Estonia in the periods 2000-2003 and 2012-2015. The results reveal quite a small effect of the numerator-denominator bias on unlinked mortality estimates in the first period and a pronounced discrepancy between the linked and unlinked education-specific mortality estimates in the second period. The high education group was the most affected, with mortality overestimated by 23-28% in the period 2012-2015. Meanwhile, the unlinked SDRs for males and females with middle education for the same period were underestimated. The most striking case concerns females aged 65 years and over with middle education in the period 2012-2015, who showed lower mortality than females with high education. The discrepancies were surprisingly small for the lowest education group, except for females aged 30-64 years in the period 2012-2015 (table 1).
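To make the size of the bias in an education-specific SDR concrete, the sketch below applies direct standardisation to linked and unlinked death counts that share the same census-based person-years. All numbers, the three age bands and the standard weights are hypothetical placeholders (they are not the WHO European Population weights and not values from table 1), and the helper names `sdr` and `relative_bias` are introduced here for illustration only.

```python
def sdr(deaths, person_years, standard_weights):
    """Directly age-standardised death rate per 100,000 person-years."""
    total_weight = sum(standard_weights)
    weighted = sum(w * d / py
                   for d, py, w in zip(deaths, person_years, standard_weights))
    return 100_000 * weighted / total_weight


def relative_bias(unlinked, linked):
    """Percentage by which the unlinked estimate deviates from the linked one."""
    return 100 * (unlinked - linked) / linked


# Hypothetical inputs for one education group, three age bands.
weights = [3, 2, 1]                   # illustrative standard population weights
person_years = [10_000, 8_000, 5_000] # census-based exposure, same for both estimates
linked_deaths = [20, 40, 100]         # education taken from the census
unlinked_deaths = [25, 50, 125]       # education taken from death records

sdr_linked = sdr(linked_deaths, person_years, weights)
sdr_unlinked = sdr(unlinked_deaths, person_years, weights)
print(sdr_linked, sdr_unlinked, relative_bias(sdr_unlinked, sdr_linked))
```

Because the denominators are identical, any difference between the two SDRs in this sketch arises purely from how deaths are allocated to the education group, which is the numerator side of the numerator-denominator bias.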
The observed biases in education-specific unlinked mortality estimates produced substantial distortions in the corresponding range-type measures of mortality inequality (table 2). For both males and females aged 30 years and over, the maximal absolute difference in SDRs according to the unlinked data was significantly underestimated, especially for females in the second period. Meanwhile, MRRs were quite similar for the period 2000-2003 and remarkably different for the period 2012-2015. The most significant distortion in the unlinked MRRs was observed for females with middle education, leading to an artificial advantage of this group over the highest education group. In all the remaining cases, the MRRs based on unlinked data for 2012-2015 were notably lower than those derived using linked data.
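The way group-specific biases propagate into range-type measures can be illustrated with a small sketch. The SDR values are hypothetical and chosen only to mimic the direction of the distortions described above. The rate ratios here are simple ratios of SDRs, a simplification of the Poisson regression MRRs used in the study, and `range_measures` is a name introduced for this example.

```python
def range_measures(sdr_by_group, reference="high"):
    """Return the maximal absolute SDR difference and the rate ratios
    of each group relative to the reference group."""
    ref = sdr_by_group[reference]
    max_diff = max(sdr_by_group.values()) - min(sdr_by_group.values())
    ratios = {group: rate / ref
              for group, rate in sdr_by_group.items() if group != reference}
    return max_diff, ratios


# Hypothetical SDRs per 100,000 (not taken from table 2).
linked = {"low": 2000.0, "middle": 1200.0, "high": 800.0}
# Unlinked: high group overestimated, middle group underestimated.
unlinked = {"low": 2000.0, "middle": 950.0, "high": 1000.0}

linked_diff, linked_ratios = range_measures(linked)
unlinked_diff, unlinked_ratios = range_measures(unlinked)
print(linked_diff, linked_ratios)
print(unlinked_diff, unlinked_ratios)
```

In this constructed case the unlinked ratio for the middle group falls below 1, reproducing the kind of artificial advantage over the highest education group that misallocated deaths can create.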
We found that using numerically calculated inequality measures (AID and Gini), which account for mortality rates and population weights for each educational group, may lead to a somewhat closer agreement between the linked and unlinked inequality measures. The biggest difference was detected when comparing AID and Gini coefficients for females aged 30-64 years. In this case, underestimation of total mortality variation by education using unlinked data was about 20%. The corresponding disagreement was much lower for males in the same age group and for both sexes at ages 30+ and 65+. Interestingly, the regression-based inequality measures (SII and RII) showed more pronounced discrepancies. Our final comparison, examining PAFs, warns that the population-based mortality burden due to educational inequalities estimated according to unlinked data was vastly undercounted in the second period.
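A minimal sketch of population-weighted inequality measures is given below. It assumes the standard mean-difference form of the AID, defines the Gini coefficient as the AID divided by the population mean rate, and defines the PAF as the share of overall mortality that would be avoided if all groups had the rate of the reference (high education) group. The exact formulas used in the study are given in online annex table 2, which is not reproduced here, so these definitions and all numbers should be read as assumptions.

```python
def mean_rate(rates, shares):
    """Population-weighted mean rate; `shares` should sum to 1."""
    return sum(p * r for p, r in zip(shares, rates))


def aid(rates, shares):
    """Average intergroup difference: 0.5 * sum_i sum_j p_i p_j |r_i - r_j|."""
    return 0.5 * sum(pi * pj * abs(ri - rj)
                     for ri, pi in zip(rates, shares)
                     for rj, pj in zip(rates, shares))


def gini(rates, shares):
    """Relative counterpart of the AID (AID divided by the mean rate)."""
    return aid(rates, shares) / mean_rate(rates, shares)


def paf(rates, shares, reference_rate):
    """Fraction of overall mortality attributable to rates above the reference."""
    overall = mean_rate(rates, shares)
    return (overall - reference_rate) / overall


# Hypothetical SDRs per 100,000 and population shares (low, middle, high).
rates = [2000.0, 1200.0, 800.0]
shares = [0.2, 0.5, 0.3]

print(aid(rates, shares), gini(rates, shares), paf(rates, shares, rates[2]))
```

Because each pairwise difference is weighted by the product of the population shares, a group with a large share and a well-estimated rate has a strong influence on the AID and Gini, whereas range-type measures depend only on the extreme groups.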

DISCUSSION
The study found that the growing effect of misreporting of education on death records in Estonia had a substantial impact on the decreasing quality of education-specific mortality estimates based on unlinked data. This bias was also responsible for distortions in the magnitude and even direction of change in mortality inequalities. This finding is a warning sign against using unlinked estimates for monitoring changes in mortality inequality. A slightly better agreement was achieved using more advanced numerically calculated Gini-type measures of inequality (except for females aged 30-64 years). The advantage of the AID and Gini coefficient is probably related to a very good agreement between the unlinked and linked SDRs for the lowest educational group, which has larger population weights.
The observed distortions in education-specific mortality estimates derived from the unlinked data using death record-based information about education can be attributed to a variety of changeable factors. First, notable discrepancies may occur due to differences in the design and wording of questions on education in the census and on death records. 10 As in other countries, census questions in Estonia were more detailed and better suited to classifying a person's own education within the different educational systems functioning during various historical periods. This design contrasts with the less detailed questions available on death records. Unlike death records, the census records also specify the entry level for each educational level. Studies suggest that reported information on death records may depend on the sociodemographic characteristics of proxy informants and the deceased. 5 10 For example, the Lithuanian study shows that misreporting of education increases with age and is more frequent among those dying from alcohol-related or external causes of death, among non-married individuals, and among Russian, Polish and other ethnic groups. 10 Although self-reported education in the census is also prone to reporting errors, using the same source of information for both the deceased and population exposures makes it possible to avoid the well-known numerator-denominator bias. [3][4][5][6]
One of the main reasons for the changing bias in the unlinked mortality data for Estonia may be the spread of postsecondary non-higher education. It is possible that a substantial share of third-party informants assumed this category to be part of tertiary (high) education. This misclassification would explain the notable overestimation of mortality in the high education group, as reflected by the unlinked data. Finally, the rise in the proportion of the unknown category from almost 0% to 13-14% in the period 2012-2015 suggests a decreasing quality of reporting of this information on death records.
Applying a simple proportional redistribution of unlinked deaths across the three educational groups is a limitation of the study. However, sensitivity analyses have shown that applying such an assumption leads to more plausible results than the alternative solution of assigning all deaths with unknown education to the lowest educational category. We were not able to test more sophisticated multiple imputation methods, which require access to individual-level data.
Finally, this study used education to rank socio-economic groups and did not provide any insights into the causal impact of education on mortality.
The results of this study have important implications for interpreting past and emerging evidence on mortality differentials based on unlinked data. Our findings warn that a small numerator-denominator bias observed at some point in time cannot guarantee that such a pattern will persist in the future. The misreporting of education seems to be country-specific, indicating that the numerator-denominator bias can take different forms in various contexts. This conclusion is supported by markedly different evidence from Lithuania for the period 2001-2004, which revealed a very important effect of the numerator-denominator bias on education-specific mortality estimates based on unlinked data. 10 Therefore, the finding suggesting that more advanced Gini-type measures are less prone to the numerator-denominator bias may reflect Estonian specifics and may not apply to other countries. Scientific and policy efforts should be reinforced by informing policy-makers about the risks of using unlinked data and highlighting the need for more reliable evidence based on registry- or census-linked data.