Background Research using data from large population-based datasets is often hindered by non-trivial proportions of missing data. Numerous approaches for handling missing data are available, each of which makes important assumptions regarding the mechanism by which the data came to be missing. Using a 2008 extract of the Scottish Care Information-Diabetes Collaboration (SCI-DC), a population-based register of patients with diabetes, we compared four methods for handling missing body mass index (BMI) data in a retrospective cohort study of the association between BMI at date of diagnosis of diabetes and all-cause mortality in patients with Type 2 diabetes.
Methods The appropriateness of the selected missing data approaches was investigated by assessing the likely missing data mechanism. Descriptive analyses and logistic regression were used to investigate whether there were differences in characteristics between people with BMI data available (n = 99,472) and those without BMI data available (n = 117,007). Complete case analysis (CCA), population-mean imputation, stochastic imputation and multiple imputation (MI) methods were applied to deal with missing data in the BMI variable. Cox proportional hazard model coefficients for the association between BMI and all-cause mortality were compared for each missing data method.
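As a rough illustration of how the four approaches differ, the following is a minimal sketch on synthetic data and is not the implementation used in the study. It makes several simplifying assumptions: the BMI values are simulated, missingness is generated completely at random, the imputation model has no covariates, and the pooled quantity is a simple mean rather than a Cox model coefficient. All function names are hypothetical. The MI step combines a bootstrap of the observed values with Rubin's rules to pool the estimates.

```python
import random
import statistics


def complete_case(values):
    """Complete case analysis: keep only the observed values."""
    return [v for v in values if v is not None]


def mean_impute(values):
    """Replace every missing value with the mean of the observed values."""
    mean = statistics.fmean(complete_case(values))
    return [mean if v is None else v for v in values]


def stochastic_impute(values, rng, donors=None):
    """Replace each missing value with a random draw around the mean.

    Assumption: an intercept-only normal model, with the mean and SD
    estimated from `donors` (default: the observed values).
    """
    donors = complete_case(values) if donors is None else donors
    mean, sd = statistics.fmean(donors), statistics.stdev(donors)
    return [rng.gauss(mean, sd) if v is None else v for v in values]


def multiple_impute_mean(values, m, rng):
    """Create m imputed datasets and pool the mean with Rubin's rules."""
    observed = complete_case(values)
    estimates, variances = [], []
    for _ in range(m):
        # Bootstrap the observed values so that each imputation also
        # reflects uncertainty in the imputation model's parameters.
        boot = rng.choices(observed, k=len(observed))
        completed = stochastic_impute(values, rng, donors=boot)
        estimates.append(statistics.fmean(completed))
        variances.append(statistics.variance(completed) / len(completed))
    within = statistics.fmean(variances)      # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance
    total = within + (1 + 1 / m) * between    # Rubin's total variance
    return statistics.fmean(estimates), total


# Synthetic example: 200 simulated BMI values, about 40% set to missing.
rng = random.Random(0)
bmi_true = [rng.gauss(30.0, 6.0) for _ in range(200)]
bmi = [None if rng.random() < 0.4 else v for v in bmi_true]

cca = complete_case(bmi)
mean_filled = mean_impute(bmi)
stochastic_filled = stochastic_impute(bmi, rng)
mi_estimate, mi_variance = multiple_impute_mean(bmi, m=20, rng=rng)
```

In this sketch, mean imputation leaves the mean unchanged but shrinks the variance, whereas stochastic imputation and MI add random variation back in. Only MI also carries the uncertainty due to imputation through to the variance of the final estimate.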
Results There were 41,555 deaths among the diabetes cohort between 2001 and 2008. Patients with missing BMI were considerably more likely to have an earlier year of diagnosis (OR before 1995 vs. after 2004 60.29 [95% CI 57.23, 63.51]) and to be ‘Never’ smokers (OR Never vs. Ever 1.08 [1.06, 1.10]). Depending on the missing data method used, a U- or J-shaped relationship between patient BMI and all-cause mortality was observed amongst patients with diabetes. Results from CCA and MI differed markedly amongst patients with a BMI of 20 to <25 kg/m² (CCA HR 1.25 [1.17, 1.33] vs. MI HR 1.01 [0.97, 1.06]). Similarly, MI attenuated the excess mortality observed amongst patients with a BMI of 40 to <45 kg/m² (CCA HR 1.35 [1.20, 1.52] vs. MI HR 1.09 [1.01, 1.18]).
Conclusion Studies using routinely collected data are particularly susceptible to missing data. Initial analyses suggest that the choice of imputation method may strongly influence final model estimates and therefore a study’s conclusions. The selection of methods for handling missing data should be guided by careful preliminary analyses which investigate the most plausible mechanism of missingness.