Background Clinical prediction models are used for different purposes, but purpose-specific validation is not usually carried out. The ability of a model to discriminate between patients who experience the outcome and those who do not has applications in clinical decision making, screening, and service evaluation. The calibration (goodness-of-fit) of a model is a key indicator of how well a model's predicted outcomes reflect those actually observed. Initial validation of models usually includes assessment of these features, but re-evaluation over time might not be performed.
EuroSCORE is an adult cardiac surgery risk model which has been in use since 1998. It predicts in-hospital mortality and is used for clinical decision making and service evaluation. It is widely acknowledged to have demonstrated ‘calibration drift’, but this has not been formally evaluated in the UK population.
Methods We assessed the performance of EuroSCORE in the Central Cardiac Audit Database (CCAD), covering all NHS cardiac procedures in the UK. Discrimination was tested using the area under the Receiver Operating Characteristic (ROC) curve (AUC). Calibration was assessed with the Hosmer-Lemeshow goodness-of-fit test. In addition, we developed new models with longer-term outcomes using the data, and tested year-on-year model performance.
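The two performance measures named above can be illustrated with a minimal numpy/scipy sketch (this is an illustration only, not the study's actual analysis code; the simulated risks and the 1.8× inflation factor are hypothetical). AUC is computed via the rank-based Mann-Whitney formulation, and the Hosmer-Lemeshow statistic over deciles of predicted risk. The simulation also shows the abstract's central point: a monotone inflation of predicted risks leaves discrimination (AUC) unchanged while destroying calibration.

```python
import numpy as np
from scipy import stats

def auc(y, p):
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    ranks = stats.rankdata(p)              # average ranks handle ties
    n1 = y.sum()
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)

def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow chi-square over quantile groups of predicted risk."""
    edges = np.quantile(p, np.linspace(0, 1, groups + 1))
    idx = np.clip(np.searchsorted(edges, p, side="right") - 1, 0, groups - 1)
    chi2 = 0.0
    for g in range(groups):
        mask = idx == g
        n = mask.sum()
        obs = y[mask].sum()                # observed events in group
        exp = p[mask].sum()                # expected events in group
        chi2 += (obs - exp) ** 2 / (exp * (1 - exp / n))
    return chi2, stats.chi2.sf(chi2, groups - 2)

# Hypothetical cohort: true risks generate outcomes; the "model" predicts
# risks inflated by a constant factor (a crude stand-in for calibration drift).
rng = np.random.default_rng(0)
true_p = rng.uniform(0.01, 0.15, 5000)
y = (rng.uniform(size=5000) < true_p).astype(int)
pred = np.clip(true_p * 1.8, 0, 1)         # same ranking, inflated risk

a_true, a_pred = auc(y, true_p), auc(y, pred)
chi2, pval = hosmer_lemeshow(y, pred)
```

Because the inflation is a monotone transform, `a_pred` equals `a_true` exactly, while the Hosmer-Lemeshow p-value for the inflated predictions is vanishingly small: discrimination alone cannot detect this kind of drift.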
Results A total of 399,314 eligible procedures from 1st April 1998 to 31st March 2011 were included in the analysis. Discrimination of EuroSCORE, assessed by financial year, was consistent across the period (AUC values ranging from 0.788 to 0.818). Model calibration, however, drifted considerably, with a cumulative mortality over-estimate of 10,801 deaths by the end of the period (increasing from 147 over-estimated deaths in 1998 to 1,500 in 2010). This represented a predicted overall mortality rate of 6.0% compared with the observed rate of 3.4%. We will also present findings relating to year-on-year performance of a panel of models tailored to longer-term outcomes in specific procedures.
Conclusion Models that retain accurate discrimination while undergoing calibration drift may remain in use for longer than is appropriate. A model that maintains good discrimination may be useful in a subset of scenarios, but for most purposes good calibration is also crucial. For models developed for multiple applications, purpose-specific validation and recalibration should be considered. Model performance should be appraised in context and not by indicators in isolation.