
Can data from primary care medical records be used to monitor national smoking prevalence?
  1. Lisa Szatkowski1,
  2. Sarah Lewis2,
  3. Ann McNeill2,
  4. Yue Huang2,
  5. Tim Coleman1
  1. 1Division of Primary Care, Queen's Medical Centre, UK Centre for Tobacco Control Studies, University of Nottingham, Nottingham, UK
  2. 2Division of Epidemiology and Public Health, Nottingham City Hospital, UK Centre for Tobacco Control Studies, University of Nottingham, Nottingham, UK
  1. Correspondence to Lisa Szatkowski, Division of Primary Care, Queen's Medical Centre, UK Centre for Tobacco Control Studies, University of Nottingham, Derby Road, Nottingham NG7 2UH, UK; lisa.szatkowski{at}


Background Data from primary care records could potentially provide more comprehensive population-level information on smoking prevalence at lower cost and in a more timely fashion than commissioned national surveys. Therefore, we compared smoking prevalence calculated from a database of primary care electronic medical records with that from a ‘gold standard’ national survey to determine whether or not medical records can provide accurate population-level data on smoking.

Methods For each year from 2000 to 2008, the annual recorded prevalence of current smoking among patients in The Health Improvement Network (THIN) Database was compared with the ‘General Household Survey (GHS)-predicted prevalence’ of smoking in the THIN population, calculated through indirect standardisation by applying age-, sex- and region-specific smoking rates from the corresponding GHS to the THIN population.

Results Completeness of smoking data recording in THIN improved steadily in the study period. By 2008, there was good agreement between recorded smoking prevalence in THIN and the GHS-predicted prevalence; the GHS-predicted prevalence of current smoking in the THIN population was 21.8% for men and 20.2% for women, and the recorded prevalence was 22.4% and 18.9%, respectively.

Conclusions The prevalence of current smoking recorded within THIN has converged towards that which would be expected if GHS smoking rates are applied to the THIN population. Data from electronic primary care databases such as THIN may provide an alternative means of monitoring national smoking prevalence.

  • Smoking
  • primary healthcare
  • medical records
  • smoking RB


Tobacco smoking remains the most preventable threat to public health worldwide,1 with at least half of all smokers, and possibly as many as two-thirds, dying prematurely as a result of their behaviour.2 Many governments have set smoking prevalence targets; the Department of Health, for example, aims to reduce the prevalence of smoking among adults in England to 10% or less by 2020.3 In many countries, the main sources of smoking prevalence statistics for monitoring progress towards these targets are national annual surveys, such as the General Household Survey (GHS) in Britain (renamed the General Lifestyle Survey in 2008), which samples approximately 15 000 adults each year.4 Arguably, there are limitations to the use of the GHS in its current format as a ‘gold standard’ for monitoring changes in smoking prevalence: it is not adequately powered to detect small annual changes in smoking prevalence,5 particularly at regional level,4 and results are not published until at least a year after survey completion. Although the new Integrated Household Survey6 will question over 500 000 adults annually and publish results within 6 months, it will be even more expensive than the GHS. Primary care records contain routinely collected information on patients' smoking and could, potentially, provide comprehensive population-level data on smoking prevalence at lower cost; their use should therefore be investigated.

In Britain, all members of the population are entitled to register with a general practitioner (GP), and care is free at the point of access. Thus, smoking data recorded in primary care may be available for potentially the whole population, and such data may be useful for monitoring smoking prevalence. Previous studies have highlighted that primary care smoking data in Britain are incomplete; in 1996, primary care records within the General Practice Research Database could identify only 80% of the expected number of smokers at that point in time.7 However, the introduction of a voluntary pay-for-performance general practice contract in 2004,8 taken up by almost all GPs,9 has incentivised and improved the recording of patient smoking status.10 ,11 GPs in Britain are now, for example, required to record their patients' smoking status at least every 27 months (every 15 months for patients with specified chronic diseases). Consequently, we have compared the prevalence of smoking derived from a subset of UK practices that contribute to a large dataset of primary care electronic medical records (The Health Improvement Network (THIN)) with the prevalence from the GHS to assess whether or not primary care data from this particular dataset can be used to monitor national trends in smoking behaviour. We also consider the implications of our findings for the potential utility of routinely collected primary care data that are not collated within research databases such as THIN.


THIN is a database containing the primary care medical records of over 6 million patients from 446 general practices throughout the UK, all of which use the INPS Vision12 practice management system. The dataset represents approximately 6% of the UK population13 and is broadly nationally representative in terms of patient demographic characteristics (figure 1). THIN is, however, slightly less representative of more deprived social groups; each year, approximately 5% fewer deaths are recorded in THIN than expected, which may be attributable to over-representation of more affluent, healthy patients.15

Figure 1

Comparison of the structure of The Health Improvement Network (THIN) population on 1 July 2008 with the Office for National Statistics (ONS) mid-2008 UK population estimate.14

To enable comparison of smoking prevalence estimates from THIN with those of the British GHS, THIN practices in Northern Ireland were excluded from this study. For each year from 2000 to 2008, all patients were identified from the THIN dataset who were aged 16 years or older and registered with a practice on an index date of 1 July of that year. Patients who had registered with a practice within the previous 3 months were excluded from this analysis (the GP contract requires that the smoking status of newly registering patients is recorded within 3 months for this recording to be financially rewarded).
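The eligibility rules above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names, the approximation of age from year of birth alone, the 1 April cut-off used for "the previous 3 months" and the optional deregistration date are all assumptions made for the example. The age threshold is taken as 16 years or older, as reported in the Results.

```python
from datetime import date


def index_date(year):
    """Index date used for each study year: 1 July."""
    return date(year, 7, 1)


def is_eligible(year_of_birth, registration_date, year, deregistration_date=None):
    """Return True if a patient belongs to the study population for `year`.

    Assumptions (not stated in the source):
    - age is approximated as year - year_of_birth;
    - "registered within the previous 3 months" means registered
      after 1 April of the study year;
    - a patient with a deregistration date on or before the index
      date is no longer registered.
    """
    idx = index_date(year)
    cutoff = date(year, 4, 1)  # 3 months before the 1 July index date
    aged_16_plus = (year - year_of_birth) >= 16
    registered_long_enough = registration_date <= cutoff
    still_registered = deregistration_date is None or deregistration_date > idx
    return aged_16_plus and registered_long_enough and still_registered


# Hypothetical patients, for illustration only.
print(is_eligible(1980, date(2005, 1, 1), 2008))   # long-registered adult
print(is_eligible(1980, date(2008, 5, 1), 2008))   # registered too recently
```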

Each patient's year of birth and gender were identified, as well as the Strategic Health Authority within which their GP surgery was located. The prevalence of smoking each year was calculated from the data recorded in medical records. All records of smoking status, identified by relevant Read Codes (a hierarchical dictionary of medical nomenclature16), entered into a patient's notes on or after their registration date were extracted. Patients were classified as current smokers at a given index date if the most recent smoking-related entry in their medical records prior to this index date identified them as such. All patients whose most recent Read Code did not indicate that they were a current smoker, as well as those patients with no smoking information recorded in their notes, were assumed not to be current smokers at that point in time. Previous authors have shown that the majority of patients with missing smoking records in THIN and the General Practice Research Database are either ex- or non-smokers,7 ,17 and thus, we feel that this assumption is valid. The recorded prevalence of ex-smoking and never smoking each year was also calculated by identifying patients whose most recent smoking Read Code implied these smoking behaviours.
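The classification rule described above might look like this in Python. It is a sketch under stated assumptions: each record is taken to be an (entry date, status) pair in which the status has already been derived from a Read Code, and the status labels "current", "ex" and "never" are hypothetical; no real Read Codes are reproduced.

```python
from datetime import date


def smoking_status_at(records, registration_date, index):
    """Most recent smoking status recorded before the index date.

    `records` is a list of (entry_date, status) pairs, with status one of
    "current", "ex" or "never" (hypothetical labels standing in for Read
    Codes). Only entries made on or after registration and strictly
    before the index date are used. Returns None if nothing is recorded.
    """
    usable = [(d, s) for d, s in records if registration_date <= d < index]
    if not usable:
        return None
    return max(usable, key=lambda rec: rec[0])[1]


def is_current_smoker(records, registration_date, index):
    """Patients with no usable record are assumed not to be current smokers."""
    return smoking_status_at(records, registration_date, index) == "current"


# Hypothetical patient: recorded as a smoker in 2003 and as an ex-smoker in 2007.
history = [(date(2003, 3, 1), "current"), (date(2007, 9, 15), "ex")]
registered = date(2001, 1, 1)
print(is_current_smoker(history, registered, date(2005, 7, 1)))
print(is_current_smoker(history, registered, date(2008, 7, 1)))
```

With this rule the same patient counts as a current smoker at the 2005 index date but not at the 2008 index date, and a patient with an empty history is never counted as a current smoker.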

We wanted to compare the adequacy of smoking data recording within THIN with that obtained via the GHS. However, a direct comparison of these data sources would not have been appropriate because THIN has a slightly different demographic structure from the national population; even if THIN contained valid smoking data for all patients within this database, one would expect smoking rates based on THIN data to differ from GHS estimates of national smoking prevalence. Therefore, we used the following standardisation procedure to calculate what the smoking prevalence among THIN patients would be if GHS smoking rates applied within each demographic stratum of the THIN population (called ‘GHS-predicted prevalence’) and compared this with the prevalence recorded in THIN. For each year between 2000 and 2008, region-, age group- and sex-specific rates of current, ex-smoking and never smoking were identified from the relevant GHS, weighted for non-response to give nationally representative indicators of smoking behaviour. These rates were applied to strata of the THIN population (similarly defined by age group, sex and region) at each index date using indirect standardisation18 to produce annual GHS-predicted prevalence estimates for current, ex-smoking and never smoking; these predicted prevalence estimates were then compared with the recorded prevalence figures.
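The indirect standardisation step can be illustrated with a short Python sketch: the expected number of smokers is the sum, over strata, of the THIN stratum size multiplied by the GHS smoking rate for that stratum, and the predicted prevalence is this sum divided by the total THIN population. The function name, the stratum keys and all numbers below are invented for illustration and are not study data.

```python
def ghs_predicted_prevalence(thin_counts, ghs_rates):
    """Predicted prevalence from indirect standardisation.

    thin_counts: {stratum: number of THIN patients in the stratum}
    ghs_rates:   {stratum: GHS smoking rate (proportion) for the stratum}
    A stratum is an (age_group, sex, region) tuple; every stratum in
    thin_counts must have a rate in ghs_rates.
    """
    expected_smokers = sum(n * ghs_rates[stratum] for stratum, n in thin_counts.items())
    total_patients = sum(thin_counts.values())
    return expected_smokers / total_patients


# Illustrative values only, not taken from THIN or the GHS.
thin_counts = {
    ("16-24", "M", "London"): 1000,
    ("25+", "M", "London"): 3000,
}
ghs_rates = {
    ("16-24", "M", "London"): 0.30,
    ("25+", "M", "London"): 0.20,
}

predicted = ghs_predicted_prevalence(thin_counts, ghs_rates)
print(f"GHS-predicted prevalence: {100 * predicted:.1f}%")
```

In this toy example 300 + 600 = 900 smokers are expected among 4000 patients, a predicted prevalence of 22.5%. The same function could be applied separately to current, ex-smoking and never smoking rates.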

To investigate variations in the recording of current smokers between practices, the expected prevalence of current smokers in each practice was calculated in the manner described above, again using age group, sex and government office region as variables in the standardisation procedure. These predicted prevalence estimates were then compared with the proportion of patients in each practice recorded in their notes as current smokers.
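For a single practice, the comparison described above reduces to the ratio of recorded to expected current smokers. The following Python sketch assumes the same stratum structure as the standardisation procedure; the function name and all numbers are hypothetical, and this is not the Stata code used in the study.

```python
def percent_of_expected(recorded_smokers, practice_counts, ghs_rates):
    """Recorded current smokers as a percentage of the expected number.

    practice_counts: {stratum: patients registered at the practice}
    ghs_rates:       {stratum: GHS current smoking rate (proportion)}
    """
    expected = sum(n * ghs_rates[stratum] for stratum, n in practice_counts.items())
    return 100.0 * recorded_smokers / expected


# Illustrative practice (invented numbers): 900 smokers expected, 432 recorded.
practice_counts = {("16-24", "M", "North East"): 1000, ("25+", "M", "North East"): 3000}
rates = {("16-24", "M", "North East"): 0.30, ("25+", "M", "North East"): 0.20}

ratio = percent_of_expected(432, practice_counts, rates)
print(f"{ratio:.1f}% of expected current smokers recorded")
```

A value below 100% suggests that the practice records fewer current smokers than the GHS rates would predict for its patient mix, and a value above 100% suggests more.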

All analyses were completed using Stata V.11.0 (StataCorp).


Patient characteristics

Over 2 million patients aged ≥16 years were alive and registered with a THIN practice at each index date, of whom 49% were men, with a mean age of 48 years (IQR 33–61). The average number of years of medical records available for inspection for each patient increased from 13.9 years for the 2000 population to 15.6 years in 2008. Figure 2 shows the proportion of patients each year for whom, after having inspected all data recorded since their registration with a THIN practice, it was impossible to assign a smoking status.

Figure 2

The proportion of The Health Improvement Network patients with no smoking status recorded in their medical records (all patients aged ≥16 years).

In 2000, 40.5% of men and 28.9% of women had no smoking status recorded in their notes since registering with their practice, improving to 15.1% and 8.0%, respectively, in 2008. In all years, the percentage of patients with no smoking status recorded, as well as the difference in the amount of missing data between men and women, is smaller in older age groups.

The currency of patients' last recorded smoking status has improved over time. In 2000, just 27.0% of patients had their smoking status last recorded within the 2 years prior to the index date of 1 July and 56.7% within 5 years. By 2008, these figures had improved to 45.9% and 68.0%, respectively.
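The currency measure reported above could be computed along these lines. This Python sketch assumes that the denominator is all patients in the study population, including those with no smoking record, which the text does not state explicitly; the function name and dates are hypothetical.

```python
from datetime import date


def percent_recorded_within(last_record_dates, index, years):
    """Percentage of patients whose latest smoking record falls within
    `years` years before the index date.

    last_record_dates: one entry per patient, either the date of the most
    recent smoking record or None if nothing is recorded. Patients with
    None stay in the denominator (an assumption of this sketch).
    """
    cutoff = date(index.year - years, index.month, index.day)
    recent = sum(1 for d in last_record_dates if d is not None and cutoff <= d < index)
    return 100.0 * recent / len(last_record_dates)


# Four hypothetical patients at the 2008 index date.
last_dates = [date(2007, 8, 1), date(2004, 1, 1), None, date(2001, 1, 1)]
idx = date(2008, 7, 1)
print(percent_recorded_within(last_dates, idx, 2))
print(percent_recorded_within(last_dates, idx, 5))
```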

Prevalence of current smokers

Figure 3 shows changes over time in the GHS-predicted prevalence of current smoking compared with the actual recorded prevalence in THIN.

Figure 3

General Household Survey-predicted and recorded adult smoking prevalence in The Health Improvement Network.

The GHS-predicted prevalence of current smoking in the THIN population has declined during the study period, and the actual recorded prevalence of current smoking has converged towards the GHS-predicted prevalence. In 2000, smoking prevalence rates derived from recorded THIN data were 19.9% for men and 19.4% for women (69.6% and 78.0% of the GHS-predicted prevalence respectively). Completeness of recording of smoking data has improved, such that, by 2008, actual male recorded smoking prevalence was 22.4% and GHS-predicted prevalence was 21.8% (for women, the figures were 18.9% and 20.2% respectively). These figures are almost identical to the unstandardised national prevalence estimates derived from the GHS (men 22.5%, women 20.3%). There are variations in the completeness of recording by age, most notably large shortfalls in the recording of current smokers among men between the ages of 16 and 25 years (see supplementary information).
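As a worked check on the 2008 figures quoted above, recorded prevalence can be expressed as a percentage of the GHS-predicted prevalence in the same way as the 2000 figures. The ratios computed below are derived here from the reported percentages and are not themselves reported in the article; the function name is illustrative.

```python
def percent_of_predicted(recorded, predicted):
    """Recorded prevalence as a percentage of GHS-predicted prevalence."""
    return 100.0 * recorded / predicted


# 2008 prevalence figures (per cent) as quoted in the text.
men_2008 = percent_of_predicted(22.4, 21.8)
women_2008 = percent_of_predicted(18.9, 20.2)

print(f"Men:   {men_2008:.1f}% of predicted")
print(f"Women: {women_2008:.1f}% of predicted")
```

On these figures, recorded prevalence in 2008 was roughly 103% of the predicted value for men and roughly 94% for women, compared with 69.6% and 78.0% in 2000.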

These national figures disguise significant variations between general practices. For men and women combined in 2008, the worst performing practice recorded just 48.0% of the expected number of current smokers, while on the other hand, one practice identified 193.0% of the expected number of smokers (IQR: 82.1%–114.5%).

Prevalence of ex- and never-smokers

The recording of ex- and never-smokers in THIN is less complete than that of current smokers, though it has improved over time. In 2000, 35.9% of the GHS-predicted number of ex-smokers were identified from medical records, improving to 79.8% in 2008 (for never-smokers, the figures were 75.0% and 93.0%, respectively). For ex-smoking and never smoking, the completeness of recording was greater in women than in men.


The national smoking prevalence estimates derived from the primary care electronic medical records in THIN have been comparable to those produced by the current gold standard, the GHS, since 2006. This suggests that using the THIN dataset as either an alternative means of monitoring national smoking trends in Britain or to complement national survey data would be valid. The demonstrated comparability between THIN and GHS data suggests that routinely collected smoking information held in all general practices throughout the UK (not just those contributing to THIN) may be useful for monitoring local trends in smoking prevalence, though further work is needed to evaluate the validity of these data for this particular use.

There are several advantages to using THIN over national survey data to monitor national smoking trends: THIN is larger, is released three or four times annually and has a lag of only 3–8 months before clinical data become available. The standard error of the national smoking prevalence estimate derived from THIN is considerably smaller than that derived from the GHS (eg, 0.26 in THIN in 2008 compared with 0.48 in the GHS5), and thus, THIN can potentially provide more precise smoking prevalence estimates nationally and at the level of government office regions. THIN is also much larger than the new Integrated Household Survey and thus again is likely to be able to provide more precise estimates of smoking prevalence than this new survey. Two postcode-level indicators of deprivation, Townsend Index and Mosaic type, are attached to each patient's records in THIN, and thus, the dataset could also be used to monitor progress towards reducing socioeconomic inequalities in smoking prevalence and health. Also, health-related information may be more complete in primary care data than in survey responses, offering a valuable opportunity to use primary care data to investigate relationships between smoking and health outcomes. Conversely, surveys can provide objective detailed contextual information about smoking behaviour, such as the numbers of cigarettes smoked and of quit attempts made; such data may not be available in primary care data, and the two data sources are likely to prove complementary. Datasets of primary care records from other countries are available to researchers,13 and similar methods to those employed here could be used to assess the completeness of smoking status recording and the utility of such data for monitoring national smoking prevalence in these countries.

Our assumption that all patients with no smoking status recorded in their THIN records are not current smokers may lead to the underestimation of smoking prevalence, though, as noted already, other work suggests that this assumption is valid.17 Similarly, the fact that a substantial minority of patients' most recent smoking status was recorded several years before the index date may also bias our estimates of the prevalence of current smoking, as some smokers may have since quit and vice versa. However, this is perhaps not a problem in older patients recorded many years previously as never-smokers, as very few people begin smoking after the age of 25 years.4 It is recognised that reliance upon self-reported measures of smoking behaviour in national household surveys such as the GHS may underestimate smoking prevalence, particularly among younger age groups. Although 16- and 17-year-olds complete the GHS questionnaire in private, this is unlikely to be totally successful in encouraging honest answers, and rates of under-reporting might not be constant over time,4 especially given reductions in the social acceptability of smoking. However, if patients misrepresented their smoking behaviour to their doctor, there could also be a degree of under-reporting in primary care data. Observed individual-level agreement between smoking status recorded in patients' medical records and that ascertained through questionnaires suggests that there are minimal data entry errors in primary care records.19 However, the lack of biochemical data to validate patients' self-reported smoking status in THIN (and similarly in the GHS) means that we cannot be sure whether smoking status records in either data source are a true reflection of reality. It is unlikely, however, that validated smoking outcomes would ever be used routinely in national population surveys due to the expense incurred.

Young adults are slightly less well represented in the THIN dataset, perhaps because this age group is least likely to be registered with a GP. In addition, the practices contributing to THIN marginally under-represent those serving more deprived populations.15 Smoking prevalence is highest among young adults and lower socioeconomic groups,4 and therefore use of THIN data may slightly underestimate national smoking prevalence, but the comparability between THIN-recorded smoking prevalence and the published national prevalence estimates from the GHS suggests that this is not a major issue. However, the differences between predicted and recorded smoking prevalence in some population subgroups (see supplementary information) suggest that primary care data may not be able to provide accurate prevalence estimates for certain subgroups, such as young men. This is perhaps because young men visit their GP relatively infrequently,4 and thus GPs have fewer opportunities to record their smoking status.

General practices that contribute to the THIN dataset undergo assessment to ensure they are using their computer systems correctly, and thus they may not be representative of all British practices. The substantial variation in the completeness of recording in individual practices warrants further investigation and may, at least in part, be explained by differences in the social class structure of their patient populations. The lack of a comparable indicator of social class in the GHS and THIN data means that this could not be used as a variable in the standardisation procedure, though part of the effect of social class is likely to be accounted for by using government office region as a standardisation variable.

A major government-commissioned report in England, the Wanless report, called for the organisations responsible for delivering community health services to make better use of data from primary care to help understand the prevalence of disease risk factors within their local populations.20 THIN practices are not identifiable at a geographical level finer than that of government office regions, and so THIN data cannot be used to monitor local smoking prevalence. Given the observed variation in THIN practices' recording of smoking status, it is not clear whether data from all British practices could be accurate enough for local smoking prevalence monitoring; further research is required. However, the methods used in this study provide a way to compare smoking prevalence estimates from local surveys with data recorded in local GP practices to assess the quality of recorded smoking information. It is possible that practice data in some localities would not be sufficiently complete and, in such areas, practices may need support to optimise their recording of smoking status before their data could be used to monitor smoking prevalence locally. However, once the quality of smoking status recording is deemed acceptable, data from all practices throughout the UK may offer a less costly means of monitoring smoking prevalence than commissioned surveys.

In conclusion, this study shows that primary care medical records data from THIN may be useful for monitoring national smoking prevalence in Britain, and the strengths of the THIN dataset mean that it could potentially complement the use of national surveys. Further work is needed to determine whether primary care data from practices not included in the THIN database are of sufficient quality for monitoring trends in smoking prevalence and, in particular, whether such data are appropriate for monitoring prevalence in smaller localities.

What is already known on this subject

  • The availability of accurate up-to-date estimates of smoking prevalence is vital to allow the effectiveness of tobacco control policies to be monitored.

  • Smoking data recorded by GPs in primary care could possibly be used to monitor smoking prevalence, though historically the quality of recorded smoking information was poor.

What this study adds

  • The completeness of smoking status recording in primary care in the UK has improved over time, likely driven by financial incentives requiring GPs to regularly record this information.

  • Data from primary care may now be suitable for monitoring national smoking prevalence. These data are available at relatively low cost and in a timely fashion, so could complement data collected in national surveys.




  • Funding LS is supported by a Cancer Research UK PhD Studentship (grant number A9166). TC, SL and AM are members of the UK Centre for Tobacco Control Studies, a UK Clinical Research Collaboration (UKCRC) Public Health Research: Centre of Excellence. Funding from British Heart Foundation, Cancer Research UK, Economic and Social Research Council, Medical Research Council, and the National Institute for Health Research, under the auspices of the UK Clinical Research Collaboration, is gratefully acknowledged. The original THIN primary care data were provided by the Epidemiology and Pharmacology Information Core (EPIC), and the data for this study were made available through the National Prevention Research Initiative (NPRI; grant number G0701100). Relevant NPRI funding partners: British Heart Foundation; Cancer Research UK; Department of Health; Diabetes UK; Economic and Social Research Council; Medical Research Council; Research and Development Office for the Northern Ireland Health and Social Services; Chief Scientist Office, Scottish Executive Health Department; The Stroke Association; Welsh Assembly Government and World Cancer Research Fund.

  • Competing interests None.

  • Ethics approval This study was approved by the Leicestershire and Rutland Research Ethics Committee.

  • Provenance and peer review Not commissioned; externally peer reviewed.