Just one question: If one question works, why ask several?
- Correspondence to: Professor A Bowling, Department of Primary Care and Population Sciences, University College London, Royal Free Campus, Rowland Hill Street, London NW3 2PF, UK
- QoL, quality of life
- HRQoL, health related quality of life
- GHS, General Household Survey
- VAS, visual analogue scale
While shorter instruments are more limited than longer measures, they have obvious benefits for both research and policy in terms of reduced burden and costs, and ease of interpretation.
A question frequently asked by clinical investigators is why they should use a lengthy, multi-item measurement scale to assess patients’ perceptions of their health, or quality of life, when there is evidence that a measure containing a single, global question is likely to suffice. Researchers may not wish to use lengthy scales because their core questionnaires are already long, the patient group of interest is ill or frail, they wish to minimise the burden on the patient and on the research team, or they simply want a “snapshot” of a topic rather than comprehensive coverage. In such circumstances, single questions have the obvious advantages of brevity and of making fewer demands than multi-item measures on respondents and researchers. Single, global questions have long been used in population surveys to measure health status, quality of life (QoL), and health related quality of life (HRQoL). The two most popular single global health items are self rated health status and self reported limiting, longstanding illness.
SELF RATED HEALTH STATUS
The classic self rated health status item consists of asking respondents to rate their health as “excellent, good, fair, or poor”. Variations of this question have been used in surveys worldwide. Literature reviews on the conceptualisation and measurement of health published by Rand in the USA1,2 and an overview by Stewart and Ware3 reported citations of the self rated health item as early as the 1950s. For example, a version appeared in a US study of occupational retirement4 and in the US Federal Civil Defense Administration Survey, both in the 1950s.5 And a question asking people to rate their general health, followed by a broad question about ill health (including longstanding complaints), was also included in the British government Surveys of Sickness conducted between 1943 and 1952.6
Interest in using broad, subjective health items dates from the mid-20th century, and stemmed from the realisation that mortality was too insensitive to use as a health care outcome indicator in developed countries, that health has physical, mental, social and spiritual dimensions, and that patients’ perspectives of their health and health outcomes should be assessed. This was given impetus by the World Health Organisation’s abstract conceptualisation of health in its 1946 constitution as “a state of complete physical, mental and social wellbeing, and not merely the absence of disease and infirmity”,7 and by subsequent investigations of lay definitions of health, and variations in illness behaviour. Survey researchers found that single items measuring subjective health and wellbeing did not necessarily correlate with medical diagnoses, but the former were held to have greater validity in certain situations (for example, when predicting help seeking behaviour and health service use). Thus, while in the first half of the 20th century the focus of health measurement was often limited to the presence or absence of negative health states and functioning, during the last half of the century there was a shift in focus. There was a trend in survey research towards using the single global health status question to integrate the different dimensions of health emphasised in the WHO definition. 
Consistent with this changing emphasis, the British government’s General Household Survey (GHS) included a version of the single health status item asking respondents to rate their health from 1977 onwards, after deciding to broaden its emphasis from use of services in relation to chronic and acute illnesses, and check lists of symptoms, towards subjective perceptions of health.6 The question, or a similar variant, was included in the US National Health Interview Survey (http://www.cdc.gov/nchs/nhis.htm) and US National Health and Nutrition Examination Survey (http://www.cdc.gov/nchs/nhanes.htm). Most OECD countries now conduct regular population health interview surveys that include this well known single item (http://www.oech.org/publications). It has also been used with satisfactory levels of validity and reliability in the developing world (for example, Tanzania).8
The item was used in the Rand health insurance experiment and medical outcomes study,3 and now forms part of the general health perceptions dimension in the most widely used multi-item, multi-dimensional health status measure of all, the short form-36 (SF-36), which was developed from the initial Rand measures (in both the Rand (http://www.rand.org) and QualityMetric (http://www.sf36.org) versions). In the late 1970s, to increase the question’s discriminative ability, and because of the operation of “social desirability” or “optimism” bias (leading most respondents to rate their health at the positive end of the scale), the developers of the SF-36 and others added a “very good” category in between the “excellent” and “good” response choices; the short form-8 (developed from the SF-36) also includes a “very poor” category at the other end of the scale (http://www.sf-36.org/demos/SF-8). The health status item is popular in social gerontology where the tradition has been to ask respondents to rate their health in relation to their age. This prevents older respondents from assessing their health with reference to younger age groups and thereby perceiving it to be suboptimal.
A substantial body of international research has reported the item to be significantly and independently associated with specific health problems, use of health services, changes in functional status, recovery from episodes of ill health, mortality, and sociodemographic characteristics of respondents.9–17 It is judged to be appropriate for use in population surveys. Investigators of the MacArthur field study of successful aging in the USA, for example, reported that self rated health (poor/bad ratings of health compared with excellent ratings) was a strong and significant predictor of mortality in the general sample, as well as in controlled analyses when the sample was divided into healthy and less healthy cohort samples.17 The question has been shown to discriminate successfully between people in different ethnic groups in Britain (http://www.archive.officialdocuments.co.uk/documents/doh/survey99/hse99), between indigenous and non-indigenous Australians (http://www.abs.gov.au/ausstats/abs@nsf), and between Maori and other New Zealand subgroups (http://www.moh.govt.nz/moh.nsf), although it is unknown whether differences also reflect cultural variations in perceptions and reporting.
However, variations between surveys and nations in the wording of the item, and in the number of response categories, do limit comparative analyses and interpretations. Analysis of data from the Australian National Health Survey has shown that it does have some response instability when repeated in the same questionnaire (before and after other questions about health), although this might also reflect the biasing effect of question order.18 And interpretation of the item at an individual level varies, depending on the referent being used by the respondent. Some people refer to specific health problems and others refer to general physical functioning when replying to the question.19 Other research using anchoring vignettes (fixed descriptions of each response choice level, to increase consistency of respondents’ interpretations of them), has found that their use provides a powerful tool for adjusting for the influence of varying expectations on self ratings of health.20 This can improve comparison of results (for example, older and younger people with the same level of health might rank themselves differently on a health status scale because of varying expectations of health and ability by age).
OTHER POPULAR SINGLE ITEMS
The second most popularly used single item measures disability by asking respondents if they have any “longstanding illness, disability or infirmity”. Respondents who report positively are usually asked if this limits their activities in any way. The British GHS has included this item since 1972. Although the prevalence of longstanding illness has been shown to increase over the past three decades of the GHS, the pattern has fluctuated.21 Thus, while the question has been shown to be associated with health service use, mortality, other indicators of functioning and health, age, socioeconomic status, as well as self rated health,22 the question has posed an enigma for researchers when comparing international data and data over time. A review of the use of the question (and variations of it) reported that it produced estimates that were sensitive to question wording and question order effects, to the mode of data collection (for example, interviewer compared with self administered questionnaires), to the survey process (for example, the collection of data by proxy) and the sponsorship or contextual effects of the survey.23 It was concluded that estimates of disability using such subjective single item questions were less stable for people who were above, than below, state pension age; and unless surveys that use the same single item instrument follow identical survey procedures, the interpretability of any evidence of change over time is seriously compromised. If single item questions are to be used, then attention to clear, simple wording at their design stage is obviously essential.
The visual analogue scale (VAS) is another frequently used single item technique. The method uses lines, the lengths of which are taken to denote the continuum of some experience such as tiredness, pain, nausea, or anxiety. The lines are usually horizontal, 10 cm in length, with stops (“anchors”) at right angles to the line at both extremes, representing the limits of the experience being measured (for example, “severe pain” to “no pain at all”). The respondent places a cross on the line to indicate their state. A quality of life VAS (often called a “QoL uniscale”) is in widespread use, in which the respondent places a cross on a horizontal line to indicate their quality of life during a specified time period (anchored at each end from “lowest quality” to “highest quality”). There are many references in the literature to the high levels of reliability, validity, and sensitivity of this simple VAS technique, including its ability to discriminate between healthy and sick people, its sensitivity to the stages of disease progression, and its ability to predict mortality. Research with cancer patients has also shown that a single item QoL VAS has good to excellent levels of reliability and validity compared with multi-item measures.24
SINGLE COMPARED WITH MULTI-ITEM MEASURES
Single item measures can be used alongside multi-dimensional measures, and are useful as broad summary ratings of diverse aspects of respondents’ health, QoL, and HRQoL, especially where respondents might have improved on one domain (for example, physical functioning) but not on another (for example, mental functioning). They are also generally accepted as useful in the assessment of health transitions (for example, self assessments of health as “better, same, or worse”). It has been proposed that concepts such as health status, QoL and HRQoL, when used as outcome variables, are more appropriately measured with a global single item.25 This is because multi-domain measures confound the dimensionality of these concepts with the multiplicity of their causal sources. Thus, in order that predictor and component variables can be separated, such concepts need to be considered as unidimensional, but with multiple causes. The unidimensional indicator is then logically the dependent variable in analyses, and the predictor variables include the range of pertinent multi-dimensional scale variables (for example, social, psychological, functional ability, etc).
While the single item question can provide valuable information, has the advantage of simplicity, and can be reliable and valid, this comes at the expense of detail. More information may be required on different dimensions of health, QoL or HRQoL, than a single item can provide. Classic measurement theory holds that single items are at a relative disadvantage to multi-item measures because more items produce replies that are more consistent and less prone to distortion from sociopsychological biases, and this enables the random error of the measure to be cancelled out. Hence multi-item measures are more stable, reliable, and precise.
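The claim that adding items cancels out random error can be illustrated with the Spearman-Brown formula from classical test theory, which is not part of this article and is used here only as a sketch. It assumes that all items are parallel (equally reliable and measuring the same thing), and the single item reliability of 0.50 is a hypothetical value chosen for illustration.

```python
def spearman_brown(single_item_reliability: float, n_items: int) -> float:
    """Predicted reliability of a scale built from n parallel items.

    Assumes every item has the same reliability and measures the same
    construct; real items rarely satisfy this exactly.
    """
    r = single_item_reliability
    return n_items * r / (1 + (n_items - 1) * r)


# Hypothetical single item reliability, for illustration only.
ASSUMED_ITEM_RELIABILITY = 0.50

for k in (1, 5, 10, 36):
    predicted = spearman_brown(ASSUMED_ITEM_RELIABILITY, k)
    print(f"{k:>2} items -> predicted reliability {predicted:.3f}")
```

Under these assumptions the predicted reliability rises from 0.50 for one item to about 0.83 for five items and about 0.91 for ten, with diminishing returns thereafter, which is consistent with the argument that quite short scales can recover much of the precision of long ones.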
The careful development work on health status batteries at Rand in the USA has shown that a well constructed multi-item scale (even with just 5–10 items) is more sensitive to changes in patients’ condition over time than any single item measure.26–28 In addition, multi-item measures can provide a complete profile of a multidimensional phenomenon, and can yield information on changes within the individual dimensions measured by the scale (for example, physical functioning, psychological health), although at the cost of increased burden and the risk of asking irrelevant questions. Scales may be preferred to single items because their multiple items are suitable for statistical calculations using summed and weighted scores (for example, pain might be given twice the weight of mobility in the scale score, if it is judged to be twice as important). On the other hand, there is a body of literature in psychology that shows there is little to be gained by complex weightings over simple summated scoring methods.29
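The difference between simple summated scoring and weighted scoring can be sketched as follows. The item names, the 0 to 4 response range, and the weights (pain counted twice) are all hypothetical and are not taken from any published instrument.

```python
# Hypothetical item scores for one respondent (0 = worst, 4 = best).
responses = {"pain": 1, "mobility": 3, "mood": 2}

# Hypothetical weights: pain is judged twice as important as the others.
weights = {"pain": 2.0, "mobility": 1.0, "mood": 1.0}

MAX_ITEM_SCORE = 4  # assumed top response category for every item


def summed_score(items):
    """Simple summated score: every item counts equally."""
    return sum(items.values())


def weighted_score(items, item_weights):
    """Weighted score: each item is multiplied by its weight before summing."""
    return sum(item_weights[name] * value for name, value in items.items())


def rescale_0_100(score, max_score):
    """Express a raw score as a percentage of the maximum attainable score."""
    return 100.0 * score / max_score


simple = rescale_0_100(summed_score(responses), MAX_ITEM_SCORE * len(responses))
weighted = rescale_0_100(
    weighted_score(responses, weights), MAX_ITEM_SCORE * sum(weights.values())
)
print(f"summated: {simple:.2f}, weighted: {weighted:.2f}")
```

For this respondent the two methods give 50.00 and 43.75 on a 0 to 100 scale; the gap arises only because the low pain score is counted twice. Across a sample, the two scores are often highly correlated, which is the point made by the literature on simple summated scoring.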
Few of the initially developed measures of health status or HRQoL were based on established methods of scale construction, although methods of scaling had been developed in the early 20th century (for example, Thurstone, Likert, and Guttman techniques of scaling), stemming, in particular, from the development of occupational and intelligence testing, and the scientific principles of measurement established by mathematical psychologists during the mid-20th century. These led to the establishment of rigorous methods of psychometric evaluation.30,31 Psychometric theory dictates that when a concept cannot be measured directly (for example, health status, QoL, HRQoL), a series of questions that tap different aspects of the same concept need to be asked. Items can then be reduced, using specific statistical methods, to form a scale of the domain of interest, and the resulting scale tested to ensure that it measures the phenomenon of interest consistently (reliability), that it is measuring what it purports to measure (validity), and is responsive to relevant changes over time. The satisfaction of these conditions is most probable when the resulting instrument contains several items to measure the concept of interest to permit testing for internal consistency and to minimise random measurement error. Although developed much earlier, these standards for measurement and scaling were little used in the health field until the 1970s. Thereafter, during the 1980s and 1990s, emerging patient based health status and HRQoL measures were notable for their length, sometimes containing well over 100 questions, their length being dictated by the rigours of the theories and methods of scaling and psychometrics. More recently, burdened by the length of such measures, investigators have welcomed the development of briefer measures. 
Hence there has been a proliferation of increasingly short versions of existing measurement instruments, and more efficient summary measurement scales for use in the burgeoning health outcomes sphere.28
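Testing a multi-item scale for internal consistency, as described above, is commonly done with Cronbach's alpha. The article does not name a specific statistic, so the following is only a minimal sketch of one widely used option, and the response data are invented for illustration.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: five respondents answering a three-item scale.
hypothetical_responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 1],
    [3, 3, 3],
]

alpha = cronbach_alpha(hypothetical_responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

The statistic needs at least two items, which is one reason internal consistency cannot be assessed for a single item measure; its reliability has to be estimated in other ways, such as test-retest comparison.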
One of the earliest and most extensive applications of psychometric theory and methods in the health measurement field began in the 1970s with the refinement of health status measures for Rand’s health insurance study and medical outcomes study.3 One aim of the latter was to construct the best possible, and most efficient, scales for measuring a wide range of functioning and wellbeing. The Rand investigators also realised that new standards of measurement were needed because while traditional testing showed that longer measurement scales were more reliable and valid than shorter scales, they needed to consider respondent burden and the costs of data collection for their large scale studies. They saw the need to compromise between traditionally defined standards of psychometric excellence and newly identified standards of feasibility and practicality; and took as their starting point the issues of which concepts should be measured and how much measurement would be enough for the intended purpose. They attempted to achieve reductions in respondent burden without sacrificing measurement precision below a critical level, and their achievements are apparent when contrasting the number of items in the measuring instruments used in their health insurance experiment with the smaller number of items for their later medical outcomes study (for example, 25 items compared with 10 items to measure physical functioning). These methodological developments have continued, and probably the most well known example of these is the development of the short form-12 (12 items) and the short form-8, both derived from the short form-36 health status questionnaire, as well as the development of summary measures of physical and mental health (see http://www.sf-36.org and http://www.rand.org).
While the shorter versions of these short form scales are inevitably less sensitive than the full versions, their careful and thorough psychometric development and calibration, based on the most powerful items from the parent instruments, has led to their retaining a high degree of accuracy, and hence their increasing popularity in research on clinical outcomes and population health. The longer SF-36 contains several questions to measure each of the eight dimensions that it includes (physical and social functioning, physical and emotional role limitations, mental health, energy/vitality, pain, and general health perceptions), but the SF-8 derived from it contains just one single item to measure each of these same eight domains. Moreover, the health perceptions item is a variation of the long used health status question: “Overall, how would you rate your health in the past four weeks? Excellent, very good, good, fair, poor, or very poor?” Thus the robustness of this item, which has been used, with small variations, as a single item measure of health status in population surveys for over half a century, has at last achieved authoritative acknowledgement.
In conclusion, with the use of more advanced statistical and psychometric techniques, and with awareness of the need to balance psychometric acceptability with practicality, scale developers have responded positively to the frequently asked question: “If one question works, why ask several?” Investigators now have an evidence base to guide their selection of longer or shorter multi-dimensional scales and/or single item measures, depending on the purpose and needs of the study. While shorter instruments are more limited than longer measures, they have obvious benefits for both research and policy in terms of reduced burden and costs, and ease of interpretation.
I would like to thank Professors Emily Grundy, Cathy Sherbourne, and John Ware for helpful information on the history of single item measures.