Measurement confounding affects the extent to which verbal IQ explains social gradients in mortality
  1. Benjamin Chapman1,
  2. Kevin Fiscella2,3,
  3. Paul Duberstein1,2,
  4. Ichiro Kawachi4,
  5. Peter Muennig5
  1. 1Department of Psychiatry, University of Rochester Medical Center, Rochester, New York, USA
  2. 2Department of Family Medicine, University of Rochester Medical Center, Center for Communication and Disparities Research, Rochester, New York, USA
  3. 3Department of Public Health Sciences, University of Rochester Medical Center, Rochester, New York, USA
  4. 4Department of Society, Human Development, and Health, Harvard University School of Public Health, Boston, Massachusetts, USA
  5. 5Department of Health Management and Policy, Columbia University, Mailman School of Public Health, New York, New York, USA
  1. Correspondence to Dr Benjamin Chapman, Department of Psychiatry, University of Rochester Medical Center, 300 Crittenden, Rochester, NY 14620, USA; ben_chapman{at}


Background IQ is thought to explain social gradients in mortality. IQ scores are based roughly equally on Verbal IQ (VIQ) and Performance IQ tests. VIQ tests, however, are suspected to confound true verbal ability with socioeconomic status (SES), raising the possibility that associations between SES and IQ scores might be overestimated. We examined, first, whether two of the most common types of VIQ tests exhibited differential item functioning (DIF) favouring persons of higher SES and/or majority race/ethnicity. Second, we assessed what impact, if any, this had on estimates of the extent to which VIQ explains social gradients in mortality.

Methods Data from the General Social Survey-National Death Index cohort, a US population representative dataset, were used. Item response theory models queried social-factor DIF on the Thorndike Verbal Intelligence Scale and Wechsler Adult Intelligence Scales, Revised Similarities test. Cox models examined mortality associations among SES and VIQ scores corrected and uncorrected for DIF.

Results When uncorrected for DIF, VIQ was correlated with income, education, occupational prestige and race, with correlation coefficients ranging between |0.12| and |0.43|. After correcting for DIF, correlations ranged from |0.06| to |0.16|. Uncorrected VIQ scores explained 11–40% of the Relative Index of Inequalities in mortality for social factors, while DIF-corrected scores explained 2–29%.

Conclusions Two of the common forms of VIQ tests appear to confound verbal intelligence with SES. Since these tests appear in most IQ batteries, circumspection may be warranted in estimating the amount of social inequalities in mortality attributable to IQ.

  • Cognition
  • Mortality
  • Social Epidemiology

Socioeconomic status (SES) and cognitive ability are powerful predictors of health and longevity.1 In fact, cognitive ability, as measured by the IQ, has been hypothesised to account for much of the SES-related health gradient.2 Supporting this hypothesis, correlational studies suggest that those with stronger cognitive skills may be able to better understand medical instructions, navigate social bureaucracies, and avoid accidental death.3–8 IQ also putatively correlates with markers of SES in the 0.4–0.55 range.9

IQ scores are themselves composed of two types of tests. ‘Performance IQ’ tests assess non-verbal reasoning and analytic ability. Tests of ‘Verbal IQ’ (VIQ) reflect language-based reasoning ability and knowledge.10 ‘True’ VIQ is presumed to reflect factors, such as the speed and facility of language acquisition, and abstract and symbolic manipulation of language to solve problems and attain goals.10 These skills assume different forms depending on the socioeconomic environment in which language is learned and reinforced, however.11 ,12 Thus, language-based problem solving may vary markedly across socioeconomic strata.

In psychometric theory, VIQ ‘true scores’ refer to test scores putatively measuring the concept of ‘true VIQ.’13 Under common statistical assumptions, VIQ true scores can be separated from other factors potentially influencing performance on VIQ tests that fall outside the definition of verbal intelligence. Many such factors have been implicated in VIQ test performance, including higher education, middle class SES and majority culture.11 ,12 VIQ tests scores, therefore, run the risk of mixing or confounding verbal intelligence true scores with educationally or socially acquired knowledge, class-differential verbal styles, academic motivation, standardised testing experience and other residue of social position.

More technically, this measurement confounding arises from violations of collapsibility and exchangeability.14 In other words, the association between the latent trait of ‘verbal intelligence’ (unobserved) and scores on a test of VIQ (observed) may not be collapsible across SES strata. Moreover, persons may not be exchangeable on non-intelligence attributes that affect response to items on the test (ie, measures of social standing). This type of systematic measurement error is called Differential Item Functioning (DIF).15 If it is present to a significant degree on a VIQ test, persons of lower SES may achieve artificially deflated VIQ test scores, driving the apparent association between SES and VIQ upward.16 As a result, the extent to which VIQ scores explain social gradients in mortality may appear larger than it actually is.

We examined whether two common forms of VIQ test exhibited DIF related to socioeconomic indicators (education, income, occupational prestige) and race in a US national sample. Although race/ethnicity may be less associated with social class in European countries, it is often considered a dimension of social stratification in the USA. We then compared SES correlations with VIQ scores, adjusted and unadjusted for DIF. Finally, we examined how much of the association between SES factors and all-cause mortality could be explained by VIQ scores, with and without correction for DIF. Our goal was to test a central premise of current research—that cognitive ability explains social patterns in mortality.


Sample and design

We used data from the General Social Survey (GSS), an annual nationally representative sampling of US population social practices and attitudes. Conducted by the National Opinion Research Center at the University of Chicago, the GSS uses a multistage probability sampling of non-institutionalised adults age 18 years and over, with response rates from 70% to 82% in any given year,17 yielding demographically identical annual samples. The GSS records age, gender, race/ethnicity, respondent occupation, income and years of education on the basis of face-to-face interviews with subjects. The Gallup-Thorndike Test of Verbal Intelligence18 (hereafter Gallup-Thorndike) was administered during these interviews to one-third to one-half of the sample randomly selected during the years 1978, 1982, 1984, 1987–2000. We used data for 9381 persons with complete data for all variables of interest. Those lacking data (usually an SES indicator) were more likely female, younger, minority, and in worse self-rated health (p<0.001); the resulting analytic sample was still broadly similar to that of the USA in 2000.19 The second VIQ test, the Wechsler Adult Intelligence Scales—Revised Similarities Test (WAIS-R Similarities), was administered in 1996 and yielded an analytic sample of 2444, by design demographically comparable to the broader Gallup Thorndike sample.


Occupation was coded using the Socioeconomic Index (SEI), a continuous measure of occupational prestige, based on US Census information. Income was calibrated to 1990 US dollars. The Gallup-Thorndike18 originally comprised 20 items taken from the Institute for Educational Research (IER) Intelligence Scale CAVD.20 In the GSS, 10 of these items were administered in person by an interviewer. Tests of vocabulary are presumed to assess word familiarity, and also (1) concept formation (without which the correct definition cannot be given) and (2) the ability to deduce meaning of unfamiliar words based on known roots or syllables, using answer choices provided.21

The second VIQ test, the WAIS-R Similarities test,22 presents persons with successively more difficult questions about how two different things are alike. For instance, an easy question might be ‘how are a fly and a mouse related?’ A completely correct response, such as ‘they are both animals’, receives two points. A response that is correct but does not capture the similarity at the most abstract level (eg, ‘they both have eyes’) receives one point. The WAIS-R manual contains detailed guidelines for scoring responses as 0, 1, or 2.22

Vital status through 2008 was ascertained from the National Death Index. The validity of the National Death Index is typically high, with matching certainty arising from social security numbers and the additional identifiers in the GSS reaching 99.8%.23 Further details on the GSS-National Death Index matching are available.17


Occupational prestige, income, and education were scaled by Relative Index of Inequalities (RII). The RII scores the person of highest standing on a social dimension as 0, and the lowest as 1.24 ,25 A 1 unit change in regression models, therefore, is interpretable as a relative risk, but a specific kind: the risk at the absolute bottom, relative to the absolute top, of a distribution.
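As an illustration of this scaling, the following is a minimal Python sketch of one common way to compute RII scores, the ridit (midpoint of the cumulative population share). The function name `rii_scores` and the example data are hypothetical, and the exact procedure used in the study may differ; with midpoint scoring the extremes approach, rather than exactly equal, 0 and 1.

```python
from collections import Counter

def rii_scores(values):
    """Return a ridit score for each value: the midpoint of the cumulative
    population share held by that value, ranking from highest standing
    (score near 0) to lowest standing (score near 1)."""
    n = len(values)
    counts = Counter(values)
    ridit = {}
    cumulative = 0
    for value in sorted(counts, reverse=True):  # highest standing first
        share = counts[value]
        ridit[value] = (cumulative + share / 2) / n
        cumulative += share
    return [ridit[v] for v in values]

# Hypothetical years of education for five respondents (assumed data)
years_of_education = [8, 12, 12, 16, 20]
scores = rii_scores(years_of_education)
```

Tied respondents share one score, so a coarse indicator such as years of education yields a small number of distinct RII values.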

Item Response Theory (IRT) analyses of the Gallup-Thorndike were conducted with the Rasch model26 and WAIS-R similarities analyses used the graded response model.27 The Rasch model is formally equivalent to a mixed-effect logistic regression treating test items as repeated measures within person,28 and estimating the probability of success or ‘difficulty parameter’ for each item independently of an examinee's standing on the latent trait (random effect). The graded response model is an extension for ordered responses analogous to the extension from a binary to ordered logistic model (we relaxed the proportional odds assumption). The online supplementary material provides technical details of these IRT models and DIF analysis.29
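The response functions of these two models can be sketched in Python as follows. This is a simplified illustration, not the estimation code used in the study: the function names are hypothetical, the graded response sketch uses a single discrimination parameter per item, and no model fitting is shown.

```python
import math

def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def rasch_probability(theta, difficulty):
    """Rasch model: probability of a correct answer for a person with
    latent trait `theta` on an item with the given difficulty."""
    return logistic(theta - difficulty)

def graded_response_probabilities(theta, thresholds, discrimination=1.0):
    """Graded response model: probabilities of each ordered score
    (0, 1, ..., len(thresholds)), with `thresholds` in ascending order."""
    # P(score >= k) for k = 0 .. K+1, with the two trivial bounds added
    at_least = [1.0]
    at_least += [logistic(discrimination * (theta - b)) for b in thresholds]
    at_least += [0.0]
    return [at_least[k] - at_least[k + 1] for k in range(len(thresholds) + 1)]

# A person of average ability on an item of average difficulty
p_correct = rasch_probability(0.0, 0.0)
# Probabilities of scoring 0, 1 or 2 on a Similarities-style item
# (threshold values are assumed for illustration)
p_scores = graded_response_probabilities(0.0, [-1.0, 1.0])
```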

Briefly, we examined DIF related to race, SEI, household income and education, as well as age and gender (which are correlated with SES) using interaction terms28 in three increasingly stringent steps. In step 1, we screened social factor interaction terms separately for each item. In our second step, the model adjusted for all previously identified sources of DIF for a single item simultaneously. In the third stage, we adjusted for DIF factors across all items simultaneously. At each step, we retained those that were significant and met a DIF effect size threshold such that the item's difficulty was 30% easier at one end of a sociodemographic dimension than at the other, irrespective of VIQ true score. Latent trait scores for each test were then estimated from IRT models with and without this final set of DIF interaction terms.
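To make the idea of a DIF interaction term concrete, the sketch below adds a covariate-dependent shift to the difficulty of a Rasch item. The parameter names and the shift of 0.8 logits are assumptions chosen for illustration; the study's effect size threshold is defined in its supplementary material and is not reproduced here.

```python
import math

def success_probability(theta, difficulty, dif_shift=0.0, covariate=0.0):
    """Rasch success probability with a uniform-DIF interaction term.

    covariate: social position, scaled 0 (most advantaged) to
               1 (least advantaged); an assumed coding.
    dif_shift: extra item difficulty, in logits, per unit of covariate.
    """
    effective_difficulty = difficulty + dif_shift * covariate
    return 1.0 / (1.0 + math.exp(-(theta - effective_difficulty)))

# Two people with the same latent verbal ability (theta = 0)
theta = 0.0
p_top = success_probability(theta, 0.0, dif_shift=0.8, covariate=0.0)
p_bottom = success_probability(theta, 0.0, dif_shift=0.8, covariate=1.0)
```

With no DIF (`dif_shift=0.0`) the two probabilities are identical; any nonzero shift means the item behaves differently for people of equal ability, which is what the interaction terms are screening for.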

We examined impact of DIF on SES-VIQ associations via Pearson correlations between each SES factor and VIQ scores unadjusted and adjusted for DIF. We computed the absolute difference (runadj−radj), as well as relative difference (runadj/radj) between DIF adjusted and unadjusted score correlations.30 We also estimated the association of DIF-corrected and uncorrected VIQ scores with mortality using Cox proportional hazards models with attained age as time scale and GSS baseline age as point of entry into the risk set,31 ,32 fitting three models for each SES factor. Each model included gender as a covariate with time-varying hazards, based on preliminary proportionality analysis. The first model estimated the SES factor's RII, or the HR for those at the most disadvantaged, versus advantaged end of the distribution. A second model then added VIQ scores unadjusted for DIF and computed the excess hazard explained by these scores as (HRunadjusted−HRadjusted)/(HRunadjusted−1), with 95% CIs obtained via bootstrap (1000 replicates). A third model then controlled for DIF-adjusted VIQ scores, again computing the change in estimate.
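The two summary quantities described above can be written out as a short Python sketch. The function names and input values are hypothetical, and the use of absolute correlation magnitudes is an assumption; the bootstrap CIs are omitted because they require refitting the Cox models in each replicate.

```python
def correlation_change(r_unadjusted, r_adjusted):
    """Absolute and relative difference between the SES correlation with
    DIF-unadjusted scores and the correlation with DIF-adjusted scores."""
    absolute = abs(r_unadjusted) - abs(r_adjusted)
    relative = abs(r_unadjusted) / abs(r_adjusted)
    return absolute, relative

def explained_fraction(hr_ses_only, hr_ses_with_viq):
    """Share of the excess hazard for an SES factor that is removed when
    VIQ scores are added to the model:
    (HR_unadjusted - HR_adjusted) / (HR_unadjusted - 1)."""
    return (hr_ses_only - hr_ses_with_viq) / (hr_ses_only - 1.0)

# Hypothetical values, not taken from the study's tables
abs_diff, rel_diff = correlation_change(0.40, 0.10)
fraction = explained_fraction(2.0, 1.6)
```

In this made-up example, an SES hazard ratio falling from 2.0 to 1.6 after adding VIQ scores corresponds to 40% of the excess hazard being explained.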


Table 1 shows the sample demographics for the Gallup-Thorndike sample (left) and the WAIS-R similarities subset of that sample (right). With respect to VIQ tests, assumptions underlying IRT models appeared to be satisfactory.33 For the Gallup-Thorndike, of the 40 interaction terms involving race, SEI, household income and education, 21 remained statistically significant and met the effect size criteria by the end of the three-stage screening. For the WAIS-R similarities test, of the 80 possible interactions, 14 remained significant and met the effect size threshold at the end of the third stage. Social factor DIF favoured white race and higher education, occupation and income. Additional age-related and gender-related DIF was observed on both tests, although the pattern did not consistently favour one gender or younger versus older persons. Online Supplementary table S1 lists the sources of DIF by item for each test. Social DIF seemed more apparent on vocabulary (9 out of 10 items) than similarities (5 out of 8 items).29

Table 1

Demographic composition of the analytic sample: 1978–2002 General Social Survey linked to the 2008 mortality via the National Death Index

Table 2 reports the correlations between SES factors and VIQ test scores corrected and uncorrected for DIF. SES correlations with Gallup-Thorndike scores unadjusted for DIF were 0.16–0.33 larger in absolute magnitude, and 2.8–4.4 times larger in relative magnitude, than correlations with DIF-adjusted scores. For the WAIS-R Similarities, absolute differences in SES correlations ranged from 0.06 to 0.24 and relative differences from 2.0 to 2.6. DIF-adjusted correlations between SES factors and VIQ indicators fell outside the 95% CI of correlations with non-adjusted scores for all SES indicators on both tests.

Table 2

Pearson correlations between SES, race, and verbal IQ test scores: 1978–2002 General Social Survey linked to the 2008 mortality via the National Death Index

Table 3 shows the RII as a HR for mortality for each social factor. Minority race exhibited non-proportional hazards (diminishing risk over the lifecourse), so estimates are presented at age 50 years. Table 3 also shows the change in estimate observed when controlling for latent trait scores adjusted and unadjusted for DIF. VIQ scores adjusted for DIF accounted for smaller, but non-zero, portions of social inequalities in mortality. For the Gallup-Thorndike, the change in estimate arising from corrected scores fell outside of the CI of that for uncorrected scores across three of four social factors. For the WAIS-R similarities test, the same pattern arose, but with wider CIs. Table 4 shows the RII for the WAIS-R similarities and Gallup-Thorndike scores with and without SES-corrected DIF. DIF-corrected scores showed smaller RIIs, with no appreciable difference in the proportion explained by SES. The latter quantity evidenced a very wide CI encompassing 0 in all cases. Sensitivity analyses revealed linearity in the log hazard for all factors, no VIQ social factor interactions or proportionality violations, nearly identical results excluding deaths within the first year, and comparable results with 2-parameter IRT models.

Table 3

Social inequalities in mortality explained by biased and unbiased VIQ test scores

Table 4

VIQ Inequalities in Mortality Explained by SES Indicators


Across two VIQ tests, we found DIF favouring persons of higher SES and/or majority race/ethnicity group. Correcting for this DIF reduced correlations between VIQ scores and educational attainment, occupational status and income, as well as VIQ differences between African–Americans and Caucasians. In turn, VIQ scores adjusted for DIF explained smaller amounts of social inequalities in mortality.

Some have argued that intelligence, rather than SES, is the fundamental cause of differentials in mortality.2 This assertion is supported by many findings that cognitive ability test scores are substantial confounders of SES mortality risk.34 Our findings suggest that DIF-corrected VIQ scores had slightly less association with SES and mortality than uncorrected ones, so a portion of the predictive power of VIQ may arise from indirect SES variance captured by VIQ test scores.

It is important to note, however, that small social differentials in VIQ still existed even when DIF was controlled. Accordingly, DIF-corrected VIQ scores continue to explain a modest portion of social gradients in mortality. This would suggest that VIQ is somehow involved in social inequalities in mortality, albeit to a smaller extent than has been presumed.

Environmental exposures35 and early malnutrition36 have documented effects on brain development and cognitive ability, and it is plausible, if not likely, that persons scoring lower on IQ tests, consequently, are challenged with respect to school performance, occupational advancement and earnings.37 Thus, our data indeed suggest a legitimate—and probably reciprocal—association between VIQ and SES. Given the importance of this issue for policy, the critical question is not whether there is a link, but exactly how much measurement inaccuracy inflates our current estimates.

Specifically, DIF observed here may be explained by numerous factors affecting IQ test performance that are associated with social disadvantage. These include achievement motivation,38 greater test performance anxiety and stress,39 ,40 fear that poor test scores will be used to perpetuate stereotypes about class and intelligence,41 ,42 lack of familiarity with test content among participants from lower SES and/or racial/ethnic minority subcultures,11 different norms for, or uncertainty in, approaching test problems,43 ,44 use of different dialects, distrust of examiners administering the test,45 ,46 less familiarity with testing,47 a lower reading level,48 ,49 and poorer test-taking skills.50 The difficulty of disentangling verbal intelligence from factors relating to culture and academic achievement has been reported for some time.10 Nevertheless, the fairness of IQ tests across SES is justified, in part, by reports that IQ tests with unknown degrees of SES DIF predict SES outcomes.37 Such justifications may require reconsideration if VIQ tests confound Verbal IQ and SES to a non-trivial degree.

Our results must be interpreted with a balanced understanding of strengths and limitations. First, while these considerations suggest that VIQ tests capture educational and other SES variance, an important parallel argument has been offered: years of education, perhaps the most common index of SES, might actually measure some form(s) of intelligence, because cognitive abilities are generally required to achieve higher levels of education.51 From this viewpoint, adjusting any type of VIQ score for education-related DIF corrects for an IQ proxy and, thus, is an overcorrection. However, since ‘years of education’ is not a multi-item test score, IRT analyses cannot examine the issue. One future solution may be to use multi-item tests of academic achievement as a measure of education amenable to traditional IRT approaches and, thus, potentially separable from various forms of IQ. Second, VIQ is just one of two components of general IQ scores. Tests of the other component, Performance IQ, have been suggested by many,52 ,53 but not all,11 ,12 to avoid mixing SES with cognitive ability measurement. In this regard, many in the cognitive epidemiology community have begun to focus on measures of Performance IQ, including tests of reaction time or processing speed, as the key cognitive abilities predictive of mortality.8

Third, we examined only two common tests of VIQ. Although we did not study other tests, these two tests correlate highly with other VIQ tests (ie, 0.7 to 0.8),10 ,54 and with general IQ scores.10 ,21 Thus, we suspect that other VIQ tests, and general IQ scores, may be susceptible to some extent to this phenomenon. However, these results may or may not generalise to non-cognitive psychological tests, such as personality measures, which may also be vulnerable to DIF and deserve study in their own right. It is also important to remember that SES is multidimensional, that some dimensions of SES might be more vulnerable to DIF than others, that different dimensions of SES may have differential associations with mortality, and that these associations may vary at different points in the lifespan.

Although our analysis addresses these concerns for three common indicators of SES measured once, the use of other indicators would be helpful, such as the quality of education received or family social position. Longitudinal studies could examine the extent to which cognitive abilities at various points in the lifespan mediate prior SES-related health risks. Performance IQ, and/or tests based on theories of multiple intelligences,55 may contribute better to our understanding of the inter-relationships between class, intelligence and health.

Ultimately, most IQ batteries used in epidemiologic study include vocabulary and/or similarities in VIQ tests. Thus, the behaviour of these tests will be transmitted to general or composite IQ scores, upon which many conclusions are based. If the other tests in the battery do not evidence social-factor DIF, overestimation of IQ-SES associations will be more attenuated. However, the number of other tests in the battery exhibiting similar DIF will dictate the degree of overestimation, and this is an unknown. Our findings thus constitute a ‘proof of principle’ suggesting care in interpreting data on IQ and social gradients in mortality.

What is already known on this subject

  • General IQ scores are thought to partially explain social gradients in mortality.

  • General IQ scores are composed of Performance IQ, and Verbal IQ (VIQ) tests.

  • VIQ tests are suspected to confound true cognitive ability with socioeconomic status (SES).

  • Measurement error may lead to overestimates of the extent to which IQ explains social inequalities in mortality.

What this study adds

  • Two common types of VIQ tests exhibit differential item functioning favouring persons of higher SES in a nationally representative US cohort.

  • High correlations between VIQ scores and SES are inflated due to differential item functioning.

  • Correction for differential item functioning reduces the explanatory role of IQ in social inequalities in mortality.


Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.


  • Contributorship statement BC, KF, PD, IK, PM: contributed to the conception and design of the work, and interpretation of data; contributed to drafting the work and critically revised it for important intellectual content; final approval of version to be published. BC, PM: contributed to the acquisition and analysis of data.

  • Funding This work was supported by US National Institutes of Health grants RC2MD004768, R01AG044588, and K08AG031328.

  • Competing interests None.

  • Patient consent No.

  • Ethics approval Institutional Review Board for the General Social Survey.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement The data are publicly available through the General Social Survey.