Chinese SF-36 Health Survey: translation, cultural adaptation, validation, and normalisation
- 1Department of Social Medicine, School of Medicine, Zhejiang University, China
- 2Department of Health Statistics, School of Medicine, Zhejiang University
- Correspondence to: Professor L Li, Zhejiang University School of Medicine, 353 Yan’an Road, Hangzhou, 310006, Zhejiang Province, China
- Accepted 11 September 2002
Study objective: To develop a self administered Chinese (mainland) version of the Short-Form Health Survey (SF-36) for use in health related quality of life measurements in China.
Design: A three stage protocol was followed including translation, tests of scaling construction and scoring assumptions, validation, and normalisation.
Setting: 1000 households in 18 communities of Hangzhou.
Participants: 1688 respondents recruited by multi-stage mixed sampling.
Main results: The assumption of equal intervals was violated for the vitality and mental health scales. The recoded item values were used to calculate scale scores. The clustering and ordering of item means were the same as those of the source version and two other Chinese versions. The items in each scale had similar standard deviations except those in the physical functioning, bodily pain, and social functioning scales. The item-hypothesised scale correlations were identical for all except the social functioning and vitality scales. Convergent validity and discriminant validity were satisfactory for all except the social functioning scale. Cronbach’s α coefficients ranged from 0.72 to 0.88 except 0.39 for the social functioning scale and 0.66 for the vitality scale. Two week test-retest reliability coefficients ranged from 0.66 to 0.94. Factor analysis identified two principal components explaining 56.3% of the total variance. The Chinese SF-36 could distinguish known groups.
Conclusions: This study suggested that the Chinese (mainland) version of the SF-36 functioned in the general population of Hangzhou, China, much as it did in the original American population tested. Caution is recommended in the interpretation of the social functioning and vitality scales pending further studies.
An epidemiological transition from predominantly communicable diseases to chronic diseases has taken place since the middle of the past century.1 In mainland China, long term diseases became the main causes of death among urban residents in the 1950s, and among rural residents in the 1960s.2 The improved longevity suggests that health status can no longer be well assessed by population mortality statistics; there is a consensus to view health in terms of people’s subjective assessment of wellbeing and ability to perform social roles.3–6 The centrality of people’s point of view in monitoring health related quality of life has led to the proliferation of instruments and a rapid development of theoretical literature.7,8
The 36-item Short Form Health Survey is a brief self administered questionnaire that generates scores across eight dimensions of health: physical functioning (PF), role limitations due to physical problems (RP), bodily pain (BP), general health (GH), vitality (VT), social functioning (SF), role limitations due to emotional problems (RE), mental health (MH), and one single item scale on health transition. It has proved useful in monitoring population health, estimating the burden of different diseases, monitoring outcomes in clinical practice, and evaluating treatment effects.9 In 1991, the SF-36 was selected as the instrument in the International Quality of Life Assessment (IQOLA) Project.9–16 At the time of this writing, the SF-36 has been translated and tested in more than 40 countries and normed in 12 countries. Several Chinese versions (American Chinese, Hong Kong) have been developed and tested,17–19 but the acceptability and validity of the instrument among Chinese in mainland China are not known.
In this article, we report the development of a Chinese (mainland) SF-36 Health Survey and the results of psychometric testing among the general population in Hangzhou, the capital of Zhejiang Province, in the southeast of mainland China. We expect the study will stimulate further research to establish the reliability, validity, and application of the SF-36 in various regions of China so that it can eventually be applicable to all Chinese.
Translation of SF-36
The study developed a three stage process to produce a cross culturally comparable translation of the SF-36 with the standard protocol as a reference.20 Firstly, two postgraduates of social medicine independently translated the original SF-36 into written Chinese. The translators had experience in questionnaire translation but were not familiar with the SF-36. The initial versions were administered to a convenience sample of 21 university students. The translators met in person with the principal investigator to agree on a common primary translation. Secondly, two English teachers rated the translation quality. The principal investigator discussed the ratings with the translators and eight professionals experienced in questionnaire surveys and developed a revised version. Finally, the revised version was pilot tested in a convenience sample of 28 subjects. Some minor changes were made to develop a final version.
Multi-stage mixed sampling was conducted to select a representative sample of the general population. During the first stage, six “Jiedao” (a sub-district neighbourhood administration) were selected from Xiacheng district (central area) and Gongshu district (sub-central area) of Hangzhou, three from each. During the second stage, three communities were selected from each “Jiedao”. Equal distance sampling was used. During the third stage, every household in a community had the same probability of being sampled, equal to the fixed sample size n (1000 households) divided by the total number of households in the two districts, N. Family members in a sampled household who were aged 18 and older and able to read were eligible subjects. They were asked to complete a survey by self administration. Myers’ index was used to detect preference for each of the terminal digits from 0 to 9 in reported ages. The theoretical range of Myers’ index is from 0 to 90. An index of 0 represents no heaping and an index of 90 represents a heaping of all reported ages at a single digit.21 The differences between respondents and non-respondents were analysed by univariate methods and the logistic regression model. Fifty seven subjects were randomly sampled for the test-retest study after two weeks.
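The age heaping check can be illustrated with a short sketch. It assumes the conventional blended form of the index (weights 1 to 10 and 9 to 0 applied to two overlapping age ranges, with half the sum of absolute deviations from 10% taken as the index, which yields the 0 to 90 range described above). The function name, the starting age of 20, and the 60 year span are illustrative choices, not details reported by the study.

```python
from collections import Counter

def myers_blended_index(ages, start=20, span=60):
    """Blended digit-preference index for a list of integer ages.

    Hypothetical helper: `start` and `span` are assumptions, not study values.
    `start` must end in 0 so that the weights below line up with the digits.
    """
    if start % 10 != 0 or span % 10 != 0 or span < 20:
        raise ValueError("start and span must be multiples of 10, span >= 20")
    counts = Counter(ages)  # missing ages count as 0
    blended = []
    for digit in range(10):
        # Sum of people whose age ends in `digit`, over the full range ...
        full = sum(counts[a] for a in range(start + digit, start + span, 10))
        # ... and over the range with the first ten years dropped.
        late = sum(counts[a] for a in range(start + 10 + digit, start + span, 10))
        blended.append((digit + 1) * full + (9 - digit) * late)
    total = sum(blended)
    if total == 0:
        raise ValueError("no ages fall inside the chosen range")
    # Half the summed absolute deviation of each digit's share from 10%.
    return 0.5 * sum(abs(100 * b / total - 10) for b in blended)

# Extreme case: every reported age ends in 5, so the index reaches its maximum.
heaped = [25, 35, 45, 55] * 10
print(myers_blended_index(heaped))
```

A value close to 0 indicates little digit preference; the study's reported value of 7.94 lies near the lower end of the 0 to 90 range.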
Scoring of scales
When one half or fewer of the items in a scale were missing, the mean of the non-missing items was used to represent the scale. A scale score was declared missing when more than one half of the items were missing.9,12 Means and standard deviations of all scale scores were calculated.
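The missing data rule above can be sketched as follows. This is a minimal illustration, assuming item responses are already coded numerically and that a missing answer is represented by `None`; the function name is hypothetical, and the sketch returns the mean of the answered items as described, without any further rescaling.

```python
def scale_score(item_values):
    """Return the mean of answered items, or None if the scale is missing.

    A scale is declared missing when more than half of its items are missing.
    """
    answered = [v for v in item_values if v is not None]
    missing = len(item_values) - len(answered)
    if 2 * missing > len(item_values):
        return None  # more than one half of the items are missing
    return sum(answered) / len(answered)

print(scale_score([3, None, 2]))     # one of three missing: scored
print(scale_score([4, None]))        # exactly half missing: still scored
print(scale_score([None, None, 2]))  # two of three missing: scale is missing
```

Note that for a two item scale such as SF, a single answered item is enough to produce a score under this rule.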
The SF-36 scale scores were constructed using the method of summated ratings based on five assumptions12,22,23: (1) Categorical item responses should be on an interval scale. When the assumption is violated, the responses should be recoded to suit actual differences. This assumption could be checked only for scales that had more than two items with multiple choices: GH, PF, VT, MH. We computed, for each response of an item, the average value of the remaining items in the same scale. Then, we assigned empirical scores to each response level in the following fashion: the lowest response level was given the score of 1, the highest response level the score of K (for a total of K response levels), and the intermediate response levels were assigned scores that reflected the observed intervals.14 (2) Items of a given scale should have approximately equal variances and means. (3) Item-scale correlations should be roughly equal for all items in a given scale. (4) Convergent validity: the correlation of each item with its hypothesised scale, corrected for overlap, should be 0.40 or above. (5) Discriminant validity: the correlation of each item with its hypothesised scale should be significantly higher than the correlations of the same item with competing scales (t test for correlation coefficients24).
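Two of the checks above lend themselves to a short sketch: the empirical rescoring of response levels in assumption (1) and the overlap-corrected item-scale correlation in assumption (4). Both functions and their names are illustrative. In particular, the linear rescaling that maps the smallest criterion mean to 1 and the largest to K is an assumption about how the intermediate scores were derived, not a procedure confirmed by the text.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def corrected_item_scale_correlation(items, index):
    """Correlate one item with the sum of the OTHER items in its scale.

    `items` holds one list of respondent scores per item.
    """
    others = [col for i, col in enumerate(items) if i != index]
    rest_sum = [sum(vals) for vals in zip(*others)]
    return pearson(items[index], rest_sum)

def empirical_scores(item, other_items, levels):
    """Mean of the remaining items per response level, rescaled to 1..K.

    Assumes every level in `levels` was chosen by at least one respondent
    and that the level means are not all equal.
    """
    rest_mean = [sum(vals) / len(vals) for vals in zip(*other_items)]
    means = []
    for level in levels:
        group = [r for v, r in zip(item, rest_mean) if v == level]
        means.append(sum(group) / len(group))
    lo, hi, k = min(means), max(means), len(levels)
    return [1 + (k - 1) * (m - lo) / (hi - lo) for m in means]

scale = [[1, 2, 3, 4, 5], [2, 1, 4, 3, 5], [1, 3, 2, 5, 4]]
r = corrected_item_scale_correlation(scale, 0)
print(r >= 0.40)  # convergent validity criterion for the first item
print(empirical_scores([1, 1, 2, 2, 3, 3], [[1, 1, 2, 2, 5, 5]], [1, 2, 3]))
```

When the rescaled scores are unevenly spaced or out of order, the equal interval assumption is in doubt for that item.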
Reliability was estimated using the test-retest method and the internal consistency method (Cronbach’s α). A minimum Cronbach’s α coefficient of 0.7 is considered satisfactory for group level comparisons.9 Validity was assessed using convergent and discriminant validity checks, factor analysis, and construct validity. Factor analysis was expected to yield two principal components, named physical health and mental health. In the test of construct validity, or known groups validity, scale scores were compared across groups known to differ, using external information independent of the SF-36. It was hypothesised that SF-36 scores for the old would be lower than those for the young; women would have lower scores than men; and people reporting longstanding health conditions would have lower scores than those without any such conditions.10,25
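For readers unfamiliar with the internal consistency coefficient, a minimal sketch of the standard Cronbach’s α formula follows: α = k/(k − 1) × (1 − Σ item variances / variance of the summed score), where k is the number of items. The function name and the toy data are illustrative and are not taken from the study.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for one scale.

    `items` holds one list of respondent scores per item, all the same length
    and with no missing values.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variance_sum = sum(variance(col) for col in items)
    totals = [sum(vals) for vals in zip(*items)]  # summed score per respondent
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Toy two-item scale answered by five respondents.
toy_scale = [[1, 2, 3, 4, 5], [2, 1, 4, 3, 5]]
alpha = cronbach_alpha(toy_scale)
print(round(alpha, 3), alpha >= 0.7)  # compare with the 0.7 group-level threshold
```

Because α depends on the number of items as well as on their intercorrelation, short scales such as the two item SF scale tend to yield lower values.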
All statistical analyses were carried out using the Statistical Package for the Social Sciences (SPSS 7.0 for Windows).
The Chinese SF-36 translation was equivalent to the original version with a few exceptions. Bowling and playing golf (PF02) were common among Americans and Europeans but not among Chinese. In this version, mopping the floor and practising Tai-Chi were added as complementary examples of moderate activities for clarity, because we did not know exactly whether they were culturally equivalent. Translating a mile into its mathematically correct equivalent of 1609 metres expresses a degree of accuracy not intended in the original form. Thus, one mile was translated into 1500 metres. One block was translated into the distance between two street crossings. Some difficulties were also encountered in producing Chinese expressions equivalent to “full of pep” (VT01) and “have a lot of energy” (VT02). In this Chinese version, VT01 conveyed that one is ready to work physically and spiritually, while VT02 emphasised physical health.
Completeness of data
Of the 1972 eligible subjects, 1688 (85.6%) responded. The mean age was 46.0 years. Myers’ index was 7.94, suggesting fairly accurate age reporting. Among the respondents, 859 (50.9%) were male. Education levels: 23 (1.4%) were illiterate or quasi-illiterate, 243 (14.4%) had primary school education, 1115 (66.4%) had middle school education, and 299 (17.8%) had college or higher education. Marital status: 175 (10.5%) were unmarried, 1400 (84.4%) were married, 25 (1.5%) were separated or divorced, and 59 (3.6%) were widowed. The mean time to complete the questionnaire was 10 minutes. Altogether 1316 (78.0%) respondents answered all 36 items. On average, 3.8% of responses per item (range 0.3%–6.6%) were missing.
Non-respondents tended to be older, female, and less educated. Of them, 54.3% were 65 years old and over, 64.6% were women, and 65.5% were illiterate or quasi-illiterate. There were significant differences in age, sex, marital status, education level, occupation, and family patterns between respondents and non-respondents (p<0.05). Logistic regression models suggested that a higher education level and closer family ties were predictive of response (p<0.05).
Tests of scaling assumptions
The assumption of equal intervals was well supported in the GH and PF scales. Going from the least to the most favourable answer, average empirical scores were 1.0, 3.0, 4.0, 4.5, 5.0 for item GH01, 1.0, 1.5, 2.5, 3.5, 5.0 for items GH02–GH05, and 1.0, 2.0, 3.0 for the PF scale. However, the assumption was violated in the VT and MH scales, where the positions of the two most undesirable responses were switched. The empirical scoring schemes were 1.4, 1.0, 1.8, 3.6, 4.7, 6.0 for the VT scale and 2.7, 1.0, 1.2, 2.8, 4.2, 6.0 for the MH scale.
The clustering and ordering of item means was the same as that of the source version22 and other Chinese versions,17,18 except for items GH01, PF02, PF03. The items for each scale had similar standard deviations except those for the PF, BP, SF scales. Table 1 shows the results of item convergent and discriminant validity tests. Correlations between items and hypothesised scale were 0.4 or above for all except item VT03 and the SF scale. The average scaling success rates were 91.4% (32 of 35) for convergent validity, and 92.5% (259 of 280) for discriminant validity.
Cronbach’s α reliability coefficients ranged from 0.72 to 0.88 for six scales, and were 0.66 for the VT scale and 0.39 for the SF scale. The coefficient for the SF scale was equal to or below the correlations between the SF scale and the RE and MH scales. The correlation between the MH and the VT scale was 0.52. Table 2 compares Cronbach’s α in studies using different Chinese SF-36 versions.17,18 The two week test-retest reliability coefficients ranged from 0.66 to 0.94.
Factor analysis identified two principal components that explained 56.3% of the total variance. However, the results were not entirely consistent with the hypothesised model.9 The PF scale loaded on the “physical” factor with a loading of 0.59, which was lower than that of the RP scale. The RE scale was found to have a strong association with the “physical” factor and a weak association with the “mental” factor. The VT scale was found to have a higher loading on the “mental” factor than the MH scale. The SF scale was fairly evenly loaded on both factors (table 3).
With the transition of the disease spectrum, health related quality of life (HRQOL) instruments are becoming necessary tools in health status measurement and clinical effectiveness assessment. Although many have been developed for Western populations, few are available to the Chinese.
We report the development of a self administered Chinese (mainland) version of the Short-Form Health Survey (SF-36) and the results of psychometric testing, reliability, and validity among the general population.
The Chinese (mainland) version of the SF-36 functioned in the general population of Hangzhou much as it did in the American population tested.
The results of studies on application of different versions of the Chinese SF-36 to different Chinese groups were compared.
To improve the Chinese SF-36 scales, further studies among various Chinese regions and ethnic groups are needed.
As tables 4 and 5 show, all the scale scores for the old were lower than those for the young (p<0.05). Women had lower scores than men in all scales except the RE scale; the differences were significant (p<0.05) in the PF, BP, GH, and VT scales. Table 6 presents the norm reference by age and sex group. The comparison of the SF-36 scale scores for different Chinese populations and the US norms is given in table 7.9,17,26
The translation process set by the IQOLA Project entails forward translations by at least two translators who are native speakers of the target language, rating of translation quality by two other bilinguals, and back translations by two translators who are native speakers of American English or British English.20 Because native English speakers were unavailable, we did not fully adhere to this strategy. Our study suggested that the Chinese (mainland) version of the SF-36 functioned in the general population of Hangzhou, mainland China, much as it did in the original American population tested. Apart from the SF scale, seven scales succeeded in convergent and discriminant validity tests. Cronbach’s α coefficients of six scales were satisfactory for group comparison. The two week test-retest coefficients showed moderate to strong association. Factor analysis identified two principal components. The Chinese SF-36 could distinguish known groups successfully.
However, there are still a few areas that need further examination. The items PF02 “moderate activities” and PF03 “lifting or carrying groceries” had lower means than the preceding item cluster. This may be because “moderate activities” such as bowling and golf are uncommon and considered difficult to perform among Chinese, and the complementary example of practising Tai-Chi is popular only with some old Chinese men. The same applied in the US study.17 “Lifting or carrying groceries” is abstract to Chinese in the mainland. The scaling assumption of equal item variance could not be satisfied in the PF, BP, and SF scales. The standard deviations of PF05, PF09, and PF10, which measure low levels of functioning, were smaller than those of other items in the same scale because more than 85% of the subjects scored the highest score of 3 on these three items. The standard deviations of items BP02 and SF01 were smaller than those of items BP01 and SF02 respectively. The same was found in studies in Hong Kong and the US.17,18 This finding seems to point to differences in the cultural interpretation of items. Under the deeply ingrained Confucian ideology of collectivism, it is socially unacceptable for Chinese to use “sickness” as an excuse to avoid working or socialising with others.
The Chinese (American Chinese) version produced similar findings with respect to reliability and to convergent and discriminant validity tests.17 Both versions found poor (<0.4) levels of item-scale correlation for the SF scale. The item SF01 was more highly related to the BP, RE, and MH scales, and the item SF02 was more highly related to the VT and MH scales. The item VT03 was more highly correlated with the MH scale than with its parent scale. Cronbach’s α coefficient was below 0.70 for the SF scale. The MH scale was strongly correlated with the VT scale. However, findings from the application of the Chinese (Hong Kong) version had less in common with these data.18 Correlations between items and hypothesised scale were 0.4 or above for all except items PF03, PF05, PF09, PF10, and GH01. The scaling success rate for discriminant validity was 100% for all scales except the PF scale. Cronbach’s α coefficients were higher than the inter-scale correlations for all the scales, but that for the SF scale was still below 0.7. Given the apparent regional differences in China in terms of economy, culture, and even language, further research among various Chinese regions and ethnic groups is needed to improve the Chinese SF-36.
Of the eight scales, the SF scale was least satisfactory in the scaling assumption testing, because it has only two items and lower item homogeneity. Factor analysis revealed two principal components, but there were still some deviations from the hypothesised model. The study of a Chinese (Taiwanese) SF-36 version produced similar results.19 Results of a Chinese (Hong Kong) version fit the hypothesised physical/mental health structure better,18 but application of the same version in a large sample in Singapore yielded a pattern of factor correlations comparable to that in our study.27 This suggests that the conceptual framework of the instrument needs to be further improved for cross cultural health status measurement.
We gratefully acknowledge Mr Shanrong Cai, Miss Weining Ma, and Mrs Ye Gu for their help in translation, and Professor Qianjin Jiang, Professor Jian Wang, and Professor Gengyao Fu for their useful comments on psychometric tests. We also thank the staff of Xiacheng and Gongshu District Health Bureau for their assistance in the health survey.
Funding: the study was supported by a grant from the Science and Technology Bureau of Zhejiang Province of China (grant no: 991104209).
Conflicts of interest: none.