Validity of self reported diagnoses of cancer in a major Spanish prospective cohort study
  1. C Navarro1,
  2. M D Chirlaque1,
  3. M J Tormo1,
  4. D Pérez-Flores2,
  5. M Rodríguez-Barranco1,
  6. A Sánchez-Villegas3,
  7. A Agudo4,
  8. G Pera4,
  9. P Amiano5,
  10. M Dorronsoro5,
  11. N Larrañaga5,
  12. J R Quirós6,
  13. E Ardanaz7,
  14. A Barricarte7,
  15. C Martínez8,
  16. M J Sánchez8,
  17. A Berenguer4,
  18. C A González4
  1. 1Department of Epidemiology, Murcia Health Council, Murcia, Spain
  2. 2Department of Socio-Medical Sciences, Faculty of Medicine, Murcia University, Spain
  3. 3Department of Clinical Sciences, University of Las Palmas de Gran Canaria, Spain
  4. 4Catalan Institute of Oncology, Barcelona, Spain
  5. 5Department of Public Health of Guipuzkoa, San Sebastián, Spain
  6. 6Consejería de Salud y Servicios Sanitarios del Principado de Asturias, Oviedo, Spain
  7. 7Public Health Institute of Navarra, Pamplona, Spain
  8. 8Andalusian School of Public Health, Granada, Spain
  1. Correspondence to:
 Dr C Navarro
 Servicio de Epidemiología. Consejería de Sanidad, Ronda de Levante 11, E- 30008 Murcia, Spain; carmen.navarro{at}


Introduction: This study aims to assess the validity of self reported diagnoses of cancer by persons recruited for the Spanish EPIC (European prospective investigation into cancer and nutrition) cohort study and to identify variables associated with correctly reporting a diagnosis of cancer.

Methods: 41 440 members of EPIC were asked at the time of recruitment whether they had been diagnosed with cancer and the year of diagnosis and site. The process of validating self reported diagnoses of cancer included comparison of the cohort database with the data from the population based cancer registries. Cancer diagnostic validity tests were calculated. The association between a correct report and certain sociodemographic, tumour related, or health related variables was analysed by logistic regression.

Results: The overall sensitivity of self reported diagnoses of cancer is low (57.5%; 95% CI: 51.9 to 63.0), the highest values being shown by persons with a higher level of education or with a family history of cancer and the lowest values by smokers. Breast and thyroid cancers are those with the highest diagnostic validity and uterus, bladder, and colon-rectum those with the lowest. In both sexes the variables showing a significant association with a correct report of cancer are: higher education level, number of previous pathologies, invasive tumour, and, in women, a history of gynaecological surgery.

Conclusions: The overall sensitivity of self reported diagnoses of cancer is comparatively low, and self report is not recommended for identifying tumours in epidemiological studies. However, self reported diagnoses might be highly valid for certain tumour sites, malignant behaviour, and average to high levels of education.

  • validation
  • cancer
  • self reported diagnosis
  • cancer registry
  • EPIC

Validation of self reported diagnoses of cancer by means of a questionnaire has been undertaken extensively in epidemiological research, usually evaluating the utility of answers on family histories of cancer,1 screening tests,2 and personal histories of cancer.3 However, in cohort studies where the incidence of cancer is the end point, the aim of these studies is twofold: they are used in the recruitment stage to exclude persons who already have cancer (prevalent cases) and serve during follow up to identify incident tumours.4,5

Although for many years studies have been conducted on the agreement between health data from questionnaires and those provided by medical records,6,7,8,9,10 fewer have validated questions relating to cancer11–15 and fewer still have compared questionnaire data against data from the population based cancer registries that cover the areas of study.16–19 The ability to self report cancer differs widely across countries: the highest validity has been reported in the USA13,19 and the lowest in Japan.3 In Europe, sensitivity ranged between 67% and 82%.16,18

The main factors associated with a correctly self reported diagnosis of cancer that are consistently given in the literature are age, sex, and level of education, although other associated variables have been found such as tobacco smoking, site, and behaviour of the tumour or time between diagnosis and interview. The differences seen among countries have been partly attributed to socio-cultural determinants and the variability in the practice by medical practitioners.3,17–20

Studies on the validity of self reported diagnoses of cancer are scarce in countries outside the USA and northern Europe, and this is the first study conducted in a southern European country. The aims of the study are to estimate the validity of self reported diagnoses of cancer at enrolment (prevalent cases) by persons recruited for the cohort of the EPIC-Spain study and to identify the variables associated with correctly reported diagnoses.


EPIC-Spain cohort

The Spanish EPIC (European prospective investigation into cancer and nutrition) cohort forms part of a larger cohort recruited in 10 European countries to study the relation between diet, cancer, and other chronic diseases.21,22 In Spain it is made up of 41 440 people recruited between 1992 and 1996. For this study on the validation of self reported diagnoses of cancer we have excluded a total of 202 cases of cancer, most of them (124 cases) for being non-melanocytic skin cancers, 74 cases because they self reported cancer at a time before the introduction of the corresponding cancer registries and four cases for being multiple primary tumours. Thus the final analysis includes 41 238 persons aged from 29 to 69 years.

The participants come from five regions: three from the north (Asturias, Guipuzkoa, and Navarra) and two from the south (Granada and Murcia).23 They are healthy volunteers, mainly blood donors, who received a letter of invitation and agreed to participate. At the time of recruitment each person gave information on their dietary habits together with their anthropometric measurements and a blood sample.24 In addition they were given a questionnaire on lifestyles and other non-dietary factors, including tobacco smoking, physical activity, level of education, medical history (heart attack, diabetes, cancer, hypertension, etc), and family history of cancer. The section on personal medical history included the question: “Has a doctor ever told you that you suffer or have suffered from cancer?” If so, they were to indicate the age they were when it began and specify, with an open response, the site.

The persons who agreed to participate in the study gave their informed consent. They were informed that all the information given was confidential and that the databases were registered at the Spanish Data Protection Authority as stipulated by law.

Cancer registries

The five regions in which the EPIC-Spain study is conducted have population based cancer registries,25 form part of the European Network of Cancer Registries,26 publish their data regularly in the series Cancer Incidence in Five Continents by the International Agency for Research on Cancer,27 and began operating between 1970 (Navarra) and 1986 (Basque country). Registrable cases are all new malignant tumours, including in situ carcinomas, in any anatomical site, except non-melanoma skin tumours in Asturias and the Basque country. The International Classification of Diseases for Oncology, 2nd edition,28 was used to codify the anatomical site and morphology of tumours according to the information obtained from the questionnaire.

Record linkage of self reported cases with registered cases

The history of cancer reported in the interviews was compared with the data from the cancer registries in the five regions, with the validation period ranging from 6 to 26 years before the date of interview (median: 10 years) depending on the date of introduction of the registries.

Record linkage is a widely tested procedure29–31 used by disease registries to validate information and detect cases, and in cohort studies to establish the presence or absence of disease and the person-years at risk.32 With this aim a software application was developed to perform a probabilistic linkage between the data from the EPIC study and those from the population cancer registries of the five participating communities. The first stage of the record linkage process consists of the specific normalisation of each of the variables that will form part of the probabilistic linkage. In the second stage a duplicate control generates a table containing the persons for whom some or all of the cross-variables match. The programme then performs the probabilistic linkage (EPIC-Link), assigning different values to the persons in the EPIC study depending on the similarity of the cross-variables to those contained in the cancer registries. Scores are assigned to the most important identification variables according to data available from other studies,33,34 the simulations carried out, and the calculated odds of a random match between two persons. After multiple tests to determine the cut off point in the score beyond which the record linkage results should be reviewed, one was established that would maintain a high positive predictive value (PPV) without losing a high sensitivity. The cases finally detected as possible matches were rigorously inspected with a view to their confirmation. Other variables that were available in both sources were also used.
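The staged process described above (normalisation, scoring of cross-variables, and a review cut off) can be sketched as follows. This is only an illustration and not the EPIC-Link programme itself: the field names, the weights, the cut off value, and the use of a generic string similarity are all assumptions introduced for the example.

```python
from difflib import SequenceMatcher
import unicodedata

# Hypothetical agreement weights per identification variable (not from the paper).
WEIGHTS = {"surname": 4.0, "first_name": 2.0, "birth_date": 5.0, "sex": 1.0}
REVIEW_CUTOFF = 8.0  # hypothetical score at or above which a pair is reviewed


def normalise(value: str) -> str:
    """Stage 1: remove accents, case differences and extra whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.upper().split())


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1], computed on the normalised values."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def link_score(cohort_rec: dict, registry_rec: dict) -> float:
    """Sum of the variable weights, each scaled by the similarity of that variable."""
    return sum(
        weight * similarity(cohort_rec[field], registry_rec[field])
        for field, weight in WEIGHTS.items()
    )


def candidate_matches(cohort_rec: dict, registry: list) -> list:
    """Registry records scoring at or above the cut off, best score first."""
    scored = [(link_score(cohort_rec, rec), rec) for rec in registry]
    kept = [pair for pair in scored if pair[0] >= REVIEW_CUTOFF]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

In a sketch like this, raising `REVIEW_CUTOFF` trades sensitivity for positive predictive value, which is the balance the cut off described in the text was tuned for; the pairs returned would still need manual inspection.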

Clinical records of reported cases not linked with the cancer database were reviewed in each region before accepting a case as a false positive. If it really was a true positive case missed by the cancer registry it was added to the cancer registry and accounted as a true positive self reported case.

Study variables

The sociodemographic variables included in the analysis are sex, age (29–44, 45–54, and 55–69 years), and maximum level of education attained (incomplete primary school, primary, and secondary or further education). The associations with disease related variables were analysed, such as age at the time of diagnosis, year of diagnosis (⩽1989, ⩾1990), number of years between questionnaire and diagnosis (defined as the interval between the date of the cancer diagnosis and the date of the baseline questionnaire, grouped into four categories: 0–2, 3–4, 5–7, and >7 years), level of tumour infiltration (infiltrating or in situ), and the basis of diagnosis (microscopical or clinical confirmation). Also included were variables related to health status and lifestyles, such as tobacco smoking, family history of cancer, gynaecological surgeries in women, number of previous pathologies (myocardial infarction, angina, stroke, other circulatory problems, hypertension, hyperlipaemia, diabetes, kidney stones or gallstones, polyps, peptic ulcer, asthma, and urinary infection), being a blood donor, and body mass index (BMI).

Analysis of data

Sensitivity, specificity, PPV, and negative predictive value (NPV) with their corresponding 95% confidence intervals were calculated using the reports from the cancer registry as a standard reference (table 1). Sensitivity was calculated as the proportion of persons with a cancer report in a registry who also self reported the cancer; specificity as the proportion of persons not found in the registry who did not self report cancer. The PPV was calculated as the proportion of persons with a self reported cancer during the years of registry operation who had a matching cancer report in the registry; and NPV as the percentage of persons who denied having cancer and whose name did not appear in the cancer registry. The analysis was done for all the persons overall and, stratified, by sociodemographic variables, lifestyles, and health status. Certain variables had missing values (tobacco smoking 0.1% of the cohort and level of education 0.7%). Excluded were non-melanocytic skin tumours, for not being recorded in all the cancer registries, and multiple primary tumours.

Table 1

 2×2 table used in the calculation of validity indicators
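The four indicators defined in the analysis section can be written compactly. The cell labels TP, FP, FN, and TN are notation introduced here for illustration (self report against registry as the reference) and are not necessarily the labels used in table 1.

```latex
% TP: self reported and in registry      FP: self reported, not in registry
% FN: not self reported, in registry     TN: not self reported, not in registry
\begin{aligned}
\text{Sensitivity} &= \frac{TP}{TP + FN}, &
\text{Specificity} &= \frac{TN}{TN + FP},\\[4pt]
\text{PPV} &= \frac{TP}{TP + FP}, &
\text{NPV} &= \frac{TN}{TN + FN}.
\end{aligned}
```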

Also analysed were the validity indicators according to tumour site, including as false negatives any persons identified in the cancer registries not reporting cancer or reporting it in a different site to that analysed, and as false positives any non-confirmed cancers or cancers confirmed in another site (table 2).

Table 2

 2×2 table used in the calculation of validity indicators by site. For example, breast cancer
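The site-specific rule described above (a cancer reported or registered at a different site counts against the site being analysed) can be sketched as a small classification function. The function names and the way sites are encoded as strings are assumptions made for this example only.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def classify(self_site: Optional[str], registry_site: Optional[str], target: str) -> str:
    """Return the 2x2 cell (TP, FP, FN or TN) for one person and one target site.

    None means no self report (or no registry record) of any cancer.
    """
    reported = self_site == target
    registered = registry_site == target
    if reported and registered:
        return "TP"
    if registered:
        return "FN"  # registered at the target site; not reported, or reported elsewhere
    if reported:
        return "FP"  # reported at the target site; not confirmed, or confirmed elsewhere
    return "TN"


def site_table(people: Iterable[Tuple[Optional[str], Optional[str]]], target: str) -> Counter:
    """Count the four cells over (self_site, registry_site) pairs."""
    return Counter(classify(s, r, target) for s, r in people)
```

Under this rule a woman registered with breast cancer who reported a uterine cancer counts as a false negative for breast and as a false positive for uterus.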

A multivariate logistic regression analysis was used to identify variables associated with a correctly self reported history of cancer using the crude and adjusted odds ratio including the 95% confidence interval.
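For a single binary variable, the crude odds ratio from a univariate logistic regression equals the cross-product ratio of the corresponding 2×2 table. The sketch below computes it with a 95% confidence interval on the log scale (Woolf's method, chosen here for simplicity; the paper does not state how its intervals were derived). Adjusted odds ratios need a fitted multivariate model and are not reproduced here.

```python
import math
from typing import Tuple


def crude_odds_ratio(a: int, b: int, c: int, d: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Crude odds ratio of a correct report with a Woolf-type confidence interval.

    a: exposed, reported correctly      b: exposed, not reported
    c: unexposed, reported correctly    d: unexposed, not reported
    All four cells must be non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only; they are not data from this study.
estimate, lower, upper = crude_odds_ratio(a=40, b=10, c=60, d=60)
print(f"crude OR {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```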


The final analysis included 41 238 persons. Of the 260 persons who reported a cancer, 184 were confirmed as malignant tumours by the population cancer registries, 177 by record linkage, and seven after reviewing the clinical record of those not linked. A hundred and thirty six persons not self reporting a diagnosis of cancer were identified with cancer by the registries, accounting for a 43% rate of false negatives in this population.

Table 3 shows the main characteristics of the persons who did or did not self report a diagnosis of cancer. Of note is the older age, higher percentage of women, family history of cancer, and, in women only, gynaecological operations among the persons who correctly self reported cancer, whereas in the same group tobacco smoking or being a blood donor was less common.

Table 3

 Sociodemographic characteristics and variables related to health status according to whether persons of the EPIC-Spain cohort self report cancer

The diagnoses of tumour confirmed by the cancer registry with regard to the total number of cases reported (table 4) give an overall sensitivity of 57.5% (95% CI: 51.9 to 63.0) and a specificity of 99.8% (95% CI: 99.8 to 99.9). The total positive predictive value was 70.8% (95% CI: 64.8 to 76.2) and negative predictive value 99.7% (95% CI: 99.6 to 99.7). The highest sensitivity values are shown by persons with a higher level of education (74.0%; 95% CI: 59.7 to 84.4) or with a family history of cancer and the lowest values by smokers (43.1%; 95% CI: 30.9 to 56.0). Men show one of the lowest sensitivity values (48.4%; 95% CI: 37.9 to 59.0) but a high positive predictive value of 81.8% (95% CI: 64.8 to 76.2).

Table 4

 Indicators of diagnostic validity and 95% confidence intervals (CI) of self reported prevalent cancer. EPIC-Spain cohort
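The overall point estimates can be rebuilt from the counts given in the text (41 238 persons, 260 self reports, 184 confirmed, 136 registry cases not self reported). The sketch below does this and attaches Wilson score intervals. The interval method is an assumption: the paper does not say which method was used, so the limits printed here may differ slightly from the published ones.

```python
import math
from typing import Dict, Tuple


def wilson_ci(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def validity(tp: int, fp: int, fn: int, tn: int) -> Dict[str, Tuple[float, float, float]]:
    """Estimate, lower and upper limit for each validity indicator."""
    cells = {
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "ppv": (tp, tp + fp),
        "npv": (tn, tn + fn),
    }
    return {name: (k / n, *wilson_ci(k, n)) for name, (k, n) in cells.items()}


# Counts taken from the Results text.
cohort, self_reported, tp, fn = 41238, 260, 184, 136
fp = self_reported - tp          # self reports not confirmed by the registries
tn = cohort - tp - fp - fn       # everyone else

for name, (est, lo, hi) in validity(tp, fp, fn, tn).items():
    print(f"{name}: {100 * est:.1f}% ({100 * lo:.1f} to {100 * hi:.1f})")
```

The point estimates agree with the reported 57.5%, 99.8%, 70.8%, and 99.7%.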

Table 5 shows the validation analysis taking into account the topography of the tumour. Breast cancer is seen to present the highest sensitivity (84.5%; 95% CI: 75.0 to 91.5), followed by thyroid cancer (61.9%; 95% CI: 38.4 to 81.9); cancer of the cervix uteri and corpus uteri, bladder cancer, and colorectal cancer have the lowest sensitivity, with very low values of between 13% and 17%. When the cervix uteri, corpus uteri, and uterus with no other specification are grouped in a single site, sensitivity improves slightly.

Table 5

 Indicators of diagnostic validity and 95% confidence intervals (CI) of self reported prevalent cancer by tumour site. EPIC-Spain cohort (n = 41238)

Variables associated with a correctly reported history of cancer were identified using the 320 cases of cancer found in the cancer registries (true positives and false negatives). The univariate analysis shows that correct self reporting is more frequent among women, persons with a higher level of education, those diagnosed more recently, those with an infiltrating cancer and, in the case of women, those with a history of gynaecological surgery. Many variables lose their significance in the adjusted analysis, but higher level of education (adjusted OR: 6.41; 95% CI: 1.98 to 20.82), infiltrating nature of the tumour (adjusted OR: 3.15; 95% CI: 1.17 to 8.52) and, in the case of women, history of gynaecological surgery (adjusted OR: 2.68; 95% CI: 1.10 to 6.52) remain significant. Being a woman is very close to statistical significance (table 6).

Table 6

 Variables associated with a correctly reported cancer in the members of the EPIC-Spain cohort (cases of cancer included)

What is already known on this subject

  • A few studies have been carried out on the validity of self reported diagnoses of cancer that use population based cancer registries as a reference method.

  • Most of the studies show acceptable or high sensitivity depending on the tumour site and very high specificity.

  • The main factors associated with a correctly self reported diagnosis of cancer are age, sex, and level of education

  • Self reported diagnosis of breast cancer has a very high validity


The likelihood of a correctly self reported cancer in this study is not very high (58%) and in the range of other studies.16–18 However, big differences have been seen among countries: in the USA, Bergmann et al report the highest overall sensitivities in the cancer prevention study19 (79% for an exact agreement of site and year of diagnosis (±1 year) and 93% for any diagnosis of cancer), and values between 70% and 96%, depending on site, have been reported in the California teachers study.20 Only in Connecticut were lower values found, but that study has a community based design whereas the former were done on selected groups in cancer prevention studies.17 In Japan3 (36% for any type of cancer) and France5 (21% in persons aged over 75 years) sensitivities are much lower. In Europe, our study, the first in a Mediterranean country, shows lower sensitivity than the two previously published in north European countries.16,18 On the other hand, all the studies published show very high specificities, so the presence of cancer can be ruled out with relative certainty if the person denies having received a medical diagnosis of the disease. In other words the persons in this study tend to underreport rather than overreport a history of cancer, which is why estimated specificity and NPV agree with those of other studies that report values of over 90%.16,18

The high sensitivity encountered in the Spanish EPIC cohort for breast and thyroid cancer (85% and 62% respectively) coincides with the sites that present a high sensitivity in the California teachers study20 (96% and 93% respectively), although the values in the Spanish EPIC cohort are considerably lower; they also coincide in that cervical and endometrial cancer are those with the lowest sensitivity. The JPHC study cohort3 finds a high sensitivity for breast (81%) and low sensitivity for uterus (42%) and colon-rectum (14%); the latter site is similar in the EPIC study (17%), the same as in the cancer prevention study 2 nutrition survey,19 with a 91% sensitivity for breast and 16% for colon-rectum.

It has been suggested that cancers with very clear cut diagnostic criteria, such as breast and thyroid cancer, are more likely to be reported than cancers that have more ambiguous diagnostic procedures.20 Chambers et al suggested that reporting might be lower for cancers that have a large proportion of less severe histological types, such as cervical cancer.35

In our study, older age at diagnosis and higher number of years between diagnosis and date of interview were associated with self reporting incorrectly although these associations were not statistically significant, possibly because of the relatively low number of cases. Desai et al17 also find that old age and longer time between diagnosis and interview are associated with self reporting incorrectly. Likewise, in the study by Parikh-Patel et al20 old age reduces the validity of the diagnosis. In one of the cohorts of the EPIC-Sweden study Manjer et al18 find that the overreporting or underreporting of cancer occurs more often among elderly persons. In our study women report better than men, although this association is not statistically significant. Sex also influences the report of tumours in other studies as occurs in the Swedish EPIC cohort,18 but in this case a higher frequency of overreporting is seen among women.

We found that invasive tumours are self reported better than those in situ. Parikh-Patel et al20 report an increase in sensitivity when tumours are invasive (98.1%) or not in situ (87.8%). In women, a history of gynaecological surgery is usually associated with tumour related operations, which is why the result here gives greater validity to the information analysed.

What this study adds

  • First study conducted in a Mediterranean country.

  • The sensitivity of self reported diagnoses of cancer in this cohort is comparatively low and it is not recommended in epidemiological studies for identifying tumours.

  • For certain tumour sites, malignant behaviour, and medium to high levels of education a self reported diagnosis may have a high validity

The most important factor associated with a correct self report is level of education. This result is in agreement with other studies.3,5,19 The most plausible reasons for the wide differences seen among countries appear to be sociocultural factors and variability in medical practice. This is in concordance with the finding that patients are more often informed about a cancer diagnosis in the USA and Scandinavia than in southern Europe and Asia.18 Recent studies36 that have examined how much oncological patients actually know about their disease have shown that only a third of the patients know for certain, whereas another third do not know but suspect, and the other third are completely unaware. A study carried out in Spain showed that perceived intelligence and emotional control in the patients were the best predictors of the decision by doctors to give information. Age and socioeconomic status were also significantly associated with the doctors’ information giving practices.37

One limitation of the study is the external validity of the results, as although sensitivity and specificity are tests of internal validity, the predictive values are indeed influenced by the selection of the population. Although in this cohort study the participating persons are the general population, they were not selected as representative of this population. However, as they do belong to different Spanish regions, include a large number of people of different levels of education, reside in rural and urban areas, and present a smoking habit distribution similar to the general population, the cohort does not show any notable differences when compared with Spain’s adult population.38

The population cancer registries included in this study show good indicators of quality27 with a high coverage. There are no self reported cases in which it was not possible to ascertain if they were tumours or not, as whenever there was doubt the patient’s medical record was consulted or more information on the tumour was requested.

Incompleteness of the cancer registries could bias the results in two ways. First, it could increase false positives, particularly for cases with an incidence date close to the launch of each registry. We think that this bias is unlikely because there are no self reported cases in which it was not possible to ascertain if they were tumours or not: for all self reported cases not linked with the cancer registry database the patient’s medical record was consulted or more information on the tumour was requested. The incidence dates of the cases lost to registration were distributed along the study period, between 1987 and 1995. Second, incompleteness could lead to sensitivity being overestimated. Seven of 184 true positive cases were lost to registration by cancer registries. If the proportion of cases lost to registration among false negative cases were the same, the completeness of registration for all registries together would be 96%. Thus, the true sensitivity would be slightly lower. Another possible bias could arise because we have not validated self reported tumours diagnosed before the cancer registries were started. Because the interval between questionnaire and diagnosis was less than eight years for 85% of self reported cases (table 6) and the overlap between the cancer database and the reported incidence dates is 16 years, this bias is very unlikely to have occurred.
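The completeness argument can be followed with a short worked calculation. The first line uses only figures from the text; the corrected sensitivity is an illustrative extrapolation under the stated assumption that registration was equally incomplete among persons who did not self report, and it is not a figure given by the authors.

```latex
\begin{aligned}
\text{Completeness} &\approx \frac{184 - 7}{184} = \frac{177}{184} \approx 0.962 \quad (96\%)\\[4pt]
\text{False negatives, corrected} &\approx \frac{136}{0.962} \approx 141.4\\[4pt]
\text{Sensitivity, corrected} &\approx \frac{184}{184 + 141.4} \approx 0.565 \quad \text{versus } \frac{184}{184 + 136} = 0.575 \text{ observed}
\end{aligned}
```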

Another problem is the possibility that some persons not self reporting a malignant tumour were recorded as cases in the cancer registries and not detected by record linkage. But the EPIC databases are annually cross checked by record linkage with the data from the corresponding cancer registries and national population death database, and the rate of follow up of this cohort is very high, with less than 2% lost to follow up.38

It can therefore be concluded that the validity of self reported diagnoses of cancer in this large Spanish cohort is comparatively low, which is the reason why the overall information provided by the participants cannot be used individually for tumour detection. Population based cancer registries seem to be essential for ascertaining cancer cases in cohort studies carried out in Spain. However, for certain tumour sites, malignant behaviour, and medium to high levels of education a self reported diagnosis may have a high validity.



  • Funding: the project was financed by the European Union Europe Against Cancer Programme, the Spanish Health Research Fund (project no 99/0024) and the Spanish Autonomous Communities and participating Institutions. Some centres receive aid from the ISCIII Network RCESP (C03/09).

  • Competing interests: none.
