Adequately measuring physical activity (PA) is important for determining trends in PA levels over time, for evaluating the effect of PA interventions and for determining the health benefits of PA. Poor measurement of PA may hinder detection of important associations or effects.[1] Many questionnaires have been developed to measure PA. Some questionnaires were developed specifically for a certain subgroup or setting; others were developed because researchers were not aware of existing questionnaires or were not satisfied with those available. Often researchers needed to translate and/or adapt existing questionnaires for other target groups. This has led to a large number of (versions of) questionnaires being available, which makes it difficult to choose the most suitable instrument. Furthermore, the use of different instruments in different studies and surveys makes comparison of PA levels across countries or studies difficult.

To our knowledge, an overview of the measurement properties of PA questionnaires is lacking. A summary of these findings might be helpful for choosing the best questionnaire available for a specific purpose. Furthermore, a critical assessment of the methodological quality of the studies assessing the measurement properties of PA questionnaires is lacking, while the methodological quality of these studies might be variable. If the methodological quality of a study is poor, the results and conclusions can be seriously biased. For example, wrong conclusions can be drawn from a validation study if no adequate comparison instrument was used. It is therefore important to assess the methodological quality of a study to be confident that the design, conduct, analysis and interpretation of the study is adequate, and to inform about possible bias that might have influenced the results.

In this article, we aim to evaluate and compare the measurement properties of all available self-administered questionnaires measuring PA in adults, using a systematic approach for the literature search, data extraction and assessment of the quality of the studies. This article is one of a series of four articles on measurement properties of PA questionnaires published in Sports Medicine.

1. Methods

1.1 Literature Search

Literature searches were performed in PubMed, EMBASE using ‘EMBASE only’, and in SportDiscus® (complete databases until May 2009) on the topic of self-report questionnaires of PA. Additional papers were identified by manually searching references of the retrieved papers and the authors’ own literature databases.

The full search strategy in PubMed was as follows: (exercise[MeSH] OR ‘physical activity’[tiab] OR motor activity[MeSH]) AND (questionnaire[MeSH] OR questionnaire*[tiab]), and limited to humans. In EMBASE and SportDiscus®, ‘physical activity’ and ‘questionnaire’ were used as free text words and in EMBASE this was complemented with the EMTREE term ‘exercise’.

1.2 Eligibility Criteria

We used the following inclusion criteria:

1. The aim of the study should be to develop or evaluate the measurement properties – i.e. content validity, construct validity, reliability or responsiveness – of a self-report questionnaire.

2. The aim of the questionnaire should be to measure PA, which was defined as any bodily movement produced by skeletal muscles that results in energy expenditure above resting level.[2] PA in daily life can be categorized into occupational, sports, conditioning, household or other activities. Questionnaires were included regardless of the time frame; thus, questionnaires measuring lifetime PA or historical activity were also included.

3. The questionnaire could be used to measure PA in adults in the general population, and was not developed or evaluated in a specific population, such as patients or pregnant or obese participants.

4. The study sample should have a mean age between 18 and 55 years.

5. The article should have been published in the English language.

6. Information on (at least one of) the measurement properties of the self-report questionnaire should be provided. We included information on measurement properties only if it was intentionally collected or calculated to assess the measurement properties of the particular self-report questionnaire. If, for example, correlations between a self-report questionnaire and an accelerometer were presented to assess the validity of the accelerometer (while the self-report questionnaire was used as a gold standard) or if correlations between different PA questionnaires were calculated without one questionnaire considered as the standard, these data were not included in this review.

7. We excluded PA interviews or diaries. We also excluded studies that evaluated the measurement properties of a self-report questionnaire administered in an interview form. Finally, questionnaires measuring physical functioning (e.g. the degree to which one is limited in carrying out activities) and questionnaires asking about sweating in a single question were excluded.

1.3 Selection of Papers

Abstract selection, selection of full-text articles, data extraction and quality assessment were performed by two independent reviewers. Disagreements were discussed and resolved. We retrieved the full-text paper of all abstracts that fulfilled the inclusion criteria and of abstracts that did not contain measurement properties, but in which indications were found that these properties were presented in the full-text paper.

1.4 Data Extraction

We extracted a description of the self-report questionnaires from the included papers, using a standardized data extraction form. Data extracted included (i) the target population for which the questionnaire was developed; (ii) the dimension(s) of PA that the questionnaire intends to measure (e.g. habitual PA); (iii) the parameters of PA that the questionnaire is measuring (i.e. frequency, duration and intensity or activities); (iv) the setting in which PA is being measured (i.e. sport, recreational, transport, occupational/school activities, household activities [including gardening], other); (v) the number of questions; (vi) the recall period that the questions refer to; and (vii) the type and number of scores that were calculated (e.g. total energy expenditure or minutes of activity per day).

1.5 Quality Assessment of the Studies on Measurement Properties

To assess the methodological quality and results of the studies on measurement properties, we used the QAPAQ checklist (see table I for acronym definitions). We developed this checklist specifically for PA questionnaires, based on two recently developed checklists for evaluating the measurement properties of patient-reported outcomes (COSMIN[8]) and self-report health status questionnaires.[33] The QAPAQ is described elsewhere.[29] We extracted and rated the methods and results of all evaluated measurement properties (see sections 1.7–1.9).

Table I
figure Tab1

Table I. Explanation of acronyms or abbreviated names of questionnaires

1.6 Content Validity

No criterion exists to rate whether the content of a questionnaire is relevant and comprehensive for measuring PA. Therefore, we formed our own opinion on content validity. Questionnaires should measure at least duration and frequency of PA, and if the intention was to measure total PA, the questionnaire should cover activities in all settings (work, home, transport, recreation, sport).

1.7 Construct Validity

The more similar the constructs being compared, the more evidence is provided for validity. Comparison with objective measures of PA (doubly labelled water, accelerometers, pedometers) was considered the best level of evidence (Level 1 or 2, depending on the use of the objective data). Constructs that do not really measure current PA (maximal oxygen uptake [V̇O2max], body mass index [BMI], etc.) and comparisons with another questionnaire, a diary or an interview were considered less adequate (Level 3). Depending on the strength of the hypothesized association with the comparison measure, different correlations were considered to be adequate (table II).

Table II
figure Tab2

Table II. Cut-off points for sufficient correlations per dimension of physical activity (PA) measured by the questionnaire, and level of evidence

A positive score was given if the study population consisted of ≥50 participants and the correlation was above the specified cut-off point. If the correlation was below the specified cut-off point, a negative score was given. If the sample size was <50 participants, the score was indeterminate (?).
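The scoring rule above can be sketched as a small helper function. This is a hypothetical illustration of the rule as stated, not code used in the review; the same rule, with its own cut-offs, is applied to the reliability results in section 1.8:

```python
def rate_result(correlation: float, n: int, cutoff: float) -> str:
    """Rate a construct-validity result per the rule described above.

    Hypothetical helper: samples of fewer than 50 participants are
    indeterminate ('?'); otherwise the observed correlation is compared
    against the dimension-specific cut-off point from table II.
    """
    if n < 50:
        return "?"  # sample too small to score
    return "+" if correlation > cutoff else "-"
```

For example, a correlation of 0.55 against a cut-off of 0.50 in a sample of 60 participants scores positive, while the same correlation in a sample of 30 remains indeterminate.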

1.8 Reliability

The time interval between test and retest must have been described, and it should be short enough to ensure that subjects had not changed their PA levels, but long enough to prevent recall of the earlier answers. The optimal time interval depends on the construct to be measured and the recall period of the questionnaire. For measuring PA during the past or usual week or in the past year, a time interval of 1 day to 3 months was considered appropriate. For measuring lifetime PA, a time interval of 1 day to 12 months was considered appropriate.

For reliability, three levels of evidence were formulated:

  • Level 1: an adequate time interval between test and retest and an intraclass correlation coefficient (ICC), Kappa or Concordance.

  • Level 2: an inadequate time interval between test and retest and an ICC, Kappa or Concordance; or an adequate time interval between test and retest and a Pearson/Spearman correlation.

  • Level 3: an inadequate time interval between test and retest and Pearson/Spearman correlation.

An ICC >0.70 was considered acceptable.[34] The use of Pearson or Spearman correlation coefficients was considered inadequate, because these coefficients neglect systematic errors.[35] However, Pearson/Spearman correlations >0.80 would probably result in ICCs >0.70 and were therefore also rated positively, but on a second level of evidence. Pearson or Spearman correlations <0.80 were rated negatively.
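The distinction matters because Pearson or Spearman correlations are insensitive to systematic shifts between test and retest. A minimal sketch (illustrative only; the data and implementation are our own, not taken from any reviewed study) of a two-way ICC for absolute agreement shows this:

```python
import numpy as np

def icc_agreement(test, retest):
    """ICC(2,1), two-way, absolute agreement, for a test-retest design.
    Illustrative implementation via the standard ANOVA decomposition."""
    data = np.column_stack([test, retest]).astype(float)
    n, k = data.shape
    grand = data.mean()
    row_means = data.mean(axis=1)   # per-subject means
    col_means = data.mean(axis=0)   # per-occasion means
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ((data - grand) ** 2).sum() - ss_rows - ss_cols
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# A constant shift between test and retest leaves Pearson r at 1.0
# but pulls the ICC below the 0.70 threshold used in this review.
test = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
retest = test + 15.0            # everyone reports 15 units more at retest
pearson_r = np.corrcoef(test, retest)[0, 1]
icc = icc_agreement(test, retest)
```

Here Pearson r is exactly 1.0 despite every retest score being 15 units higher, while the ICC drops to about 0.69, reflecting the systematic error.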

A positive score was given if the study population consisted of ≥50 participants and the ICC, Kappa, Concordance or Pearson/Spearman correlation was above the specified cut-off point. If the correlation was below the specified cut-off point, a negative score was given. If the sample size was <50 participants, the score was rated as indeterminate (?).

1.9 Responsiveness

Responsiveness is the ability of an instrument to detect change over time in the construct to be measured.[36] It should be considered an aspect of validity in a longitudinal setting. Responsiveness was assessed by comparing changes in the PA questionnaire with changes in other instruments that measure closely related constructs. The same approach as for assessing validity was applied, except that change scores were being compared instead of absolute scores. Depending on the strength of the hypothesized association, different correlations were considered to be adequate.

2. Results

The search resulted in 21 891 hits, of which 260 abstracts were selected. Of the full-text articles with relevant titles and/or abstracts, 166 were excluded. Most of the papers were excluded because the questionnaire was administered in an interview or because no measurement properties of the questionnaire were assessed. Finally, 94 papers on 85 (versions of) questionnaires were included in the review (figure 1). Descriptive information on the questionnaires included in the review is provided in table III.

Fig. 1
figure 1

Fig. 1 Flowchart of literature search and paper selection. 1 One paper appears in both the review for adults and the review for the elderly.

Table III
figure Tab3

Table III. Description of physical activity (PA) questionnaires (Q)

Table III. Contd (figures Tab3A–Tab3G)

2.1 Quality of the Studies

Construct validity was assessed for 77 questionnaires in 85 studies. Of these 77 questionnaires, 16 were validated at Level 1 and an additional 22 questionnaires at Level 2. Comparison measures were often V̇O2max (n = 40), accelerometers (n = 41), heart rate monitors (n = 5), doubly labelled water (n = 7) or pedometers (n = 6) [table IV]. Two of the three questionnaires specifically designed to measure walking were validated against pedometers (Level 1). Surprisingly, appropriate cut-off points for analysing accelerometer data were often not used when assessing time spent in moderate to vigorous PA; instead, total counts were used, which do not discriminate between light, moderate and vigorous PA.

Table IV
figure Tab4

Table IV. Construct validity of physical activity (PA) questionnaires (Q)

Table IV. Contd (figures Tab4A–Tab4I)

Reliability was assessed for 51 (versions of) questionnaires in 49 studies. Only 15 questionnaires were reliability-tested at Level 1 and an additional 36 questionnaires at Level 2 (table V). The most frequently occurring methodological shortcoming was the calculation of Pearson correlations instead of ICCs or Kappas; another frequent shortcoming was an inadequate time interval between test and retest.

Table V
figure Tab5

Table V. Reliability of physical activity (PA) questionnaires (Q)

Table V. Contd (figures Tab5A–Tab5D)

Responsiveness was assessed for only two (versions of) questionnaires, and the quality of these studies was rated as Level 3.

2.2 Qualitative Attributes of the Questionnaires

Altschuler et al.[7] tested whether respondents interpreted the LACE PA questionnaire and the CMH questionnaire as intended. In cognitive interviews, respondents described their thought processes while completing these two questionnaires. The term ‘intensity’ was frequently interpreted as emotional or psychological intensity rather than physical effort. In addition, respondents often counted the same activity more than once, overestimated occupational PA and mistook a list of examples for a definitive list.

We did not find studies in which the content validity of a PA questionnaire was assessed. However, we formed our own opinion on the content of the questionnaires.

Of the 85 (versions of) questionnaires included in this review, 23 had sufficient content validity; that is, they covered all settings relevant to the dimension of PA measured (e.g. for total PA all five settings; for occupational PA only transport and work) and measured duration and frequency (Bharati,[45] EPIC original Questionnaire (Q),[10] EPAQ2,[9] Harvard/College Alumnus Q,[3,51] the long version of the IPAQ,[14] the adapted IPAQ,[54] Kaiser PA Survey,[56] LACE PA Q,[7] Minnesota LTPA Q,[61] Mail Survey of PA,[62] Norman Q,[70] NZPAQ-SF,[21] One-week recall Q,[71] PAFQ,[22] PA History Q,[72] PYTPAQ,[26] Singh Q,[77,78] SQUASH,[32] Historical RWJ questionnaire,[30] NPAQ,[20] Health Insurance Plan of NY,[3] TOQ[31,89] and London PA Q[88]).

2.3 Validation Results

Only the 48 studies that assessed construct validity at Level 1 or 2 are discussed below. Construct validity was assessed by validation against doubly labelled water for seven questionnaires.[16,21,40,81,104,105] In all these studies, the correlation of total energy expenditure assessed with the questionnaire and with doubly labelled water was lower than our criterion of 0.70, with Pearson correlations ranging between 0.31 and 0.58 (table IV).

In 41 studies, construct validity was assessed by validation against accelerometers (table IV). For only one questionnaire, validated in a study with >50 participants, the correlation between accelerometer data and total PA was >0.50 (Suzuki Q[81]).

In an attempt to find out which type of questionnaire performed best, we averaged the correlations found in the 41 studies using accelerometers as the comparison measure. Correlations differed between vigorous and moderate activity, with higher correlations for vigorous activity (r = 0.32 vs 0.22). Also, a higher correlation was found for questionnaires asking about the past week than for those asking about a usual week/usual PA/current PA or about the past year (r = 0.41 vs 0.26 and 0.30, respectively).
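Correlation coefficients are not strictly additive, so a common alternative to simple arithmetic averaging is to pool them via Fisher's z transform. The review does not state its pooling method, so the following is only an illustrative sketch of that alternative:

```python
import math

def mean_correlation(rs):
    """Pool correlation coefficients via Fisher's z transform
    (unweighted). Shown for illustration only; the review reports
    averaged correlations without stating a pooling method."""
    zs = [math.atanh(r) for r in rs]      # r -> z, approximately normal
    return math.tanh(sum(zs) / len(zs))   # average in z, back-transform
```

For the modest correlations reported here the two approaches differ little: pooling 0.0 and 0.5 this way gives roughly 0.27 rather than the arithmetic mean of 0.25.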

Two questionnaires designed for measuring walking were validated against pedometers (Level 1). One scored negative[85] and the other was rated as indeterminate because of a statistical analysis that could not be interpreted.[84] The reliability of 15 versions of PA questionnaires was assessed at Level 1 (table V), and only five showed positive results: the self-administered, short version of the IPAQ on PA in the past 7 days (S7S),[93] the Modified HLAQ,[11,102] the NPAQ[20] and the Bone Loading History Q[86] scored positive on all aspects, and the Kaiser PA Survey[56] scored positive on all aspects except ‘care giving’. The other questionnaires showed mixed results, scored negative on most aspects, or scored indeterminate because of a small sample size.

In addition to the 15 questionnaires for which evidence on Level 1 was available, Level 2 evidence was found for another 36 (versions of) questionnaires. For only six questionnaires, a positive score on Level 2 was given (Modified Baecke [(ARIC) Baecke],[4] Health Insurance Plan of NY Q,[3,31] Lipid Res Clin Q,[3,59] Minnesota LTPA Q,[3] the Minnesota Heart Health Program Q,[3] and the Minnesota Heart Health Program Occupational Q[31]). The other questionnaires showed mixed results or scored negative on most aspects, or scored indeterminate because of a small sample size.

When averaging the results of the reliability studies, no clear differences were found between questionnaires with different recall periods, between different time intervals between test and retest or between sexes. The only difference found was that, on average, the reliability for vigorous activity was higher than for moderate activity.

The responsiveness of a questionnaire was assessed in only two studies,[38,54] and seemed to be poor. The correlation between changes in self-reported PA and changes in supervised activity in a training programme was -0.07 for total energy expenditure and 0.01 for vigorous activity.[38] The correlation of change in PA assessed with an adapted version of the long form of the IPAQ with change in V̇O2max was 0.20 for men and 0.12 for women.[54]

3. Discussion

Although more than 90 papers have been published on the validity or reliability of PA questionnaires, this is the first systematic review of studies assessing the measurement properties of PA questionnaires, in which the results as well as the methodological quality of the individual studies have been taken into account. Our results indicate that the overall methodological quality of the studies could be much improved. Most common flaws were small sample size and inadequate analyses, and for construct validity, comparison measures that were not measuring the same construct.

An important finding of our review was the poor reporting of methods and results of the studies. It was often unclear what dimension of PA the questionnaire was supposed to measure. This made assessing content validity sometimes impossible. Furthermore, it was extremely difficult, if not impossible, to assess whether the same or slightly modified versions of questionnaires were used in some studies, and it was not always clear whether the data were derived from a self-report questionnaire or whether the questionnaire was part of an interview.

For assessing construct validity, it is important to formulate specific hypotheses in advance about the expected correlations between the questionnaire under study and other measures. However, almost none of the studies had formulated such hypotheses. To be able to assign levels of evidence, we formulated hypotheses about the expected strength of the association with each type of comparison instrument. This methodology is not new, and the idea behind it is that, in retrospect, it is always easy and tempting to come up with explanations for the findings and conclude that the questionnaire is valid. In fact, most studies in our review concluded that the questionnaire under study was valid. However, when we applied our criteria we found that these conclusions were overly optimistic in almost all cases.

Reliability was also often poorly assessed. Many studies used long time intervals between test and retest, and in most studies Pearson or Spearman correlation coefficients were calculated instead of ICCs or Kappas. This is partly because we included studies performed many years ago, when Pearson correlation was still an accepted method; nowadays, however, there is consensus that ICCs or Kappas are the preferred methods for assessing reliability.

Only two studies evaluated responsiveness, i.e. the ability of a questionnaire to detect change in PA over time. This is remarkable, given the importance of responsiveness when a questionnaire is used in PA intervention studies. If a questionnaire has poor responsiveness, treatment effects cannot be detected, or can be detected only with large sample sizes. For some questionnaires, the majority of the population scored the highest or lowest possible score (e.g. with the modified CHAMPS[6]). When this happens, there is little opportunity for change, leading to low responsiveness. Although the methodology of assessing responsiveness tends to be less well understood, there is consensus that responsiveness should be considered an aspect of validity in a longitudinal context.[106] While construct validity concerns the validity of a single score, responsiveness concerns the validity of a change score. This means that the same methods used for assessing validity, i.e. stating a priori hypotheses, can be applied to assess the validity of changes in PA scores over time.

We found that correlations between PA questionnaire data and accelerometer data were slightly higher for questionnaires asking about the previous week than for those asking about a usual week. Often, accelerometers were worn during the very week that was captured by the questionnaire, which may explain why higher correlations were found for these questionnaires than for those asking about a usual week or usual PA. Whether questionnaires asking about the previous week are really better at assessing PA, or whether this is a consequence of the testing procedures, remains to be determined.

3.1 Limitations of this Review

As with any other systematic review, it is possible that we missed some relevant papers in our literature search. We only used the search terms ‘questionnaire’, ‘physical activity’, ‘exercise’ and ‘motor activity’ and did not include alternative wordings, such as ‘survey’. However, after checking all references of the relevant papers retrieved in our search, it appeared that very few papers had been missed.

Because of the overwhelming amount of available data, we had to be selective in what to present in this review. First, we chose to limit the review to self-administered questionnaires, realizing that some questionnaires have also been used in other forms, such as interview-administered. With this restriction we ignored some studies on questionnaires that can be either self-administered or used as an interview, and the measurement properties of these questionnaires may differ between the two applications. However, restricting the review to one form of administration made the included studies more homogeneous, which we felt allowed better comparisons across questionnaires without also having to account for the mode of administration. Further, when assessing validity, only correlations with accelerometer data, V̇O2max, BMI and percentage body fat were extracted from the papers, because we felt that, although these are different constructs, these comparison measures are most closely related to the construct measured by the questionnaires. We ignored correlations with, for example, cholesterol or blood pressure, because only a limited correlation with PA can be expected. Lastly, not all scores resulting from the questionnaires could be presented; we often restricted the information to the overall or total PA scores. Data were presented for men and women separately when relevant (i.e. in the case of sex differences).

Interpretation of the results was difficult for some studies, mostly due to poor reporting. Although two reviewers independently extracted data from the papers, interpretation may have been incorrect in some cases. Given the number of studies included in the review, and the number of studies conducted a long time ago, we chose not to contact the authors of the original studies.

Many of the choices made in scoring the quality of the studies lack a strong basis in theory or evidence, simply because there is little available to base these choices on. Others might have chosen different cut-off points for scoring negative or positive on validity or reliability. The same is true for the decisions on what constitutes a sufficient sample size and an appropriate time interval between test and retest. However, readers can decide according to their own insights and draw their own conclusions from the data provided in the tables.

3.2 Recommendations for Choosing a Questionnaire

Current US recommendations state that every adult should participate in 2.5 hours a week of moderate-intensity or 75 minutes a week of vigorous-intensity aerobic PA, or in an equivalent combination of moderate- and vigorous-intensity activity. Aerobic activity should be performed in episodes of at least 10 minutes, preferably spread throughout the week. Based on these recommendations, questionnaires for measuring total PA should at least measure duration and frequency, and measure PA in all settings (work, home, transport, recreation, sport) to have sufficient content validity. Older questionnaires in particular, such as the Baecke questionnaire,[41] do not fulfil this criterion, because insight into what PA for health should entail has changed over time.

Of course, some researchers will need a PA questionnaire not for measuring total PA but for other purposes, and different aspects of PA might be relevant to their study. For instance, when looking at bone health, energy expended in cycling or swimming might be less important, but carrying loads would be of interest. There will therefore not be one questionnaire suitable for all purposes or target groups. The choice of a questionnaire should always start with defining the purpose of the study and of the PA measurement, after which the content validity of candidate questionnaires should be judged. Only then do construct validity and reliability need to be considered.

In this review, the content of 23 questionnaires was deemed appropriate for the dimension of PA they were intended to measure (Bharati,[45] EPIC original Q,[10] EPAQ2,[9] Harvard/College Alumnus Q,[3,51] the long version of the IPAQ,[14] the adapted IPAQ,[54] Kaiser PA Survey,[56] LACE PA Q,[7] LTPA Q,[61] Mail Survey of PA,[62] Norman Q,[70] NZPAQ-SF,[21] One-week recall Q,[71] PAFQ,[22] PA History Q,[72] PYTPAQ,[26] Singh Q,[77,78] SQUASH,[32] Historical walking, running and jogging questionnaire,[30] NPAQ,[20] Health Insurance Plan of NY,[3] TOQ[31,89] and London PA Q[88]). Unfortunately, for only 13 of these 23 questionnaires were both reliability and construct validity studied (Bharati,[45] EPIC original Q,[10] EPAQ2,[9] Harvard/College Alumnus Q,[3,51] Kaiser PA Survey,[56] the long version of the IPAQ,[14] Norman Q,[70] One-week recall Q,[71] PYTPAQ,[26] Singh Q,[77,78] SQUASH,[32] Health Insurance Plan of NY,[3] TOQ[31,89]).

Of the 23 questionnaires with sufficient content validity, the Kaiser PA Survey,[56] the Godin Q,[50] the NPAQ,[20] Bharati Q,[45] the LUS version of the IPAQ,[14] One-week recall Q[71] and the Health Insurance Plan of NY[3] scored well on reliability at Level 1 or 2. Construct validity was sufficient according to our criteria only for the L7S version of the IPAQ in one study,[92] although the validity correlation for the Kaiser PA Survey[56] was 0.49, which is only just below the (arbitrarily chosen) cut-off point of 0.50.

In recent studies, the IPAQ seems to be used most often and it is by far the most widely validated questionnaire at present.[14,91-95,97,107] Reliability of the IPAQ was not shown consistently within or between studies, although the short version for the past 7 days (S7S) and the long version for a usual week (LUS) seemed to perform best. We therefore recommend additional reliability studies of the IPAQ. Validity of the IPAQ seems questionable. First, content validity of the short forms seems limited because they do not discriminate between different settings. The long form, which does discriminate between the five settings, therefore has better content validity, but it was reported to be “too boring and repetitive” and too long for routine surveillance.[14] The construct validity of both the short and the long forms varied widely, but was mostly below our criteria. Of the self-administered IPAQ forms, a sufficient correlation with an accelerometer was found only for the L7S – 0.52 in Finland[14] and 0.55 in Sweden[92] – and for the S7S in the US in men only.[95] Discrimination of the IPAQ between groups of people with different activity levels as measured with DLW[94] was questionable, although differentiation between groups with different fitness levels was adequate.[91] Therefore, we feel that additional well designed studies on the measurement properties of the IPAQ, with specific attention to responsiveness, are required.

3.3 Recommendations for Further Research

For future studies, we recommend choosing from the abovementioned 23 questionnaires that we identified as having sufficient content validity, and validating those further for reliability, construct validity and especially responsiveness.

The results of this review indicate that one study on the validity and reliability of a questionnaire is not enough. A number of questionnaires were validated in more than one study, and without exception the results were conflicting: the questionnaires showed sufficient validity in one study but not in another. Also, in the large international study on the validity and reliability of the IPAQ, large differences were found between countries. This indicates that it is important for researchers to assess the measurement properties of a questionnaire in their own language and in their own target population. As the majority of the studies on measurement properties of PA questionnaires have been conducted in the US, it remains to be seen whether the results can be generalized to other countries. We therefore strongly recommend that researchers carefully assess the measurement properties of a questionnaire in their own target group.

Although PA questionnaires are frequently used for the evaluation of the effects of intervention, surprisingly little attention has been paid to the responsiveness of these questionnaires. A prerequisite for detecting differences in PA after an intervention would be that the questionnaire is responsive to change. The two studies assessing responsiveness did not show positive results in that regard.

Finally, more attention should be paid to reporting on studies assessing measurement properties of PA questionnaires, since, for instance, it was often unclear what questionnaire was used and for what purpose the questionnaire was intended. The QAPAQ might be a useful tool when reporting on measurement properties.

4. Conclusions

Based on our review of the literature concerning measurement properties of questionnaires measuring PA, no conclusion can be drawn regarding the best questionnaire at the moment. Researchers should determine which questionnaire would fit their purposes best regarding the content of the questionnaire. Questionnaires with good content validity need to be validated in well designed studies and in different countries. Data on the responsiveness of PA questionnaires are urgently needed for the use of questionnaires in intervention studies.