Use of administrative medical databases in population-based research
  1. Natalie Gavrielov-Yusim,
  2. Michael Friger
  1. Department of Epidemiology and Biostatistics, Ben-Gurion University of the Negev, Beer-Sheva, Israel
  1. Correspondence to Natalie Gavrielov-Yusim, Department of Epidemiology and Biostatistics, Ben-Gurion University of the Negev, P.O.B 653, Beer-Sheva 8410501, Israel; nataliag{at}


Administrative medical databases are massive repositories of data collected in healthcare for various purposes. Such databases are maintained in hospitals, health maintenance organisations and health insurance organisations. Administrative databases may contain medical claims for reimbursement, records of health services, medical procedures, prescriptions and diagnostic information. Such systems can therefore provide a valuable variety of clinical and demographic information, together with an ongoing process of data collection. In general, data collection in these databases is not originally planned for research purposes. Nonetheless, administrative databases may be used as a robust research tool. In this article, we address public health research that employs administrative data. We discuss the biases and limitations of such research, as well as other important epidemiological and biostatistical key points specific to administrative database studies.

  • Cohort studies


Observational epidemiologic studies seek to acquire data on a sufficiently large and representative sample of subjects, which can be analysed to provide meaningful, valid and generalisable findings. In most cases, a field study that fulfils all these criteria requires considerable infrastructure, including participants’ recruitment, examination and follow-up, and the storage of specimens or other study material. Since scientific research is often limited in resources, cost-effective alternatives to traditional observational studies are needed.

Administrative databases are massive repositories of data collected in healthcare for various purposes. Such databases are mainly maintained in hospitals, health maintenance organisations and health insurance organisations. Administrative data may include claims for reimbursement, records of health services, medical procedures, prescriptions and diagnostic information. Administrative databases therefore provide a variety of already stored data with an ongoing collection process. In addition, they may provide an infrastructural basis for new data collection that was not originally planned in the system, with minimal investment of logistics and time.

In general, information gathering in such databases is not intentionally planned for research purposes. Most administrative databases were originally set up as monitoring tools for health policymakers. Their primary use was to track healthcare systems’ activity from an administrative and financial point of view. Consequently, administrative databases differ from other medical data repositories, such as electronic health records. Whereas the former are chiefly intended to store financial and administrative information for medical insurers and providers, the latter are mainly used by clinicians to document patients’ clinical condition. In the following sections we explain which research fields can benefit from administrative databases and what the key points and pitfalls of such studies are.

Research fields making use of administrative databases

In the last two decades many epidemiology-related fields have adopted administrative databases as their main choice of data source (see examples in table 1). Pharmacoepidemiology is one of the fields that seem to have greatly benefitted from this trend. Pharmacoepidemiology deals with low-frequency or long-term adverse events of drugs and vaccines and requires considerable, sometimes enormous, sample sizes, as well as extended follow-up periods. In general, case-control or cohort designs are applicable in this field. However, case-control studies in pharmacoepidemiology have been strongly criticised,1 due to their high susceptibility to selection and recall biases. Moreover, case-control studies do not estimate the absolute risk or the incidence of adverse outcomes in the population.2 Using administrative databases allows longitudinal design, incidence calculation, large sample size and robust power even for very rare events, along with a relatively short and inexpensive study design.

Table 1

Examples of medical databases used in research

Two other research types that can benefit from administrative databases are large epidemiologic surveys and epidemiologic surveillance, both of which require massive data collection. Surveys, also called cross-sectional studies, reveal the point prevalence of risk factors and outcomes in the population. When a research question requires investigation of a disease trend, the survey must be repeated at several points in time. This demands an elaborate setup, including facilities, as well as the recruitment and training of a dedicated team. Because they are costly in both time and money, surveys are usually applied to the most prevalent health problems of greatest public interest (eg, obesity, smoking, education, violence).3–5 Accordingly, it may be difficult to justify dedicating similar resources to surveying lower-priority outcomes and diseases. Furthermore, survey results may be strongly affected by self-report and non-participation, which can lead to information and selection bias, respectively. Another major shortcoming of point-prevalence studies is that they detect associations between epidemiologic factors but do not reveal causal relationships between them. Consecutive point-prevalence studies may provide some insight into causality. However, large surveys, as a rule, do not collect data from the same participants and instead cover a different representative sample of the population each time. Therefore, this longitudinal perspective is ecologic and not individual-based. Using administrative databases as a data source can resolve most of these issues, since they provide a variety of medical and personal data with individual follow-up and do not require the personal involvement of study participants.

Active surveillance is used to study dynamically changing epidemiologic trends based on longitudinal data. This design requires active, ongoing data collection at prespecified time intervals. It also demands considerable resources and is usually undertaken for a selected group of specially prioritised epidemiologic issues. Most often this tool is used to investigate trends in infectious diseases.6–10 Administrative databases can provide a suitable cost-effective alternative to these resource-demanding studies.

Although administrative databases have many advantages serving epidemiologic research well, they also have limitations which must be considered in the process of planning, executing and interpreting research findings. The next sections describe these limitations and the ways they can be avoided or treated.

Information bias in administrative databases research

In general, administrative data research suffers from the same biases as field studies. These biases may be broadly classified as belonging to either the information bias or the selection bias family. This section deals with information bias, which arises from imperfect data collection within an administrative database and is mainly expressed as misclassification of the research exposure, outcome or both.21

To simplify, we will first address the issue of outcome misclassification. The general formulation of this problem is as follows: when using administrative data to study a certain disease as an outcome, how do we make sure that the condition documented in patients’ records is indeed a suitable representation of the disease under investigation? Misclassification of the outcome may result from erroneous or unclear clinical documentation, as well as from misdiagnosis. A straightforward example is the identification of influenza cases using administrative data. The clinical definition of influenza-like illness is very wide and may include a range of ailments caused by various respiratory or other viruses.22 To maximise validity in such settings, meticulous case ascertainment must precede data analysis. Specifically, in the influenza example, in addition to retrieving explicit ‘influenza’ diagnoses, other conditions listed in the standardised case definition of influenza, as defined by WHO,22 must be retrieved and used in the outcome definition.
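The difference between retrieving only explicit diagnoses and applying a wider case definition can be sketched in a few lines of Python. This is an illustration only: the record layout is hypothetical, and the code lists are placeholder assumptions chosen for the example, not the WHO case definition, which would have to be translated into codes and validated for each database.

```python
# Hypothetical diagnosis records: (patient_id, ICD-10 code).
records = [
    ("p1", "J10.1"),  # influenza, seasonal virus identified
    ("p2", "J11.1"),  # influenza, virus not identified
    ("p3", "J06.9"),  # acute upper respiratory infection, unspecified
    ("p4", "R50.9"),  # fever, unspecified
    ("p5", "I10"),    # hypertension (unrelated)
    ("p1", "R05"),    # cough
]

# Assumption: explicit influenza diagnoses only.
NARROW_PREFIXES = ("J09", "J10", "J11")
# Assumption: illustrative extra codes standing in for a broader case definition.
BROAD_PREFIXES = NARROW_PREFIXES + ("J06.9", "R50.9", "R05")


def ascertain_cases(records, code_prefixes):
    """Return the set of patients with at least one code matching the definition."""
    return {pid for pid, code in records if code.startswith(code_prefixes)}


narrow_cases = ascertain_cases(records, NARROW_PREFIXES)
broad_cases = ascertain_cases(records, BROAD_PREFIXES)
print(sorted(narrow_cases), sorted(broad_cases))
```

Comparing the two sets gives a first impression of how sensitive the outcome count is to the chosen definition, which is worth reporting before any analysis is run.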

Exposure misclassification, for example, misclassified use of medication, has been previously addressed in the literature.23 Medication use is improperly documented in most administrative databases if the drug purchase is not reimbursed. Such is the case with privately purchased medications, drugs under restrictive coverage policies23 and over-the-counter drugs.20 Another challenge in pharmaceutical exposure is the issue of treatment adherence and compliance. Pharmaceutical databases provide information on whether a patient received a drug prescription or purchased the drug. However, this gives little indication of whether the patient actually ingested the drug and in what dosage.20 These issues may be resolved either by using an additional data source or by running a pilot study on a group of participants, from whom the relevant information may be obtained through personal interviews. Provided that the pilot includes a sufficiently large and representative group, it should provide a sufficient basis for assumptions about missing data in the entire dataset.

Information bias may often arise when using administrative databases in countries where universal public and private healthcare systems coexist. In Israel, where all citizens are covered by national insurance, there is a growing trend of insured persons preferring private clinics over public medicine.24 However, private visits are not always registered in the main healthcare system, depending on the type of clinic and consultation. Consequently, retrieving data on medical visits will lead to an erroneous measurement of the number of visits: even though the participants are sampled correctly, a certain subset of them supplies partial or flawed information.

The challenges listed here may not always be easy to resolve, since administrative databases are not initially designed for research. The quality of information in these systems depends greatly on the specific incentives for data reporting, the most prevalent of which is financial. In other words, information in administrative databases is most accurately represented when it has important administrative or financial implications. As a result, expensive medical procedures are documented better than less costly, but nonetheless clinically important, health interventions. Bearing this in mind indicates which studies may be performed based on administrative data alone, and which studies require complementary data sources, such as a pilot-scale field investigation performed on a subset of study participants. Such a pilot will provide insight into the validity of the variables retrieved from the database and their usefulness in the investigation.

Alternatively, researchers may try to construct surrogate variables, which validly represent the study outcome or exposure but, unlike it, have stronger administrative or financial implications and therefore a better chance of being well represented in the database. For example, the study population may sometimes be captured more easily and accurately using the number of patients ‘treated for the disease’ (data retrieved from a reimbursed prescription database) rather than the number of ‘diagnosed patients’ (data retrieved from administratively reported diagnoses). Depending on the purpose and process of diagnosis reporting, these data may vary by reporting clinic, year and type of healthcare insurer and provider. However, the documentation of drug purchase and reimbursement is fairly stable and uniform. It is important to stress that this method of variable definition should always be preceded by a preliminary validation study, which must demonstrate the degree of correlation between the variable and its surrogate.
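A preliminary validation study of this kind can be summarised with standard agreement measures. The sketch below is one possible approach, not a prescribed method: it assumes a hypothetical pilot sample in which each patient has both a reference-standard disease status (for example, from chart review) and the surrogate flag ('treated for the disease'), and it reports sensitivity, specificity, positive predictive value and Cohen's kappa. All names and counts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    sensitivity: float
    specificity: float
    ppv: float
    kappa: float


def validate_surrogate(reference, surrogate):
    """Compare a surrogate flag with a reference-standard flag, patient by patient."""
    pairs = list(zip(reference, surrogate))
    tp = sum(1 for r, s in pairs if r and s)
    fn = sum(1 for r, s in pairs if r and not s)
    fp = sum(1 for r, s in pairs if not r and s)
    tn = sum(1 for r, s in pairs if not r and not s)
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n**2
    return ValidationResult(
        sensitivity=tp / (tp + fn),
        specificity=tn / (tn + fp),
        ppv=tp / (tp + fp),
        kappa=(observed - expected) / (1 - expected),
    )


# Hypothetical pilot of 100 patients: 40 truly diseased, 60 not.
reference = [True] * 40 + [False] * 60
surrogate = [True] * 36 + [False] * 4 + [True] * 3 + [False] * 57
result = validate_surrogate(reference, surrogate)
print(result)
```

Which threshold counts as acceptable agreement is a judgement for the investigators and depends on how the surrogate will be used in the analysis.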

This issue is further complicated by the fact that clinical guidelines and definitions may change, as may the clinical coding system (eg, ICD-9 codes replaced by ICD-10 codes). Such complications cannot be avoided; the best remedy is to state accurately in the methods section the coding system and the definitions used for case identification in the study. In view of this problem, much recent scientific work on administrative databases has been dedicated to developing uniform clinical algorithms intended to identify patients with a certain diagnosis with maximal accuracy.25–27

Another type of information bias detected in many database studies, and thoroughly discussed in this context by Prof Suissa,12 ,28 is called ‘immortal time bias’. As explained in the referenced publications, this bias occurs because a certain initial interval of the follow-up period is erroneously classified as exposed while in fact being unexposed. This interval, called immortal time, therefore adds guaranteed survival time to the exposed group and systematically distorts causal associations. The bias inflates the survival of the treated group and overestimates the protective effect of treatment.28 Although this bias is not intrinsic to administrative data studies, it often appears in them. Several solutions have been proposed for dealing with this problem. First, immortal time bias can be avoided by carefully defining the follow-up period during data retrieval from the database. Specifically, the definition of index time for the exposed and unexposed cohorts must be equivalent.12 Correcting for immortal time bias is also possible during the statistical analysis, using person-time modelling in which the immortal period is classified as unexposed.28 Alternatively, a Cox proportional hazards model may be used, with the exposure modelled in a time-dependent manner.
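The person-time correction can be illustrated with a small worked example. The sketch below uses four invented patients followed from cohort entry (day 0); `first_rx_day` is the hypothetical day of first prescription (`None` if never treated). It compares a naive analysis, in which all follow-up of ever-treated patients counts as exposed, with a corrected one, in which the time before the first prescription is counted as unexposed. The data and field names are assumptions made for illustration.

```python
def person_time_rates(patients, correct_immortal_time):
    """Return (exposed_rate, unexposed_rate) in events per person-day."""
    time = {"exposed": 0.0, "unexposed": 0.0}
    events = {"exposed": 0, "unexposed": 0}
    for p in patients:
        first_rx, end, event = p["first_rx_day"], p["end_day"], p["event"]
        if first_rx is None:
            # Never treated: all follow-up is unexposed.
            time["unexposed"] += end
            events["unexposed"] += event
        elif correct_immortal_time:
            # Time before the first prescription is unexposed (immortal) time.
            time["unexposed"] += first_rx
            time["exposed"] += end - first_rx
            events["exposed"] += event
        else:
            # Naive analysis: the whole follow-up is misclassified as exposed.
            time["exposed"] += end
            events["exposed"] += event
    return events["exposed"] / time["exposed"], events["unexposed"] / time["unexposed"]


patients = [
    {"first_rx_day": 100, "end_day": 300, "event": 1},
    {"first_rx_day": 200, "end_day": 400, "event": 0},
    {"first_rx_day": None, "end_day": 150, "event": 1},
    {"first_rx_day": None, "end_day": 250, "event": 1},
]

naive_exposed, naive_unexposed = person_time_rates(patients, correct_immortal_time=False)
fixed_exposed, fixed_unexposed = person_time_rates(patients, correct_immortal_time=True)
naive_rate_ratio = naive_exposed / naive_unexposed
corrected_rate_ratio = fixed_exposed / fixed_unexposed
print(round(naive_rate_ratio, 3), round(corrected_rate_ratio, 3))
```

In this toy dataset the naive rate ratio is far below the corrected one, which is the pattern described above: misclassified immortal time makes the treatment look more protective than the data support.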

Area-level data

Area-level data, otherwise named group-level or aggregate data, are an ecologic type of information collected for individuals in administrative databases. Some data are provided only in aggregated form in order to protect patients’ privacy. In particular, area-level data are used for information of a personal character, such as income, race and ethnicity. Aside from ethical and privacy issues, this type of personal data is not always practically attainable.29

To explain the nature of area-level data, we will use a patient's personal income as an example. This parameter is not routinely collected in administrative databases. However, patients’ socioeconomic status (SES) may be inferred from the general socio-demographic composition of their area of residence. The method of data imputation, which is used to produce an SES proxy based on patients’ area of residence, is called ‘geocoding’.29 This technique has been proposed and practiced for imputation of ethnicity and SES in the administrative databases.29 ,30 Essentially, geocoding is used to link between two sets of data. The first one is the list of patients and their addresses, derived from the administrative database. The second is census-derived information, such as rates of poverty, levels of education and employment, ethnic and racial composition, on geographically defined areas of residence. Based on the combination of these data, the patients’ SES is inferred. The smaller the chosen area of residence is (city, zip code area, census tract, neighbourhood), the more homogenous it is in terms of its sociodemographic composition.
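The linkage step described above can be sketched as a simple lookup. The example assumes that each patient's address has already been mapped to a hypothetical `area_code`, and that a census table keyed by the same code is available. The field names, the poverty rates and the 20% cut-off are illustrative assumptions, not values taken from the cited studies.

```python
# Hypothetical census-derived table, keyed by area code.
census_by_area = {
    "A01": {"poverty_rate": 0.31, "college_educated": 0.12},
    "A02": {"poverty_rate": 0.08, "college_educated": 0.47},
}

# Hypothetical patient list derived from the administrative database.
patients = [
    {"patient_id": 1, "area_code": "A01"},
    {"patient_id": 2, "area_code": "A02"},
    {"patient_id": 3, "area_code": "A99"},  # area missing from the census table
]


def assign_area_ses(patients, census_by_area, poverty_cutoff=0.20):
    """Attach an area-level SES proxy to each patient record."""
    enriched = []
    for patient in patients:
        area = census_by_area.get(patient["area_code"])
        if area is None:
            ses = None  # cannot be inferred; handle as missing data
        elif area["poverty_rate"] >= poverty_cutoff:
            ses = "low"
        else:
            ses = "high"
        enriched.append({**patient, "ses_proxy": ses})
    return enriched


enriched_patients = assign_area_ses(patients, census_by_area)
print(enriched_patients)
```

Every patient in the same area receives the same proxy value, which is exactly why the choice of area size matters for the homogeneity of the inferred SES.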

Area-level data are considered to be a major limitation of administrative database studies, mainly due to the possibility of ecological fallacy and risk factor misclassification introduced into the analysis by aggregate statistics.31 Nonetheless, in the absence of individual-level data, aggregate data are considered an acceptable and valid proxy.32 Moreover, in certain study types, area-level information is more useful and valuable than individual data. For example, in studies focusing on health disparities, area-level characteristics provide not only an estimate of participants’ SES but also the sociodemographic context in which they reside. This type of information includes the concentration of poverty in the given neighbourhood, the accessibility of medical services and other environmental attributes that cannot be reduced to an individual level.32–34 Additionally, statistical methodology that systematically mitigates the ecologic fallacy has been proposed by several authors.35 ,36 These methods combine ecological data with small samples of individual-level exposures and outcomes. The biggest shortcoming of this methodology is that linking ecological and individual data may be very challenging in practice.35 ,36 Currently, attempts are being made to improve this methodology and facilitate its application.

In the context of area-level data it is also important to mention the significance of a multilevel analytical approach. Multilevel or hierarchical modelling allows the simultaneous analysis of the higher-level and lower-level data units, such as area-level and individual variables, respectively.37 Administrative databases provide an abundance of information that differs by its type and nature. In order to model such data correctly, it is important to distinguish between the group and individual sources of variability. Models that combine individual variables (eg, patients, students) nested within group variables (eg, neighbourhoods, schools), should be constructed using a multilevel approach. For example, while modelling the variation of treatment compliance in the population, patient-level and physician-level (or clinic-level) variation should be accounted for.
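As a minimal sketch of this idea, using our own notation rather than that of the cited reference, a two-level random-intercept model for an outcome measured on patient i nested within clinic (or area) j can be written as:

```latex
\begin{aligned}
y_{ij} &= \beta_0 + \beta_1 x_{ij} + \gamma z_j + u_j + \varepsilon_{ij}, \\
u_j &\sim \mathcal{N}(0, \sigma_u^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2), \\
\rho &= \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\varepsilon^2}.
\end{aligned}
```

Here x_{ij} is an individual-level covariate, z_j is a group-level covariate, u_j is the group-specific random intercept and \varepsilon_{ij} is the individual-level residual. The intraclass correlation \rho is the share of the total variance attributable to the group level; a value clearly above zero suggests that ignoring the grouping would understate the uncertainty of the estimates.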


One of the common claims made against administrative data concerns the generalisability, or external validity, of study results. Researchers often believe that, in order to be considered credible and meaningful, cohort study results must be applicable to large geographically defined populations (ie, residents of the country from which the study group was drawn). However, this demand is somewhat misleading. Geographical residence is only one of the many characteristics by which study populations may be defined. Conceptually, ethnicity or SES may serve as population-defining factors as well. As long as the internal validity of the study has been optimised, the results of the analysis will be valid and applicable to the population under investigation. Accordingly, if researchers are aware of the characteristics defining the population captured by their databases, there should be no difficulties or mistakes in the interpretation of findings. Therefore, prior to conducting research on administrative data, investigators should establish the sociodemographic composition of the population described by their database and carefully define the group to which the study conclusions will apply.

Statistical issues in administrative databases research

As previously mentioned, administrative databases provide massive sample sizes. This section focuses on the specific features of statistical inference in large datasets. The first methodological issue characteristic of large samples is related to the Hosmer–Lemeshow (HL) test, which is used to assess goodness-of-fit in logistic regression.38 This test divides the sample into a number of groups, with most statistical packages using 10 groups as a default. The null hypothesis of the test states that the model fits the data. Significant departures from the tested model indicate that the regression is poorly fitted. As with any statistical test, the power of the HL procedure increases with sample size. Therefore, in large samples even small departures from the proposed model will be considered significant, demanding rejection of the regression model. In this case, the effect of high power is undesirable, since the decision to reject a model should not depend on sample size. In view of this problem, the authors of the original HL test have recently provided a set of recommendations, which explain how to account for sample size in the HL test and how to eliminate its influence on the assessment of model fit.39
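The dependence on sample size can be made concrete with a deterministic sketch. Instead of simulating data, the code below replaces the observed event counts in each of 10 groups with their expected values under a hypothetical "true" risk that is only half a percentage point above the model's prediction. This isolates the systematic part of the HL statistic and ignores sampling noise, so it is an approximation rather than the test itself. The probabilities are invented, and 15.507 is the 5% critical value of the chi-square distribution with 8 degrees of freedom (10 groups minus 2).

```python
CHI2_CRITICAL_8DF = 15.507  # 95th percentile of chi-square with 8 degrees of freedom


def hl_systematic_component(total_n, model_probs, true_probs):
    """HL-type statistic with observed counts replaced by their expected values."""
    group_n = total_n / len(model_probs)  # equal-sized groups
    statistic = 0.0
    for p_model, p_true in zip(model_probs, true_probs):
        expected = group_n * p_model
        observed = group_n * p_true  # expectation of the observed event count
        statistic += (observed - expected) ** 2 / (group_n * p_model * (1 - p_model))
    return statistic


# Hypothetical predicted risks per group and a slightly miscalibrated truth.
model_probs = [0.05 * k for k in range(1, 11)]
true_probs = [p + 0.005 for p in model_probs]

results = {
    n: hl_systematic_component(n, model_probs, true_probs)
    for n in (1_000, 10_000, 100_000, 1_000_000)
}
for n, stat in results.items():
    verdict = "reject" if stat > CHI2_CRITICAL_8DF else "do not reject"
    print(n, round(stat, 2), verdict)
```

The same small miscalibration contributes a quantity that grows in proportion to the sample size, so a model that passes at a thousand observations is rejected at a hundred thousand even though its fit is unchanged.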

Another subject that has to be considered when working with large datasets is statistical significance and its interpretation. The sole use of statistical significance in hypothesis testing and interpretation of results has been criticised in the past.40 ,41 At present, there is a consensus that clinical relevance must be demonstrated along with the statistical significance of an association. However, in large-scale database studies, this issue is even more noteworthy than in clinical field research. In general, given a large enough dataset, nearly all comparisons may yield statistically significant differences, even those of the lowest magnitude. Therefore, using statistical significance as a discriminatory factor in model construction becomes impractical in many cases. As a result, the clinical relevance of the effect size is extremely important in outcome modelling within large datasets, and may sometimes replace altogether the bivariate pretesting of risk factors against the outcome. Since clinical importance cannot always be prespecified, several methodological options have been designed to help investigators assess and define it.42–44
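A short calculation shows how this plays out. The sketch below applies a standard two-sided z-test for two proportions to a hypothetical difference of 0.2 percentage points (10.2% versus 10.0%) at two sample sizes. The proportions and group sizes are invented, and whether such a difference matters clinically is a separate question that the p value cannot answer.

```python
import math


def two_proportion_p_value(p1, p2, n_per_group):
    """Two-sided p value of a pooled z-test for two equally sized groups."""
    pooled = (p1 + p2) / 2
    standard_error = math.sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z = (p1 - p2) / standard_error
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


p_small_sample = two_proportion_p_value(0.102, 0.100, n_per_group=1_000)
p_large_sample = two_proportion_p_value(0.102, 0.100, n_per_group=1_000_000)
print(p_small_sample, p_large_sample)
```

The effect size is identical in both calls; only the sample size changes, yet the difference moves from clearly non-significant to highly significant.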

Confounding in administrative databases research

Confounding in administrative database research is largely similar to that found in other epidemiologic investigations. Nonetheless, there is an issue related to residual confounding that deserves special attention in the context of database studies. Owing to their large samples, the statistical estimates of association derived from database studies tend to have very narrow CIs. Theoretically, this indicates high precision of the revealed associations. However, a dataset retrieved from an administrative database often lacks covariates that may be critical for model adjustment in a given investigation. Although administrative data may hold valuable information on the main risk factors and outcomes of the investigated subject, they will not necessarily provide all the required confounders. In administrative databases, where data collection is by definition not fine-tuned for any specific research, we use whatever data are available rather than the data that are required. In such cases, statistical models may be insufficiently adjusted and suffer from considerable residual confounding. This combination of highly precise but confounded results poses a special hazard of biased interpretation of the findings. Sometimes surrogate variables found in administrative databases may effectively substitute for the missing covariates. For example, health-mindedness may be represented by a combination of participants’ use of preventive medical services, such as immunisation, participation in community health programmes, visits to a dietician and so on.

Data linkage in administrative database research

As previously mentioned, many of the methodological issues in database studies may be resolved by joining administrative data to another data source by means of record linkage. Such data linkage creates a more comprehensive and integrated dataset. In some studies, this step is not merely beneficial but crucial. For instance, in the USA, the low-income elderly and disabled may be covered by both Medicare and Medicaid. Thus, data on some outcomes and exposures have to be obtained from both sources. Record linkage must be performed at the individual level while preserving anonymity. It requires the presence of personal identifiers in the original databases, which must be removed after linkage. In some databases, for example, in the Nordic countries or in Israel, this task is fairly straightforward, because patients are registered using a unique personal identification number, which every citizen receives at birth or upon immigration.20 In databases that do not employ a uniform individual identifier, this task is much more challenging and may require an elaborate linkage algorithm.
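For the straightforward case of a shared unique identifier, the linkage and de-identification steps can be sketched as follows. The two sources, their field names and the records are hypothetical, and the example assumes at most one record per person in each source; real linkage also has to handle duplicates, identifier errors and the governance rules of the data custodians.

```python
# Hypothetical extracts from two sources sharing a national identifier.
claims = [
    {"national_id": "111", "diagnosis": "J10"},
    {"national_id": "222", "diagnosis": "I10"},
]
pharmacy = [
    {"national_id": "111", "drug": "oseltamivir"},
    {"national_id": "333", "drug": "metformin"},
]


def link_and_deidentify(source_a, source_b, id_field="national_id"):
    """Join two sources on a shared identifier, then drop the identifier."""
    index_b = {rec[id_field]: rec for rec in source_b}
    linked = []
    for rec in source_a:
        match = index_b.get(rec[id_field])
        if match is None:
            continue  # person not present in both sources
        merged = {**rec, **match}
        del merged[id_field]  # remove the personal identifier after linkage
        merged["study_id"] = len(linked) + 1  # arbitrary study-specific number
        linked.append(merged)
    return linked


linked_records = link_and_deidentify(claims, pharmacy)
print(linked_records)
```

Only people present in both sources are kept here (an inner join); a study that needs everyone from one source would keep unmatched records and treat the missing fields as missing data.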

To summarise, administrative databases may serve as a potent tool in public health research. Being aware of the pitfalls specific to this data source will help researchers attain a valid and effective study design. Additional research is required to adapt epidemiologic and statistical methodology to administrative database investigations.

What is already known on this subject

  • Nowadays, in the era of computerisation, most systems, medical and others, are being set up for digital information collection and storage.

  • Due to this data availability, using massive databases in epidemiological research is becoming increasingly popular.

  • The usefulness of administrative databases in epidemiological study has been recognised and demonstrated during the past years.

What this study adds

  • We attempted to construct a framework of epidemiological and statistical methodological highlights specific to database research.

Policy implications

  • Using this framework may help investigators to avoid the common pitfalls of administrative database research and benefit from the many advantages that this tool can offer.



  • Funding This work was supported by a stipend from the Israel National Institute for Health Policy Research.

  • Contributors NG-Y conceptualised, designed, drafted the initial manuscript and approved the final manuscript as submitted. MF critically reviewed and revised the manuscript, and approved the final manuscript as submitted. Both authors are responsible for the overall content of the manuscript.

  • Competing interests None.

  • Provenance and peer review Not commissioned; externally peer reviewed.
