The volume and velocity of data are growing rapidly and big data analytics are being applied to these data in many fields. Population and public health researchers may be unfamiliar with the terminology and statistical methods used in big data. This creates a barrier to the application of big data analytics. The purpose of this glossary is to define terms used in big data and big data analytics and to contextualise these terms. We define the five Vs of big data and provide definitions and distinctions for data mining, machine learning and deep learning, among other terms. We provide key distinctions between big data and statistical analysis methods applied to big data. We contextualise the glossary by providing examples where big data analysis methods have been applied to population and public health research problems and provide brief guidance on how to learn big data analysis methods.
- public health
- research methods
Big data refers to complex and large amounts of information.1 Big data is massive in volume and collected from a variety of sources including mobile devices, medical databases, satellite images, genome sequences, video and social media feeds. An important feature of big data is that most of this information was not available or did not exist a decade ago.1 The amount of data available is currently at an all-time high and is projected to continue growing exponentially.
Analysis techniques for big data have been applied in healthcare fields and may provide insight for exposure and health outcome measurement. Despite considerable use of big data in clinical and healthcare settings, academic work using big data in the field of population and public health has been limited.
There is an existing published glossary on big data; however, this glossary is highly technical and has limited application to population and public health.2 Big data and associated analysis techniques hold considerable promise for population health research. However, to date, population health research has been a late adopter of big data, perhaps owing in part to a limited understanding of key concepts and terms.
The five Vs of big data
The term big data appears to be self-explanatory, but there are multiple characteristics which make big data unique. Doug Laney3 is credited as the first to define big data using the ‘three Vs of Big Data’: volume, velocity and variety. As Gandomi and Haider4 state, big data is not only about the size, it is also about the speed at which it accumulates and the number of sources from which data can be gathered. With the increase in use of big data, two additional ‘Vs’, veracity and value, have been included to better characterise the field.4 Figure 1 presents a graphical illustration of the 5 Vs of big data.
Volume refers to the massive amounts of data that can be collected and stored. This amount is constantly growing. For example, the NASA Earth Observing System Data and Information System manages more than nine petabytes (9 000 000 gigabytes).5 Although these amounts are large, what is considered large varies between fields. Within the population health context, the use of electronic health records (EHRs) has considerably increased the volume of data available to clinicians and researchers.6 These files contain personal medical records which may include patient notes, radiology images and patients’ genomic data.7 This information, along with new forms of healthcare and fitness wearable data (eg, 3D imaging, biometric sensors, Global Positioning Systems data), will continue to grow the volume of health data available.
We argue that data volume in population health is often functionally defined by the survey and statistical methods used. As data volumes grow, big data analysis techniques will be required to address new challenges. For example, in genomic data, a massive number of independent variables are derived from individuals, leading the field to develop new methods to deal with multiple comparisons, noise and collinearity concerns.8 As a converse example, real time monitoring of blood glucose levels can provide enough data that almost all between group comparisons will reject the null hypothesis, not because the effect size is large or meaningful, but because the sample size is large enough to register statistically significant differences between closely spaced means.9
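The large-sample point above can be illustrated with a short calculation. The following Python sketch assumes hypothetical blood glucose means (5.55 vs 5.50 mmol/L) and a common standard deviation of 1.0 mmol/L; these values, the function name and the use of a simple two-sample z-test are illustrative assumptions, not taken from the cited study. The standardised effect size stays fixed at 0.05 while the p value shrinks as the number of readings per group grows.

```python
import math

def two_sample_z(mean_a, mean_b, sd, n_per_group):
    """Return (z, two-sided p value, standardised effect size) for two
    equal-sized groups that share a common standard deviation."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = (mean_a - mean_b) / standard_error
    # Two-sided p value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    effect_size = (mean_a - mean_b) / sd
    return z, p, effect_size

# Hypothetical group means in mmol/L; only the sample size changes.
for n in (100, 10_000, 1_000_000):
    z, p, d = two_sample_z(5.55, 5.50, 1.0, n)
    print(f"n per group = {n:>9,}  effect size = {d:.2f}  z = {z:6.2f}  p = {p:.3g}")
```

Under these assumptions, the same small difference is non-significant with 100 readings per group and highly significant with 10 000 or more, which is why effect sizes should be reported alongside p values when data volumes are large.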
Velocity is the speed at which data is generated. Traditional survey or census type data is low velocity, being collected monthly or yearly.3 EHRs, health data collected by patients and social media feeds produce higher velocity data. Data may be generated from daily measurement (eg, a person’s weight), at multiple intervals per second (eg, accelerometer data collected at 100 measures per second) or in real time (eg, heart rate monitoring during surgery).7
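To give a sense of scale for these velocities, the arithmetic can be sketched in a few lines of Python. The function name is hypothetical and the calculation assumes continuous recording for a full 24 hours, which real devices rarely achieve.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def samples_per_day(rate_hz):
    """Number of measurements per day at a fixed sampling rate (per second)."""
    return rate_hz * SECONDS_PER_DAY

daily_weight = 1                      # one weight measurement per day
accelerometer = samples_per_day(100)  # 100 measures per second

print(accelerometer)        # 8640000 measurements per person per day
print(accelerometer * 365)  # 3153600000 measurements per person per year
```

A single participant wearing such a device therefore generates several million times more observations per day than a daily weight measurement.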
Variety refers to the inclusion and use of multiple types of data in data analysis.10 The combination of multiple data sources allows researchers to better understand population health. For example, PopHR, a project led by Dr David Buckeridge at McGill University, is a web application that assists in integrating data from a variety of sources including publicly available administrative data, EHRs and survey responses administered by research teams. Linking a variety of data sources has the potential to provide new insights into population health.11
Veracity refers to the accuracy and credibility of big data and of the results of its analysis.12 The veracity of big data is crucial for health researchers to make evidence-based decisions for population health. There are a number of concerns about the veracity of big data sources including their generalisability and accuracy. Patient data may be systematically incomplete or difficult to decipher. For example, the accurate detection of handwriting from physician notes is crucial for safe decision-making about medicine doses for patients.13 14 Analyses and predictions based on EHRs may not be generalisable to all patients and may contain errors.7 12 Social media data, such as those from Twitter or Facebook, may systematically over-represent or under-represent certain population groups, making the population representativeness of big data questionable.15
Value refers to the types of insights that can be gained from big data collection and analysis. Individual data sources may have limited value, but when collected on massive scale, analysed using big data analysis techniques and tested for veracity, these data gain value.4
Big data analysis techniques
The term big data often refers inclusively to the data (five Vs) and the analysis techniques used to analyse these data. However, the data and the analysis techniques are separate. In fact, big data analysis techniques can be applied to data of any size.
The coining of the term machine learning is often credited to computer scientist Arthur Samuel, who developed a machine that could defeat humans in the game of checkers.16 More recently, Tom Mitchell has defined machine learning as follows: ‘a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E’.17 Modern machine learning involves a number of specific methods including neural networks, decision trees, nearest neighbour classifiers, support vector machines and Markov and hidden Markov models.18 These methods can be used for supervised or unsupervised learning, described below.
Labelled, training, and test set data
Labelled data are data where the value of the variable to be predicted is known. In linear regression, this is the outcome, Y, or left-hand side variable.
Training data are data used to train a machine learning model. Machine learning methods use features (ie, variables) to predict an outcome. However, unlike more traditional methods, the primary objective of the training process is predicting as much of the variance in the data as possible. Training data can be labelled, as in supervised learning, or unlabelled, as in unsupervised learning.
Test set data are labelled data that are not included in the training process, but that are used to validate the model developed using the training data. This allows for an estimate of the model performance, typically done using area under receiver operating characteristic curves, common in survival analyses familiar to population health researchers.19 Two common methods for generating training and test set data are split sample and k-fold cross-validation.20
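The two ideas in this definition, holding data out for testing and summarising performance with the area under the receiver operating characteristic curve (AUC), can be sketched in plain Python. The function names are hypothetical, and the AUC is computed here by pairwise comparison of positive and negative cases (the rank-based formulation), which is one of several equivalent ways to obtain it.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists so that every observation
    appears in exactly one test fold."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def auc(labels, scores):
    """Probability that a randomly chosen positive case receives a higher
    score than a randomly chosen negative case (ties count as one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))

# Five folds over ten hypothetical observations.
for train, test in k_fold_indices(10, 5):
    print("train:", sorted(train), "test:", sorted(test))

# Hypothetical outcomes and predicted risks for four test-set observations.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In practice the model would be refitted on each training fold and the AUC averaged across the corresponding test folds; a split sample approach is the special case of a single train and test partition.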
Supervised learning is a form of machine learning where a model is built linking known inputs (independent variables) to measured outputs (dependent variables), usually with the intent to apply the model to unlabelled inputs in the future. It is most similar to, and includes, regression techniques. Supervised learning is typically divided into classification and prediction tasks. The primary objective of supervised machine learning is to use the trained model to predict the outcome when data are unlabelled, that is, where the outcome is unknown.21
Classification involves identifying the category where an observation belongs, given known category labels.20 Logistic regression is an example of a classifier from statistics.
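As a minimal sketch of a classifier, the following Python code fits a one-variable logistic regression by gradient ascent on the log-likelihood and then assigns new observations to a category. The data, the function names, the learning rate and the 0.5 classification threshold are all illustrative assumptions; statistical software would normally fit the model by iteratively reweighted least squares instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, iterations=5000):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by gradient ascent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Residuals between the labels and the current predicted probabilities.
        errors = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        w += learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b += learning_rate * sum(errors) / n
    return w, b

def classify(x, w, b, threshold=0.5):
    """Assign category 1 if the predicted probability reaches the threshold."""
    return 1 if sigmoid(w * x + b) >= threshold else 0

# Hypothetical labelled data: a single risk score and a binary outcome.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
outcomes = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(scores, outcomes)
print(classify(2.0, w, b), classify(5.0, w, b))
```

The fitted model returns a probability; it becomes a classifier only once a threshold is chosen, and that choice trades sensitivity against specificity.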
Prediction, similar to traditional regression analyses, aims to identify variables associated with the outcome variable. In supervised learning, a model is trained using labelled data to predict a response variable with a given feature set (ie, set of variables). In epidemiological terms, a machine learning method uses data to predict an outcome variable using exposure and confounding variables. For example, a supervised learning method was employed on state-level demographic variables (age, sex, race, household income, employment status, marital status and education) to predict state level prevalence of six non-communicable conditions (high blood pressure, obesity, coronary heart disease, heart attack, stroke, diabetes).22 Overall, the predictions from the machine learning model had correlations >0.80 with state-level prevalence of the non-communicable conditions.
Unsupervised learning is a discovery form of machine learning. Machines use input data to discover patterns and relationships within the data set without a specific outcome variable of interest.21 Unsupervised learning has been used to determine the typical progression of chronic obstructive pulmonary disease23 and the relationships between certain phenotypes and asthma (eg, allergy symptoms, nasal polyps, lung function),24 and, through unsupervised deep learning of EHRs, to predict future disease incidence.25 Principal component analysis and associated data ordination processes are examples of unsupervised machine learning algorithms.
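To make the idea of discovering structure without an outcome variable concrete, the sketch below implements k-means clustering, a common unsupervised method that is not discussed in this glossary and is used here only for illustration. The data are hypothetical pairs of measurements, and the centroids are initialised naively from the first k observations, an assumption made to keep the example deterministic.

```python
def k_means(points, k, iterations=20):
    """Group points into k clusters; no outcome labels are used."""
    centroids = list(points[:k])  # naive initialisation (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical observations, each with two standardised measurements.
observations = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8),
                (7.8, 8.2), (0.9, 1.1), (8.1, 7.9)]
labels, centroids = k_means(observations, 2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

The algorithm recovers the two groups from the measurements alone; interpreting what the groups mean remains the researcher's task.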
Data mining (sometimes referred to as knowledge discovery in databases) is the process of extracting new and at times useful information from data.26 Data mining and machine learning often use the same statistical techniques and it is difficult to differentiate the two in practice.26 Some would argue the primary focus of data mining is unsupervised learning. Drug pathway discovery through analysis of published results is an example of data mining in health research.27 Perhaps, data mining can be better conceived as data refining, where large volumes of data are sifted using statistical techniques to find potential associations of interest for researchers.
Artificial intelligence (AI) is used to describe machines that perform human-like activities such as learning, perception, problem solving and playing games.28 AI has been used to engage the public by improving the quality of eHealth interactions. For example, patients can use AI-based eHealth applications to receive personalised information.29 Chronology MD was developed for patients with Crohn’s disease; this programme allows patients to input their ‘observations of daily living’ and an AI system assists patients with management of their disease (eg, medication reminders, exercise and proper sleep motivation).29 This case highlights how AI applications can increase the immediacy of eHealth, the development of closeness and the feeling of an authentic, caring relationship. These applications help to provide a human-like element to eHealth exchanges between patients and AI systems.
Deep learning is a machine learning technique designed to process signals like a human brain.20 Instead of using a single machine learning technique on a single type of data, deep learning uses multiple machine learning methods and layers of data to perform abstract learning tasks. To date, population health-related examples of deep learning are difficult to identify. An example of deep learning is image captioning. First, an image detection machine (eg, a deep convolutional neural network for vision) identifies the items in an image; then, based on those items, a language generating machine (eg, a recurrent neural network) generates a caption about the image.30 These processes can allow for detection and creation of abstract and creative objects such as paintings31 and music.32
Population health relevant examples of big data analysis
Text analytics refers to the process of compiling and analysing text to derive meaningful information.4 Machines use algorithms to derive patterns and develop categories within text. Machine learning methods for text analytics can extract specific information, summarise and simplify, answer questions (eg, Apple’s Siri) and analyse documents for sentiments and opinions. For example, Twitter data has been used to predict income and socioeconomic status.33 Preoţiuc-Pietro et al 33 used Twitter data and supervised learning techniques (logistic regression with Elastic Net regularisation34 and Support Vector regression35 with a Radial Basis Function kernel), applied to profile features, inferred psychological and demographic features, emotions and word clusters, to predict income. The correlation between predicted and actual income was 0.63, with a mean absolute error of £9535.
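The two performance measures reported for this example, the correlation between predicted and actual values and the mean absolute error, can be computed as in the following Python sketch. The income figures are invented for illustration and are unrelated to the cited study; the function names are likewise hypothetical.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical annual incomes in pounds and model predictions.
actual_income = [20000, 30000, 40000, 50000]
predicted_income = [25000, 28000, 45000, 47000]

print(mean_absolute_error(actual_income, predicted_income))  # 3750.0
print(round(pearson_r(actual_income, predicted_income), 2))
```

The two measures answer different questions: the correlation describes how well predictions rank individuals, whereas the mean absolute error describes how far, in the original units, a typical prediction is from the truth.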
Image analytics refers to techniques that use machines to derive information from image data.36 Images can be photos or satellite information. Image analytics can detect specific features in an image such as a face or type of animal. For example, Pandey and colleagues37 used a number of different machine learning methods to predict the association between ultrafine particles and fine particulate matter (PM1.0), and road traffic and weather factors using satellite image data.
Video analytics (also referred to as video content analysis) uses machine learning to evaluate video footage to extract important details.4 Video analytics have been applied to closed-circuit television and video streaming services, such as YouTube, for object detection and tracking, behavioural analysis, and detection of ‘interesting events’.38 For example, Zangenehpour et al 39 used 90 hours of video at 23 intersections in Montreal to examine the safety of cyclist–driver interactions at intersections with cycle tracks. The authors used TrafficIntelligence, developed by Dr. Nicolas Saunier, to detect and classify road users, select and predict trajectories, and calculate post encroachment time (a measure of safety).40
Audio analytics are used to analyse and derive information from audio data.4 Machine learning methods can be developed to extract information from unstructured audio data or to detect the presence of events within audio data.41 For example, Cheffena41 used four different machine learning methods (k-nearest neighbour, support vector machine, least squares method and a neural network) to detect falls from phone audio, when the phone was within 5 m of the participant. This method helps overcome limitations of accelerometer-based methods that require participants to be wearing the phone for falls detection. The neural network method had sensitivity, specificity and accuracy above 98%.41
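Sensitivity, specificity and accuracy, the measures reported for this fall detection example, are all derived from the same table of true and predicted events. The Python sketch below shows the calculation on ten hypothetical audio segments; the data and function name are illustrative assumptions and do not reproduce the cited results.

```python
def diagnostic_metrics(actual, predicted):
    """Compute sensitivity, specificity and accuracy for binary labels
    (1 = event present, 0 = event absent)."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return {
        "sensitivity": tp / (tp + fn),  # proportion of true events detected
        "specificity": tn / (tn + fp),  # proportion of non-events correctly rejected
        "accuracy": (tp + tn) / len(actual),
    }

# Hypothetical labels: 1 = fall, 0 = no fall.
actual_events = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
detected_events = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

print(diagnostic_metrics(actual_events, detected_events))
```

Because falls are rare relative to the amount of audio recorded, accuracy alone can be misleading, so sensitivity and specificity are usually reported together with it.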
Learning big data analysis methods
Learning big data analysis methods may be one of the biggest challenges for population and public health researchers. The programming requirements and translating between epidemiological and machine learning languages can seem daunting.20 That said, many of the statistical concepts will be familiar to population and public health researchers.
Many machine learning tools are freely available, have introductory tutorials, and considerable online support communities through sites like StackOverflow.42 Python (a general programming language) and R (a statistical programming language) are commonly used for machine learning. The Python module Scikit-learn includes built in functions for a number of common machine learning methods.43 Similar to Scikit-learn, caret is an R package with functions for a number of machine learning methods.44 Weka is a free Java-based application for machine learning.45 As stand-alone software, Weka may offer a more accessible introduction to machine learning for those familiar with statistics but not programming.
Critiques of big data for population and public health
Automating research changes the definition of knowledge
Boyd and Crawford argue that big data has changed how we understand the construction of knowledge, on the premise that big data will allow us to understand complexities such as human behaviour. This minimises the importance of other areas of knowledge creation and ignores limitations of big data.
Claims of objectivity are misleading
Big data can perpetuate the myth that quantitative data analysis is inherently objective. Big data analysis requires human decision-making during cleaning, analysis and interpretation of findings and these are not solely objective processes.
Bigger data are not always better data
Although one may have access to millions of social media posts, these data may not be representative of the population, may be from bots rather than people and may only include a few variables of interest for population health.
Not all data are equivalent
Although similar data may appear to contain similar variables, differences in sources and means of collection may make these data considerably different. Similar studies using social media or cell phone based communication data may not show consistent findings. Researchers must consider the data context and limitations before generalising results between data sources.
Just because it is accessible does not make it ethical
Many social media sites collect data on users and make it available to researchers without explicit permission. Informed consent is required in research involving human subjects, but not necessarily in big data. Researchers must adhere to general ethical principles including respect for persons, concern for welfare and justice.
Limited access to big data creates new digital divides
Big data is often owned by the entity that collects it, giving them control over who can access their data and at what cost. High cost of data access can create disparities in who can conduct research. Further, if the collector decides to make their data inaccessible to outside analysis, others cannot correctly evaluate the data. It is important to consider who is accessing the data and asking questions when interpreting findings from this research.
Critical discussion about big data is important. Experimental design, valid and reliable measurement and thinking deeply about causality remain the fundamental building blocks of population health research. However, population and public health researchers should not shy away from learning about big data and big data analysis methods. These new data sources and advanced analysis methods can help us better understand the unknown,48 better understand the context of population health interventions, and make our analysis methods reflect the conceptual systems perspective in our research.
Big data hubris
Big data hubris is the assumption that data with sufficient volume and velocity can compensate for or eliminate the need for high veracity data, high-quality study designs and more traditional forms of data analysis.47 48 More data are always better for statistical significance, but absent validated models or appropriate study design, more data may not be enough for population health intervention research. The veracity of the data is closely related to the complexity of the big data analysis techniques required to obtain meaning from data. With high-velocity heart rate data, it is relatively easy to draw meaning from the data because these data correlate closely with heart conditions and are not easily confounded. For Twitter feeds containing the word ‘flu’, the data are not closely correlated with influenza prevalence and are confounded by a number of factors.
Lazer and colleagues48 provide a compelling example of big data hubris in the Google Flu Trends research. An important limitation of Google Flu Trends was that the underlying algorithms and methodology of Google search terms are proprietary and evolving (Google’s Hummingbird algorithm likely uses deep learning).49 Core scientific principles of replicability and transparency are difficult when dealing with proprietary data, whether it be from Google, Facebook or others.
The goal of this glossary was to build greater awareness and understanding of big data concepts to increase its use within population health research. Big data is continuously increasing in size and potential application. Big data analytics can be applied to data sets population health researchers are familiar with, such as census and EHRs. Additional sources of data, like text, video, audio and images can be analysed using machine learning. Machines can be programmed to extract specific information or used to search data sources for potential relationships. This information can be used to identify the complex relationship between societal and individual determinants of health.50
As big data is applied more broadly, we will need to consider Boyd and Crawford’s47 provocations for big data. Discussion will be required surrounding its ethics, objectivity and veracity, with Google Flu Trends as an important lesson in big data hubris. Regardless, as database sizes grow, researchers will need to adapt their analyses, and incorporation of big data analysis techniques may offer solutions. Some researchers have begun applying big data analytics to these data sources but many are late adopters. Researchers have the ability to incorporate a variety of data sources into analyses from social media, census, ‘wearables’ and satellite imagery for use within a single research project. Learning big data analysis methods should be a priority for population health researchers, as there is demonstrable potential for big data-driven insights within population health.
Big data and big data analysis techniques have been underused in population and public health
We provide definitions for key concepts of big data and big data analysis
We give examples of applications of big data analysis to population and public health problems
Critiques of big data, including big data hubris, are discussed
The authors acknowledge Mr Cameron Melville (https://www.cameronmelville.com/) for assistance with graphic design and development of the five Vs of Big Data figure.
Contributors DF and KS were responsible for the conceptualisation of this manuscript. DF and RB contributed to the definitions and references for the glossary. All three authors contributed to the writing and editing of this manuscript. All authors have read and approved the final submitted version of this manuscript.
Funding Funding for this paper was provided by Dr. Fuller’s Canada Research Chair.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.