Practical data considerations for the modern epidemiology student.

Nguyen K Tran¹, Timothy L Lash², Neal D Goldstein¹.

Abstract

As an inherent part of epidemiologic research, practical decisions made during data collection and analysis have the potential to impact the measurement of disease occurrence as well as statistical and causal inference from the results. However, the computational skills needed to collect, manipulate, and evaluate data have not always been a focus of educational programs, and the increasing interest in "data science" suggests that data literacy has become paramount to ensure valid estimation. In this article, we first motivate such practical concerns for the modern epidemiology student, particularly as they relate to challenges in causal inference; second, we discuss how such concerns may manifest in typical epidemiological analyses and identify the potential for bias; third, we present a case study that exemplifies the entire process; and finally, we draw attention to resources that can help epidemiology students connect the theoretical underpinnings of the science to the practical considerations described herein.

Keywords: Biostatistics; Causal inference; Data science; Education and training; Epidemiology

Year: 2021 PMID: 35844206 PMCID: PMC9286486 DOI: 10.1016/j.gloepi.2021.100066

Source DB: PubMed Journal: Glob Epidemiol ISSN: 2590-1133

Introduction

We are taught that epidemiologic research often proceeds along a continuum [1]. A research question is conceived, a study is designed and implemented, the analysis is conducted, and an interpretation is offered. Many epidemiologists receive rigorous training in the theoretical and methodological underpinnings needed to answer research questions. For example, in observational etiologic research, we learn of six mechanisms under which an exposure, X, may be related to an outcome, Y: (1) chance, (2) uncontrolled confounding, (3) selection bias, (4) information bias, (5) reverse causality, and (6) true causality [2]. We also learn study designs and analysis strategies to limit mechanisms 1 to 5, so that mechanism 6 sends the clearest signal. Epidemiologists who engage in descriptive, experimental, and quasi-experimental work [3] have similar issues to contend with, as do applied epidemiologists. In short, our training emphasizes a rigorous science.

For the modern epidemiology student, regardless of sub-discipline or application, the collection and analysis of electronic data is inherent in epidemiology. As such, data literacy is crucial to the success of this field, yet the computational skills necessary to collect, clean, and analyse data are often taught separately from traditional training in epidemiology. A 2019 review of graduate curricula among 20 master’s level public health programs in the U.S. noted that training in “data science”, which includes methods of data management and manipulation, was rarely required as a standalone course in contemporary epidemiology programs, and there was a clear delineation between coursework in epidemiology and biostatistics [4]. While this review was unable to evaluate whether such data literacy skills were integrated within existing epidemiology or biostatistics courses, the growing interest in and use of “big data” sources necessitate greater emphasis on pragmatic considerations when conducting epidemiologic studies.
Even a well-conceived, elegant theoretical model studying a pressing public health problem could be derailed by a single poorly conceived, haphazardly measured variable (especially if this variable is X or Y). We contend that these more practical data decisions are as important for the student to learn as the theoretical and methodological concerns of the practicing epidemiologist, and use this article to outline these more pragmatic considerations and how they interact with the underpinnings of modern-day epidemiology [2,5-8]. Our intention is to demonstrate how decisions about data impact causal inference in observational research, but our observations are germane to all sub-disciplines and applications of epidemiology, including descriptive, experimental, and quasi-experimental studies, and fieldwork. Our audience is trainees in epidemiology, clinical or nonclinical, and anyone who will be undertaking an epidemiological inquiry. To illustrate our points, we provide six prototypical examples, their potential impact on the interpretation of results, and possible solutions, summarized in Table 1, and follow with a use case further demonstrating these issues in a real-world study.

Table 1

Summary of the potential impact on causal inference given various practical considerations of epidemiological data.

Research stage	Practical consideration	Hypothetical example	Potential impact on inference	Possible technical solutions
Data Management	1. Missing data	Complete case analysis omits important data	Selection or information bias	Multiple imputations using chained equations [16] or inverse probability weights of censoring [15,29]
	2. Duplicate observations	Data reported from a registry without de-duplication	Selection bias	Calculate predicted values of record linkage [18]
	3. Inconsistent variable definition	Data linkage with resulting inconsistent operationalization	Information bias	Comparison of linked and unlinked data, sensitivity analysis of linkage procedure [19]
Analysis	4. Study design	Failure to consider an appropriate model for the survey design	Biased error	Evaluate research questions and hypotheses to implement appropriate model
	5. Model specification and assumption	Unresolved heteroskedasticity, relationship not linear, or correlated observations	Biased error	Evaluate distributions of data and use of model diagnostics
	6. Variable selection	Inclusion or omission of covariates in the statistical model, mismeasurement of key variables	Uncontrolled confounding or information bias	Causal diagrams [31], bias analysis [39], E-values [41], or negative controls [42]

We note at the outset several defining features of our commentary. First, we deliberately use the terms data management and analysis broadly. By data management we refer to the process of collecting data, including strategies on accessing, harmonizing, cleaning, storing, and preparing data for analysis. By analysis we refer to the process of producing the P(Y|X) estimation in statistical software; in other words, quantifying the exposure to outcome relation. We refer to an “analytic dataset” as the basis for all statistical modelling and the product of data management. Second, although we treat each of the strategies within the research continuum distinctly for didactic presentation, we recognize that they are not mutually exclusive. For example, misspecification of a variable’s definition during the study design can impact decisions about data management and analysis, inducing spurious associations between X and Y. Third, we do not offer new definitions, frameworks, or theories of epidemiology, and the generalizations made herein may not be true among all epidemiology training programs. Fourth and specific to the use case, it is not our intention to critique the authors or findings of the cited studies, but rather to demonstrate the potential for invalid inference based on assumptions made during data management and analysis.

Data management

Broadly speaking, study data in an analytic dataset fall into two categories: data that describe the “rows” of a dataset, i.e., the observations, and data that describe the “columns” of a dataset, i.e., the variables. Note that the scenarios described below may occur without the awareness of the researcher. Practical issues may not represent fatal flaws in the data whereby software would flag an error drawing the researcher’s attention to them. Rather, these seemingly innocuous issues can slip through undetected and wreak havoc in the final analysis.

The theoretical underpinnings of missing data and its influence on causal inference have been well described [9,10]. Missing data may occur for both the observations and variables, where the reason may (or may not) be related to the observed, apparent data [11]. Analysis decisions about missing data may induce selection or information bias, such as when the analyst conducts a complete case analysis in which observations (or variables) are discarded from the analytic sample because data were incomplete. Such an analytic procedure assumes non-informative missingness, a common assumption of statistical models. When this assumption does not hold, P(Y|X) in the analytic sample loses internal validity and is not reflective of the source population [12]. Simple descriptive statistics [13] and causal diagrams [14] may reveal the patterns of missingness and help determine the most appropriate remediation [15,16].

In contrast to missing data, which represent a lack of information, duplicate observations represent excess information. They may result from an incorrect data merge (append) operation and thus represent invalid rows. Duplicate observations are often a by-product of many reporting tools integrated within electronic health records and registries, and as a result, deduplication algorithms are commonplace.
However, for data privacy and security purposes, data linkage may be performed by a third party, which makes it difficult to determine the quality of linkage. Depending upon the proportion of invalid duplicate observations and the strength of P(Y|X), the causal estimate may be biased by the influence of the extraneous data. One way of thinking about this systematic error is through selection bias, in that the probability of inclusion in the analytic sample for any one person is conditional on the variables used to merge the two datasets together [17]. Thus, in the case of duplicate observations, there are unequal selection probabilities for those individuals. One recommended approach to evaluate the degree of duplication is to calculate the percentage of observed versus potential record linkages [18].

The use of inconsistent variable definitions, incorrect constructs, and other problems that arise during data management and cleaning may also impact causal inference. For instance, when multiple datasets are linked together, as is the case when separate instruments were used and the data were recorded in separate files, the operationalization of the variables may have differed. As a simple example, this could be a coding of 0 = male and 1 = female in the first file, and 0 = female and 1 = male in the second file. Failure to recognize this inconsistency can induce an information bias in the final analysis without the researcher ever being aware [17]. This could also affect continuous variables, if, for example, one dataset defined weight in pounds and the other in kilograms. Exploratory data analysis should reveal a potential problem, but this may be subtle if the scales substantially overlap.
Furthermore, several recommended practices have been proposed for evaluating the impact of data linkage error such as comparing the linked data to a training dataset or gold standard, comparing linked and unlinked data, and sensitivity analysis to evaluate how robust results are to different linkage procedures [19]. See Doidge and Harron for a summary of strengths and limitations of these methods [17].
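
The data management checks described in this section can be scripted before any modelling takes place. The following is a minimal sketch in Python (standard library only). It assumes two hypothetical files keyed on a hypothetical `id` field, where the second file codes sex in the opposite direction and contains one repeated row. All field names, codings, and records are invented for illustration and are not drawn from any of the cited studies.

```python
from collections import Counter

# Hypothetical records from two instruments; all values are illustrative only.
file_a = [  # sex coded 0 = male, 1 = female
    {"id": 1, "sex": 0},
    {"id": 2, "sex": 1},
    {"id": 3, "sex": 1},
]
file_b = [  # sex coded 0 = female, 1 = male; id 2 appears twice
    {"id": 1, "sex": 1, "weight_kg": 80.0},
    {"id": 2, "sex": 0, "weight_kg": 61.5},
    {"id": 2, "sex": 0, "weight_kg": 61.5},
    {"id": 3, "sex": 0, "weight_kg": None},
]


def duplicated_ids(rows):
    """Return the ids that occur on more than one row."""
    counts = Counter(row["id"] for row in rows)
    return sorted(i for i, n in counts.items() if n > 1)


def deduplicate(rows):
    """Keep the first row per id (assumes repeated rows are exact copies)."""
    seen, kept = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            kept.append(row)
    return kept


def harmonize_sex(rows):
    """Recode the second file's sex variable to 0 = male, 1 = female."""
    return [{**row, "sex": 1 - row["sex"]} for row in rows]


def missing_fraction(rows, field):
    """Share of rows in which the field is absent or None."""
    return sum(row.get(field) is None for row in rows) / len(rows)


dups = duplicated_ids(file_b)
clean_b = harmonize_sex(deduplicate(file_b))

# After harmonization, sex should agree across files for every linked id.
sex_a = {row["id"]: row["sex"] for row in file_a}
mismatches = [row["id"] for row in clean_b if sex_a.get(row["id"]) != row["sex"]]

print("duplicated ids:", dups)
print("sex mismatches after recoding:", mismatches)
print("fraction missing weight:", missing_fraction(clean_b, "weight_kg"))
```

Keeping the first row per identifier is only defensible when the repeated rows are true copies. When duplicates disagree, the discrepancy itself is information about linkage quality and should be reviewed rather than silently dropped.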

Analysis

Analysis of epidemiological data can range from straightforward univariable descriptive statistics to complex simulation or regression-based approaches. Regardless of the complexity of the analysis, again there are practical considerations that can influence causal inference. Practical analytic decisions that have the potential to induce an apparent P(Y|X) relation, or the lack thereof, include implications from the study design, model specification and assumptions, and variable selection.

Aside from the well-known challenges of estimating causal relations from epidemiologic studies [20], there are practical considerations in the study design that can impact inference. For example, a study analysing perinatal outcomes may have multiple rows of data representing multiple gestations. Unlike the earlier case where the duplicate observations were an artifact of data management decisions, here the repeated nature of the data is intentional and inherent to the study design. In this case, failure to correctly account for the correlated observations, especially if there was a relatively large proportion of multiple births as could be expected in a study of fertility treatments [21], may artificially inflate the statistical model error terms and thereby obscure an otherwise apparent association in the data [22-24].

The choice of which statistical model to employ brings about a host of practical considerations, as all statistical procedures carry assumptions. One must first consider the functional form of the model (i.e., model specification) [25]. When given a continuous outcome, it is perhaps plausible to assume that the average value of Y varies linearly as a function of X. However, making such an assumption without any descriptive assessment of the functional relation between X and Y may lead to a poorly fitted model and erroneous inferences.
It is possible that other model forms such as quadratic, exponential, or spline may better characterize the functional relation between X and Y, and failure to capture this functional relation may bias one’s estimates. In addition, common regression methods such as ordinary least squares regression include assumptions of independence of observations, linearity, homoskedasticity, and non-informative missingness. From a practical lens, violation of these assumptions can result in incorrect point estimates or error terms. This violation may in turn lead to a biased or chance association [26]. In such instances, the use of model diagnostic procedures is vital to detect both systematic and isolated departures of the data from the model [27]. Outliers are an obvious example, and many diagnostic techniques such as residual plots, Cook’s distance, DFBETAS, and goodness-of-fit tests have been developed to evaluate the robustness of these model assumptions [28]. Furthermore, failure to evaluate whether missing data are informative will default to the model’s assumption and treatment of missing data, which is typically a complete case analysis. Non-informative missingness often does not hold, and in such cases, methods of multiple imputation by chained equations [16] and weighting approaches [29] have been developed to address this concern.

Relatedly, in most observational studies in epidemiology, one must also consider the variables to be included in a model, especially in research seeking to estimate P(Y|X). This is because with observational data, we have no expectation of exchangeability; thus, attempts to control for a sufficient set of confounders are necessary to produce valid estimates of P(Y|X).
Although causal diagrams are helpful to depict the relational structure [30,31], this requires substantive background knowledge of the data-generating mechanism to appropriately identify a set of covariates, often using the backdoor path criterion [30], that are sufficient to control for confounding. Once these variables are identified, many techniques are currently available such as matching, stratification, multivariable adjustment [2], and propensity score methods [32] to control for confounding. However, knowledge of the entire causal structure is never available, which is why research needs to reflect this uncertainty given that residual confounding and undiagnosed measurement error can induce spurious associations [33]. In such instances, bias analysis techniques have been developed to quantify the impact of potential uncontrolled confounding and measurement error [34-40]. For the epidemiology student, practical approaches that are relatively simple to implement and impose fewer assumptions, such as estimation of the E-value (the minimum strength of association of an unmeasured confounder to fully explain away an exposure-outcome relation as measured on the risk ratio scale) [41] or incorporation of negative controls (replication of the proposed experiment under conditions that are expected to produce null results) [42], will strengthen the student’s ability to assess the quality of evidence from observational data. In addition, various data-driven strategies have been developed for variable selection, including the significance criteria, information criteria, penalized likelihood (e.g., LASSO), change-in-estimate criterion, and variable selection algorithms [43,44]. Overreliance on data-driven methods for variable selection, however, can lead to the inclusion of too few or too many confounders, resulting in residual confounding.
For example, given the rise in machine learning approaches to variable selection in high-dimensional datasets [4], there is the possibility of including inappropriate covariates in the statistical model. Except in the case of mediation analysis, one would not want to adjust for a mediator, yet because it is correlated with the exposure and outcome, a naive algorithmic approach would nevertheless include this variable, introducing the risk of biased estimation [45]. This further highlights the importance of accounting for the timing of covariates when making decisions on variable selection and the use of methods such as inverse probability weighting of exposure to adjust for time-varying confounding [46]. The tension between a sound theoretical approach and a pragmatic data-driven approach is demonstrated in automated variable selection algorithms, which have largely been discouraged in epidemiological circles [26]. Thus, decisions to include or exclude variables in a model should be supported by background knowledge about the strength of evidence for their association with the exposure and outcome. In instances where variable selection algorithms are employed, one must consider the uncertainty resulting from the selection process and its impact on inference through sensitivity analysis. For a more in-depth discussion of these strategies, see Heinze et al. [43].
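
Of the sensitivity tools mentioned in this section, the E-value is among the simplest to compute by hand. The sketch below uses the commonly cited closed-form expression on the risk ratio scale, E = RR + sqrt(RR × (RR − 1)), applied after inverting estimates below 1. The function names and the example risk ratio and confidence interval are hypothetical and chosen only for illustration.

```python
import math


def e_value(rr):
    """E-value for a risk ratio point estimate."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1 / rr  # protective estimates are inverted first
    return rr + math.sqrt(rr * (rr - 1))


def e_value_ci(lower, upper):
    """E-value for the confidence limit closest to the null.

    Returns 1.0 when the interval includes the null value of 1.
    """
    if lower <= 1 <= upper:
        return 1.0
    return e_value(lower) if lower > 1 else e_value(upper)


# Hypothetical result: RR = 2.0 (95% CI 1.5 to 2.7)
print(round(e_value(2.0), 2))          # 3.41
print(round(e_value_ci(1.5, 2.7), 2))  # 2.37
```

Under these illustrative numbers, an unmeasured confounder would need to be associated with both the exposure and the outcome by a risk ratio of about 3.4 each, conditional on measured covariates, to fully explain away the point estimate, and by about 2.4 each to move the confidence interval to include the null.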

Case study

The following scenario is adapted from Goldstein et al. [47] in which the authors undertook a replication study of the association between a certain type of physical activity and all-cause mortality using exposure and covariate data from the National Health and Nutrition Examination Survey (NHANES) with mortality outcome data linked from the National Center for Health Statistics [48]. The motivation was to test the feasibility of reproducing a study’s findings based solely on the methods disclosed in the manuscript. As such, many implicit assumptions about the data were necessary during data management and analysis, and although these likely would not be disclosed in a typical original research article, herein we detail each consideration from Table 1 and how it may have affected the results.

Missing data. Goldstein et al. noted how decisions about the treatment of missing data in NHANES could cut the analytic sample in half [49]. Specifically, when operationalizing a single, latent variable based on the results of multiple survey questions, if the answer to one of the questions is missing, a data decision is needed: should the entire latent variable be set to missing and respondents without the latent variable excluded from analyses, or should the question with missing data simply be discarded from the construct of the latent variable? The authors therefore needed to balance the potential for selection bias, if the latent variable was omitted from a large number of respondents, versus an information bias, if only some aspect of the latent variable was ignored. Alternatively, a more prudent approach in the original work may have been to impute the missing data using one of the techniques discussed earlier.

Duplicate observations. In most cases, each row in an NHANES data file corresponds to a unique participant. However, this is not always true.
For example, the repeated measure of physical activity in this study was based on accelerometer data that were captured on a minute-by-minute basis [48]. Thus, each participant who wore an accelerometer had a one-to-many relationship to these data, and failure to correctly perform a merge operation to create the appropriate person-level data may result in an extreme number of duplicate observations. This could induce several problems, from selection bias to overly precise standard errors. As such, the authors could benefit from calculating the percentage of observed to potential record linkages to better understand the degree of false or missed linkage.

Inconsistent variable definition. Data linkage is commonplace when working with NHANES data. In fact, for the NHANES 2003–2004 and 2005–2006 survey cycles, the authors noted over 300 unique raw data files available [49]; this was in addition to the need to link to external mortality outcome data. As the practice of NHANES is to separate the various domains and instruments into separate raw data files, this necessitates data linkage. Not only does this present a problem for duplicate or missing data, but this also presents a problem for inconsistent variable definitions. For example, when using NHANES data one may consider measures of self-reported hepatitis C diagnosis from the questionnaire data [50] and laboratory-based measures of hepatitis C antibody and viral load [51]. The treatment of these two variables as interchangeable would be inappropriate as the questionnaire-based data are self-reported and subject to greater misclassification. Therefore, combining these two measures may induce an information bias. It is incumbent upon the researcher to recognize the conceptual differences of similar data collected through different methods, in this case, a self-reported versus laboratory-based diagnosis of hepatitis C infection.

Study design.
The original study’s research aim was to evaluate the association between a certain type of physical activity and all-cause mortality. Participants were sampled from NHANES, which employed a complex survey design to ensure representativeness of the U.S. civilian noninstitutionalized resident population [52]. Failure to consider this study design may affect estimates of variance, and consequently bias test statistics and confidence intervals. As the data used in this example were from two survey cycles, namely 2003–2004 and 2005–2006, NHANES guidance documents stipulated several considerations before aggregating data [52]. Specifically, the authors needed to ensure the proper weighting variable was used before combining the datasets, as there are multiple survey weights given.

Model assumptions. The authors applied a Cox proportional hazards model to estimate how physical activity was associated with all-cause mortality in NHANES. This type of statistical model carries several assumptions, including those common to all regression models, such as the absence of influential observations, linearity, and non-informative missingness, as well as Cox-specific assumptions, namely proportional hazards [53]. While the authors of the original article stated that “the assumption of proportional hazards was tested and held true for our [physical activity] exposure,” the details are not provided nor are the other assumptions described, particularly procedures for handling missing data [48]. This is not unusual; statistical modelling assumptions are rarely discussed in original research articles [54]. Sharing of data and analytic code can facilitate reproducibility when adequate details cannot be provided in the methods due to word limits [55].

Variable selection. Goldstein et al. noted eight separate questionnaire items in NHANES pertaining to alcohol consumption [47].
In order to create a single measure of consumption one may simplistically check for a “positive” response to any of these eight items; however, this risks an information bias as these individual items may represent unique constructs and not be internally reliable. Relatedly, inclusion of this combined measure may result in residual confounding as it may be a poor variable to control for the underlying differences in alcohol consumption between respondents with lower physical activity levels and those with higher physical activity levels. The use of quantitative bias analysis, estimation of the E-value, or incorporation of negative controls may help the investigator evaluate the extent of residual confounding and measurement error.
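
The missing data decision described in this case study, and the "any positive response" composite discussed for alcohol consumption, can be made explicit in code so that the consequences of each rule are visible before analysis. The sketch below is illustrative only: the respondent identifiers, the three yes/no items, and both rules are hypothetical and do not reproduce the NHANES variables or the cited authors' procedures.

```python
# Hypothetical responses to three yes/no items
# (1 = yes, 0 = no, None = missing); values are illustrative only.
responses = {
    "p1": [1, 0, None],
    "p2": [0, 0, 0],
    "p3": [None, None, None],
    "p4": [0, None, 0],
    "p5": [1, 1, 1],
    "p6": [None, 0, 1],
}


def composite_strict(items):
    """Any missing item makes the whole composite missing."""
    if any(v is None for v in items):
        return None
    return int(any(items))


def composite_lenient(items):
    """Ignore missing items; missing only when every item is missing."""
    observed = [v for v in items if v is not None]
    if not observed:
        return None
    return int(any(observed))


def analytic_n(rule):
    """Number of respondents retained under a given rule."""
    return sum(rule(items) is not None for items in responses.values())


strict_n = analytic_n(composite_strict)
lenient_n = analytic_n(composite_lenient)

# Respondents the lenient rule codes as 0 despite unanswered items;
# these are the ones at risk of misclassification.
uncertain = [
    pid for pid, items in responses.items()
    if composite_lenient(items) == 0 and None in items
]

print("retained (strict):", strict_n)
print("retained (lenient):", lenient_n)
print("coded 0 with unanswered items:", uncertain)
```

In this toy example the strict rule retains far fewer respondents, which mirrors the potential for selection bias, whereas the lenient rule retains more but codes some respondents as negative on incomplete information, which mirrors the potential for information bias. Reporting both counts is one way to document the decision.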

Discussion

In the modern era of epidemiology, much thought and consideration of complex topics has yielded a rich theoretical and methodological foundation for researchers [20,56]. Our comments underscore their connection to practical considerations as the basis for valid epidemiology. In short, our work depends upon sound data, and we should not necessarily take for granted that current epidemiology training programs impart the knowledge and skills needed to manipulate complex study data prior to, and in some cases after, analysis [4]. Thus, we emphasize the importance of data literacy for students in epidemiology training programs. In fact, modern epidemiology may be more reflective of a computer and data scientist’s skillset than a physician’s mastery of medicine, contrary to the origin of our field. The six practical considerations in Table 1 are not intended to be an exhaustive list. There are practical issues encountered at all stages of epidemiology: from asking an addressable public health question, to collecting data, to disseminating findings in an appropriate manner. The idealized research continuum many are taught may also conflict with real-world epidemiology, such as emergency public health responses during an outbreak, which may invoke some of the practical considerations in Table 1. We have focused our comments on the data aspect of epidemiologic research as opposed to the other, albeit equally important, facets such as the description of disease frequency that may provide salient information for understanding health disparities and risk factors. Additionally, epidemiologic data will always be imperfect, but not every data issue will lead to invalid inference. For example, in a study of thousands of individuals, a single duplicate record will have a negligible impact on the standard errors.
In the case where selection or information bias is expected to substantially diminish the validity of results, analyses should reflect this uncertainty and investigators need to consider methods in bias analysis to compute bias-adjusted estimates to address systematic errors [2,36,37,57]. Through the case study, we illustrated how the analyst must think about such data concerns systematically, which, as a side benefit, can aid in the reproducibility of study findings [47]. Transparency in data and computing code is one mechanism whereby others can vet the more practical issues of analysing data, for example, through a peer review process specific to research materials [49]. In summary, epidemiology is built upon sound theoretical reasoning, appropriate methodology, and valid data. The first two are well known; the third should not be taken for granted.
