Yoshihiko Raita, Carlos A. Camargo, Liming Liang, Kohei Hasegawa.
Abstract
Clinicians handle a growing amount of clinical, biometric, and biomarker data. In this "big data" era, there is an emerging faith that the answer to all clinical and scientific questions resides in "big data" and that data will transform medicine into precision medicine. However, data by themselves are useless. It is the algorithms encoding causal reasoning and domain (e.g., clinical and biological) knowledge that prove transformative. The recent introduction of (health) data science presents an opportunity to re-think this data-centric view. For example, while precision medicine seeks to provide the right prevention and treatment strategy to the right patients at the right time, its realization cannot be achieved by algorithms that operate exclusively in data-driven prediction modes, as most machine learning algorithms do. Better understanding of data science and its tasks is vital to interpret findings and translate new discoveries into clinical practice. In this review, we first discuss the principles and major tasks of data science by organizing it into three defining tasks: (1) association and prediction, (2) intervention, and (3) counterfactual causal inference. Second, we review commonly used data science tools with examples from the medical literature. Lastly, we outline current challenges and future directions in the fields of medicine, elaborating on how data science can enhance clinical effectiveness and inform medical practice. As machine learning algorithms become ubiquitous tools to handle quantitatively "big data," their integration with causal reasoning and domain knowledge is instrumental to qualitatively transform medicine, which will, in turn, improve health outcomes of patients.
Keywords: big data; causal inference; data science; machine learning; the ladder of causation
Year: 2021 PMID: 34295910 PMCID: PMC8290071 DOI: 10.3389/fmed.2021.678047
Source DB: PubMed Journal: Front Med (Lausanne) ISSN: 2296-858X
Glossary.
| Causal effect | In this article, causal effects refer to average causal (or treatment) effects rather than individual causal effects. In a binary exposure situation (e.g., treatment yes vs. no), it is the average difference between two counterfactual outcomes under two different treatments across all individuals in the population. The effect can be represented with different measures—e.g., risk difference, risk ratio, and odds ratio |
| Causal graphs | A graphical tool for qualitatively encoding domain knowledge and assumptions about the causal structure (e.g., directed acyclic graphs) |
| Causal inference | The process of using data in a sample to infer cause-and-effect relationships in the target population of interest |
| Collider | A variable that is causally influenced by two or more variables. In a causal diagram, it is a node on which multiple directed edges “collide” (see Figure 1A) |
| Confounding | The structural definition of confounding is the bias secondary to common causes of exposure and outcome (i.e., the bias due to confounders). For example, see Figure 1C |
| Consistency | One of three identifiability conditions. Consistency means that the observed outcome for every exposed individual equals his or her (counterfactual) outcome if he or she had received the exposure. This condition requires a well-defined exposure or treatment |
| Counterfactual causal inference | Causal inference based on the framework of counterfactuals to identify and estimate causal effects. For a binary exposure situation (e.g., treatment yes vs. no), this framework presupposes the existence of two outcome states (i.e., two counterfactual outcomes) to which all individuals of the population could be exposed. Counterfactual framework encompasses several models, such as the Neyman-Rubin potential outcome model and Pearl's structural causal model |
| Data | Information collected through observation [e.g., through observational studies, randomized controlled trials, biobanks, biometrics, electronic health records (EHRs)] |
| Data science | An interdisciplinary concept that extracts knowledge and insights from data, using theories and techniques from many fields including computer science, statistics, epidemiology, and other domain knowledge sciences (e.g., medicine). Its major tasks include association and prediction, intervention, and counterfactual causal inference (and description). In this article, data science and health data science are used interchangeably |
| Domain (or subject-matter knowledge) | The knowledge of specialists or experts in a particular field. In our situation, it represents clinical and biological knowledge (e.g., medicine, pediatrics, pulmonology) |
| Effect modification | The situation where the magnitude (i.e., quantitative) or the direction (i.e., qualitative) of the effect of exposure on the outcome differs depending on a third variable—the “effect modifier.” Effect modification is sometimes called an “interaction” in statistical science |
| Exchangeability | One of three identifiability conditions—the exposed and unexposed individuals are exchangeable with regard to their risk factors for the outcome. In a randomized controlled trial, randomization ensures that these risk factors are equally distributed. In an observational study, (conditional) exchangeability can be achieved by adjusting for a sufficient set of confounders (i.e., no unmeasured confounding) |
| Identifiability conditions | Three conditions (consistency, exchangeability, and positivity) required to identify the average causal effect of interest from data. When three identifiability conditions hold true, an observational study can be conceptualized as a conditionally randomized experiment |
| Instrumental variable (IV) methods | An analytic approach that examines the causal effect of exposure on outcome. This approach replaces the exchangeability assumption (i.e., no unmeasured confounding) with an alternative set of IV conditions—the relevance, independence, exclusion restriction, and monotonicity conditions (see Table 3) |
| Machine learning | Machine learning (particularly, statistical learning) refers to a set of algorithms for modeling and understanding complex data. It encompasses many algorithms, such as supervised learning (e.g., lasso regression, random forest, boosting, neural network [or deep learning]) and unsupervised learning (e.g., clustering, principal component analysis). Some examples are summarized in Table 3 |
| Mediation analysis | Causal mediation analysis is an approach that aims to tease apart the total effect, natural indirect (or mediation) effect, and natural direct effect by using a counterfactual framework. The natural indirect effect represents how much the outcome risk would change if the patient were set to be exposed, but the mediator value were changed from the value it would take if unexposed to the level it would take if exposed. The natural direct effect represents how much the outcome risk would change if the patient were set to be exposed vs. to be unexposed but, for each patient, the mediator value were kept at the level it would have taken in the absence of exposure |
| Mendelian randomization | An analytic approach that examines the causal effect of a modifiable exposure (e.g., physical traits, molecular biomarkers) on the outcome of interest by using genetic variants as IVs |
| Positivity | One of three identifiability conditions—the probability of receiving every value of treatment/exposure conditional on a set of covariates is >0 (i.e., positive). For example, if all individuals received the same treatment/exposure level (i.e., a violation of positivity), it would be impossible to estimate the average causal effect |
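The identifiability conditions and the counterfactual contrast defined above can be made concrete with a toy calculation. The sketch below uses invented numbers (not data from the article) with a single binary confounder Z: the crude risk difference suggests a harmful treatment effect, while the standardized (g-formula) risk difference, which averages stratum-specific risks over the marginal distribution of Z, correctly recovers a null causal effect.

```python
# Hypothetical illustration of confounding and the average causal effect
# as a risk difference. Assumes the three identifiability conditions
# (consistency, exchangeability within strata of Z, positivity) hold.
# All counts are invented for illustration.

# records: (confounder Z, treatment A, outcome Y, count)
strata = [
    (0, 0, 1, 10), (0, 0, 0, 90),    # Z=0, untreated: risk 0.10
    (0, 1, 1,  5), (0, 1, 0, 45),    # Z=0, treated:   risk 0.10
    (1, 0, 1, 20), (1, 0, 0, 30),    # Z=1, untreated: risk 0.40
    (1, 1, 1, 60), (1, 1, 0, 90),    # Z=1, treated:   risk 0.40
]

def risk(rows):
    """Proportion of events among the given rows."""
    events = sum(n for (_, _, y, n) in rows if y == 1)
    total = sum(n for (_, _, _, n) in rows)
    return events / total

# Crude (confounded) risk difference: treated vs. untreated overall
treated = [r for r in strata if r[1] == 1]
untreated = [r for r in strata if r[1] == 0]
crude_rd = risk(treated) - risk(untreated)

# Standardized (g-formula) risk difference: average the stratum-specific
# risks over the marginal distribution of the confounder Z
total_n = sum(n for (_, _, _, n) in strata)
std_risk = {0: 0.0, 1: 0.0}
for z in (0, 1):
    weight = sum(n for (zz, _, _, n) in strata if zz == z) / total_n
    for a in (0, 1):
        rows = [r for r in strata if r[0] == z and r[1] == a]
        std_risk[a] += weight * risk(rows)
std_rd = std_risk[1] - std_risk[0]

print(f"crude risk difference:        {crude_rd:+.3f}")
print(f"standardized risk difference: {std_rd:+.3f}")
```

By construction, the stratum-specific risks are identical under treatment and no treatment, so the standardized risk difference is zero while the crude contrast is spuriously positive—the glossary's definition of confounding in miniature.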
Scientific questions, required information, and analytical methods of data science according to the ladder of causation.
| - What are the risk factors for developing asthma? - What is the probability of developing asthma in a patient with a set of predictors? | - Risk factors/predictors | - Regression |
| - Will a new biologic agent decrease the rate of asthma exacerbation by 30%, compared to placebo? | - Eligibility criteria | - Elementary statistics in RCTs (e.g., risk differences of the outcome) |
| - What would be the preventive effect of a new drug had it been given to a group of patients with a set of characteristics? | - Eligibility criteria | - Regression |
IPW/MSM, inverse probability weighting for marginal structural model; IV, instrumental variable; RCT, randomized controlled trial.
All tasks require no information bias (no measurement error or misclassification) and no model misspecification.
The effect of interest must occur after the cause (allowing for an expected delay) within the observation period.
Major analytical tools used in data science.
| Causal mediation analysis | Counterfactual causal inference | - The models well-represent the hypothesized cause-and-effect process that generates the data (e.g., temporal sequence) | - Identification of causal mechanisms (e.g., direct and indirect effects) | - Interpretation of natural direct and indirect effects is complicated |
| Inverse probability weighting and marginal structural model | Counterfactual causal inference | - Model specifications | - Time-varying effects can be estimated | - Methodologically complex |
| Lasso regularization | Association/prediction | - Identification of hyperparameter | - Automated covariate selection | - Only linear relations can be accommodated |
| Neural network/deep learning | Association/prediction | - Sample size | - Large number of predictors and non-linear relations can be accommodated | - Large sample size is often needed |
| Random forest | Association/prediction | Same as neural network | - Applications to identification of heterogeneous treatment effects (causal forest) | - Transportability to other domains is often limited |
| Unsupervised learning (e.g., hierarchical clustering, k-means) | Description of data (e.g., dimensional reduction, clustering) | - Appropriateness of the chosen distance measure for the dataset | - Hypothesis-free | - Hypothesis-generating in nature |
| Mendelian randomization (or IV analysis) | Counterfactual causal inference | Four IV conditions: (1) Relevance: strong correlation between genetic instruments and exposure; (2) Independence: no association between instruments and exposure-outcome confounders; (3) Exclusion restriction: instruments affect the outcome only through the exposure; (4) Monotonicity: increasing the number of effect alleles for an individual can only increase the level of exposure, never decrease it | - No-unmeasured-confounding assumption is not required | - Identification of appropriate instruments is often difficult |
| Propensity score matching | Counterfactual causal inference | - Model specifications | - Simple interpretability | - Matched sample is often poorly-characterized |
| Intention-to-treat (ITT) analysis | Intervention | - Adherence to assigned treatment | - Interpretation is simple | - The causal estimate is often not the effect of interest to clinicians (i.e., ITT is agnostic about treatment decisions after the random assignment) |
| Per-protocol analysis | Intervention | - Adherence to assigned treatment | - Estimates the effect of receiving the treatment as specified in the study protocol (if time-varying prognostic factors associated with adherence are accounted for) | - Post-randomization time-varying factors are often unmeasured or unaccounted for |
| Regression | Association/prediction | - Model specifications | - Simple interpretability | - Only conditional effects (within the levels of covariates) can be estimated (i.e., not marginal effects) |
| Standardization/g-formula | Counterfactual causal inference | - Model specifications | - Marginal effects can be estimated | - Methodologically complex |
| Targeted learning using TMLE | Counterfactual causal inference | Standard identifiability conditions (Table 1) | - Use of machine learning that places minimal assumptions on the distribution of data and accommodates complex non-linear relationships | - Methodologically complex |
IV, instrumental variable; TMLE, targeted maximum likelihood estimation.
For any causal inference method (except IV methods), the standard identifiability conditions (consistency, exchangeability, and positivity; Table 1) are required.
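As a companion to the standardization/g-formula entry in the table above, here is a minimal sketch of inverse probability weighting (IPW) with invented numbers—the stratum counts, variable names, and propensity function are all hypothetical, not taken from the article. Each subject is weighted by the inverse of the probability of the treatment actually received given the confounder Z, creating a pseudo-population in which treatment is independent of Z.

```python
# Hypothetical IPW sketch: weight each subject by 1 / P(A = a_i | Z = z_i).
# All counts are invented for illustration.

# Expand hypothetical stratum counts into individual (Z, A, Y) records
counts = [
    (0, 0, 1, 10), (0, 0, 0, 90),
    (0, 1, 1,  5), (0, 1, 0, 45),
    (1, 0, 1, 20), (1, 0, 0, 30),
    (1, 1, 1, 60), (1, 1, 0, 90),
]
subjects = [(z, a, y) for (z, a, y, n) in counts for _ in range(n)]

def propensity(z):
    """Estimated propensity score e(z) = P(A = 1 | Z = z)."""
    in_stratum = [s for s in subjects if s[0] == z]
    return sum(1 for (_, a, _) in in_stratum if a == 1) / len(in_stratum)

# Horvitz-Thompson style IPW estimates of the two marginal risks
n = len(subjects)
risk_treated = sum(y / propensity(z)
                   for (z, a, y) in subjects if a == 1) / n
risk_untreated = sum(y / (1 - propensity(z))
                     for (z, a, y) in subjects if a == 0) / n
ipw_rd = risk_treated - risk_untreated

print(f"IPW marginal risk difference: {ipw_rd:+.3f}")
```

With these invented counts the IPW risk difference is approximately zero, matching what standardization over Z would give—both estimators target the same marginal causal contrast when the identifiability conditions hold.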
Figure 1. Examples of causal directed acyclic graphs that encode a priori domain knowledge and causal structural hypotheses. (A) Birth-weight paradox. There is no direct arrow from maternal smoking (exposure) to infant mortality (outcome), representing no causal effect. However, an association/prediction-mode machine learning algorithm would automatically adjust for variables that are associated with both smoking and mortality (e.g., low birth-weight). Graphically, a rectangle placed around the low-birth-weight variable represents adjustment. However, this adjustment for the collider (a node on which two directed arrows “collide”; Table 1) opens the flow of association from exposure → collider → covariates (e.g., structural anomaly) → outcome, which leads to a spurious (non-causal) association. (B) Simple example of a causal diagram, consisting of exposure (biologic agent), outcome (asthma control), and covariates (e.g., baseline severity of illness). The presence of an edge from one variable to another represents our knowledge of the presence of a direct effect. (C) Example of confounding. While there is no causal effect (i.e., no direct arrow from exposure to outcome), there is an association between these variables through the paths involving a common-cause covariate (i.e., a confounder), leading to a non-causal association between the exposure and outcome (i.e., confounding; Table 1). (D) Example of de-confounding. This confounding can be addressed by adjusting for the confounder, thereby blocking the back-door path. Graphically, a rectangle placed around the confounder blocks the association flow through the back-door path. (E) Example of mediation. The causal relation between the exposure (systemic antibiotic use), mediator (airway microbiome), and outcome (asthma development). The confounders (e.g., acute respiratory infections) of the exposure, mediator, and outcome should be adjusted for.
The indirect (or mediation) effect is represented by the path that passes through the mediator. The direct effect is represented by the path that does not (the broken line; Table 1). (F) Example of Mendelian randomization. Genetic variants that are strongly associated with the exposure of interest (mental illnesses) function as the instrumental variable. Note that there is no association (or path) between the genetic variants and unmeasured confounders (i.e., the independence condition) and that the genetic variants affect the outcome only through their effect on the exposure (i.e., the exclusion restriction condition; Table 3).
Figure 2. Identification and estimation of heterogeneous treatment effects. In this hypothetical example, suppose we investigate the treatment effects of systemic corticosteroids on hospitalization rates among preschool children with virus-induced wheezing. (A) Randomized controlled trial (RCT) to investigate the average treatment effect of systemic corticosteroids (conventional 1:1 RCT). (B) Investigating heterogeneous treatment effects using tree-based machine learning models. In each of the branches (e.g., subgroup A children have a specific virus infection and a history of atopy), children have a comparable predicted probability of receiving systemic corticosteroids. Children within each subgroup function as if they came from an RCT with eligibility criteria stratified by clinical characteristics.
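At its simplest, the subgroup logic of panel (B) amounts to computing arm-specific risks within covariate-defined strata of an RCT. The sketch below uses invented counts (the subgroup labels and all numbers are hypothetical, not trial data) to show a subgroup-specific risk difference, i.e., effect modification by a third variable such as atopy history.

```python
# Hypothetical sketch of effect heterogeneity in an RCT: the risk
# difference of hospitalization (treatment vs. placebo) is computed
# within subgroups, mimicking the leaves of a tree-based model at its
# simplest. All counts are invented for illustration.

# (subgroup, treated flag) -> (hospitalization events, arm size)
arms = {
    ("atopy", 1): (10, 100),
    ("atopy", 0): (30, 100),     # large benefit in this subgroup
    ("no_atopy", 1): (25, 100),
    ("no_atopy", 0): (25, 100),  # no benefit in this subgroup
}

def risk_difference(subgroup):
    """Treated-minus-control risk difference within one subgroup."""
    events_t, n_t = arms[(subgroup, 1)]
    events_c, n_c = arms[(subgroup, 0)]
    return events_t / n_t - events_c / n_c

for group in ("atopy", "no_atopy"):
    print(f"{group}: risk difference = {risk_difference(group):+.2f}")
```

A pooled analysis of these invented arms would average the two subgroup effects, hiding the fact that the benefit is concentrated in one stratum—which is exactly the heterogeneity that tree-based approaches such as causal forests aim to discover from data.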
Twelve major resources for clinicians who wish to learn about data science.
| Data science (in general) | MOOC | Khan Academy | An online course that covers a wide range of topics about statistical analyses |
| | MOOC | Coursera: data science specialization | An online course that provides a broad overview of data science |
| | MOOC | edX: introduction to probability (HarvardX STAT110x) | An online course that introduces the basics of probability theories, which are fundamental for data science, statistics, and causal inference |
| | MOOC | Stanford: statistical learning | An online course that offers an introduction to various statistical learning (including machine learning) approaches |
| | Textbook | | A well-written introductory textbook that is used in the statistical learning course (see above) |
| | Paper | | |
| | Paper | | |
| Machine learning | MOOC | Coursera: machine learning | One of the most popular machine learning courses (as of January 2021, 3.9 million students have been enrolled). This introductory course provides an overview of various machine learning algorithms |
| | MOOC | Coursera: Deep learning specialization | A more detailed online course that covers the basics and applications of various deep learning algorithms |
| Causal inference | MOOC | edX: Causal diagrams (HarvardX PH559x) | An online course that introduces an overview of causal diagrams in clinical research |
| | MOOC | Coursera: A crash course in causality | An online course that provides an introductory overview of causal inference theories and approaches |
| | Textbook | | Introductory-level textbook that covers important topics in causal inference (e.g., causal diagram) |
| | Textbook | | Comprehensive intermediate-level textbook that provides the concepts of and methods for causal inference in clinical research |
| Programming | MOOC | Coursera: foundations using R specialization | An online course that provides a broad overview of R programming |
| | Others | DataCamp | A collection of introductory video lectures and hands-on coding practices in several programming languages (e.g., R, python) |
MOOC, massive open online course; BMJ, British Medical Journal; JAMA, Journal of the American Medical Association.
All of the listed MOOCs are publicly available without fee.
Figure 3. Integration of “big data,” data science, and domain knowledge toward precision medicine. Development of precision medicine requires the integration of “big data” from expanded data sources and data capture with robust data science methodologies and analytics that encode domain causal knowledge and counterfactual causal reasoning.