Literature DB >> 32615926

Directed acyclic graphs and causal thinking in clinical risk prediction modeling.

Marco Piccininni¹, Stefan Konigorski^2,3, Jessica L Rohmann⁴, Tobias Kurth⁴.

Abstract

BACKGROUND: In epidemiology, causal inference and prediction modeling methodologies have been historically distinct. Directed Acyclic Graphs (DAGs) are used to model a priori causal assumptions and inform variable selection strategies for causal questions. Although tools originally designed for prediction are finding applications in causal inference, the counterpart has remained largely unexplored. The aim of this theoretical and simulation-based study is to assess the potential benefit of using DAGs in clinical risk prediction modeling.
METHODS: We explore how incorporating knowledge about the underlying causal structure can provide insights about the transportability of diagnostic clinical risk prediction models to different settings. We further probe whether causal knowledge can be used to improve predictor selection in clinical risk prediction models.
RESULTS: A single-predictor model in the causal direction is likely to have better transportability than one in the anticausal direction in some scenarios. We empirically show that the Markov Blanket, the set of variables including the parents, children, and parents of the children of the outcome node in a DAG, is the optimal set of predictors for that outcome.
CONCLUSIONS: Our findings provide a theoretical basis for the intuition that a diagnostic clinical risk prediction model including causes as predictors is likely to be more transportable. Furthermore, using DAGs to identify Markov Blanket variables may be a useful, efficient strategy to select predictors in clinical risk prediction models if strong knowledge of the underlying causal structure exists or can be learned.

Entities: Chemical Disease Gene Species

Keywords: Causality; Clinical risk prediction; Directed acyclic graph; Markov blanket; Prediction models; Predictor selection; Transportability

Mesh：

Year: 2020 PMID： 32615926 PMCID： PMC7331263 DOI： 10.1186/s12874-020-01058-z

Source DB: PubMed Journal: BMC Med Res Methodol ISSN： 1471-2288 Impact factor: 4.615

Background

In modern epidemiology, prediction modeling and causal inference are generally considered separate branches with unique sets of methods and aims. However, recently, the emerging field of “causal learning” or “causal discovery” has led to the introduction of prediction modelling and machine learning techniques as tools to generate causal structures based on data-driven procedures [1]. Despite some specific implementations [2], movement in the other direction has been less explored; namely, the application of causal inference principles and graph theory in clinical risk prediction modeling strategies. Diagrams and graphs are intuitive, visual tools used to inform analytic methods to answer causal questions [3]. The increasing use of causal graphs and the need for automated procedures to assess causal effects given the combination of previous structural knowledge and new data led to the development of a compact, formal theory free of parametric assumptions to transparently model causal relationships [3]. Directed Acyclic Graphs (DAGs) are used to rigorously map all a priori assumptions surrounding a causal question of interest [3] and to graphically describe the underlying data generating process. In DAGs, each node represents a random variable, and directed causal paths are represented by arrows. The causal graph structure thus provides qualitative information about the conditional independencies of the variables of interest. DAGs are used as a tool in causal inference to illustrate potential sources of confounding and selection bias and ultimately identify suitable strategies to address them [3, 4]. We assume the reader is familiar with DAGs; for those not yet familiar, several accessible introductions have been published elsewhere [3, 5]. The aim of this work is to investigate the potential benefits of using DAGs and causal thinking in clinical risk prediction problems. Specifically, we describe the use of causal knowledge in assessing transportability and selecting predictors for a clinical risk prediction model.

Methods

Transportability and the principle of independent mechanisms

A causal concept that could be useful in clinical risk prediction modeling is the principle of independent mechanisms [1]. This fundamental assumption was formalized to justify the inference of causal structure from observed data [1, 6] and was later suggested as a useful hypothesis to drive machine learning-based prediction approaches [7]. This principle of independent mechanisms states that the “causal generative process of a system’s variables is composed of autonomous modules that do not inform or influence each other” [1]. This means that a causal process can be interpreted as a chain of independent mechanisms, in which each causal mechanism takes the state output from the previous mechanism as input and “feeds” the next mechanism with its own state output. Each causal mechanism on the chain can be conceptualized as a physical mechanism invariant to the input it receives [1]. The idea of the autonomy of the mechanisms is actually more intuitive than it seems. In fact, it is how we justify all clinical interventions: we assume that artificially changing one mechanism or its input will not affect any of the other mechanisms [1]. Let’s consider two variables with an unconfounded causal relationship. For simplicity, we will call these two variables “Cause” and “Effect”. The joint probability distribution of these two variables ℙ(Cause,Effect) can be factorized in two ways [1, 7]: The principle of independent mechanisms states that the marginal distribution of the variable Cause, ℙ(Cause), and the conditional distribution of the variable Effect on the variable Cause, ℙ(Effect|Cause), contain no information about each other [1, 7]. Indeed, ℙ(Effect|Cause) is the distribution of the variable Effect for each given value of the variable Cause. It represents the physical mechanism that transforms the input (Cause) into an output (Effect), while ℙ(Cause) represents the state of the input. Under the principle of independent mechanisms, ℙ(Cause) and ℙ(Effect|Cause) change independently of each other across different joint distributions [1]. This independence constraint in the first factorization induces a dependency between the conditional distribution of Cause on Effect, ℙ(Cause|Effect), and the marginal distribution of the Effect, ℙ(Effect), shown in the second mathematical factorization in the anticausal direction [1, 7]. Therefore, ℙ(Effect) and ℙ(Cause|Effect) often change in a dependent way across different joint distributions [1]. Since this concept of independence involves mechanisms rather than variables, it cannot be simply defined, tested, or quantified like the concept of statistical independence in probability theory [1]. In this work, we present two hypothetical, simplified clinical examples from the field of neurodegenerative disease to illustrate the consequences of the principle of independent mechanisms in the context of diagnostic clinical risk prediction models. Specifically, we describe the transportability of two clinical risk prediction models for Alzheimer’s disease diagnosis using different predictors. In the first example, the disease is the effect of the predictor (allele APOE ε4 status, which is a known cause of Alzheimer’s disease), while in the second example, the disease is the cause of the predictor (concentration of tau protein in cerebrospinal fluid, which is described as an effect of the Alzheimer’s disease pathological process).

Predictor selection and the Markov blanket

There is another causal concept that may be useful for the first and arguably most important step in building clinical risk prediction models: predictor selection. Here, we focus on the main challenge of selecting the smallest possible subset of all available variables that provide enough information to predict the outcome of interest with good validity in terms of calibration. There are many well-known reasons to limit the number of predictors used to build a risk prediction model: (i) to reduce problems due to the high number of variables in the model, thereby increasing performance, (ii) to reduce the costs, time and effort associated with data collection and storage, model development or training, (iii) to enable easier use of the model in different settings, and (iv) to increase the interpretability of the mechanisms behind the generation of the probability estimates [8, 9]. The last reason is particularly important in the context of clinical risk prediction models. Indeed, medical doctors are reluctant to use prediction models without a certain degree of interpretability [10], since the output probabilities are used to support clinical decisions about treatments and prevention strategies. Intuitively, the predictor selection problem can be interpreted as how to choose the smallest subset of variables excluding all variables that do not provide additional information on the outcome of interest. By operationalizing the lack of additional information using the notion of conditional independence [11], the entire problem of predictor selection is analogous to identifying the so-called “Markov Blanket” of the outcome variable. We define Y as the random variable for the outcome of interest and X as the set of all available candidate predictor variables of Y. We assume that X is a superset of the variables relevant to the causal processes in which Y is involved. The Markov Blanket of Y, MB(Y), is the minimal subset of X, conditioned on which, all other variables of Xnot included in MB(Y) are independent of Y [8, 9]: where X - MB(Y) denotes the set of variables which are contained in X but not in MB(Y). The concept of the Markov Blanket was first introduced by Pearl in 1988 in his work on Bayesian networks [12]. Years later, it was first used to identify the theoretical optimal set of variables for prediction tasks [11]. According to the definition above, given MB(Y), the other variables contained in X are independent of the outcome Y. This means that they do not provide any further information about Y, and all the information to predict the behavior of the outcome is already contained in the Markov Blanket MB(Y) [1, 13]. If the technique used to build the prediction model for Y can fully describe the underlying true probabilities Pr(Y|MB(Y)), and a model with fewer variables is preferred, then the variables included in the Markov Blanket of the outcome Y are the only variables needed for an optimal prediction in terms of calibration [8]. Therefore, in an idealized regression setting, to fit the appropriate model, the predictor selection task consists of finding the Markov Blanket of the outcome variable [1, 9]. This concept can be used to link variable selection in clinical risk prediction modeling to the underlying causal structure of the data [14]. Let’s consider a DAG G and a set of variables S described by a joint distribution ℙS with a density. The distribution ℙS, is said to be Markovian with respect to G if each variable is conditionally independent of its non-descendants (i.e. variables it does not affect), given its parents (i.e. its direct causes) [1, 9]. This Markov property creates a link between ℙS and G, ensuring that all the conditional independencies entailed by the DAG are also present in the probability distribution [1, 15]. A further condition makes this link stronger; “faithfulness” implies that the only conditional independencies to hold in the joint distribution ℙS are the ones entailed in G [14]. The previous intuition can be formalized; it has been demonstrated that if the joint distribution of the variables is faithful and Markovian with respect to the DAG, a predictor is strongly relevant (see [16, 17] for a definition) for predicting the outcome if and only if it is part of the Markov Blanket of the outcome [17]. Under these conditions, the Markov Blanket of the outcome is unique and has a particular constitution: it includes all parents of the outcome node, all of its children, and all parents of its children [1, 8, 9, 12]. As shown in Fig. 1, these nodes “shield” the outcome variable Y from all the remaining variables in the DAG [13]. Therefore, the information contained in these nodes is sufficient to describe the outcome variable’s status.

Fig. 1

Example of the Markov Blanket (in black) of outcome Y in a simple Directed Acyclic Graph (DAG) with many nodes

Example of the Markov Blanket (in black) of outcome Y in a simple Directed Acyclic Graph (DAG) with many nodes These results are appealing for researchers tasked with selecting predictors for clinical risk prediction modeling. According to a 2010 review, at least 8 different algorithms have been developed to identify the Markov Blanket for an outcome variable using data-driven procedures [9]. In the field of causal learning, algorithms that learn the entire causal structure [14] and the local causal structure [18] based on the identification of Markov Blankets have been developed. Given this theoretical line of argumentation, we believe that a knowledge of the underlying causal processes behind the data generation can help to identify the best predictors to be included in a clinical risk prediction model. As proof of concept, we conducted a series of simulations using R version 3.6.3 (R code can be found in the Supplementary file). We simulated 100,000 datasets with 25 variables and 10,000 observations each. Each dataset was simulated according to a randomly generated DAG (using the randomDAG function in the dagitty R package). The DAG included 25 ordered nodes corresponding to 25 variables. Each node was given a probability of 0.1 of receiving a directed arrow from each of the individual previous nodes. One of the nodes was then randomly selected as the binary outcome of interest, all other 24 variables were assumed to be continuous. Any exogenous variables (i.e. variables without any parent nodes) were generated as normally distributed variables with a mean of 0 and variance of 1, or, if the outcome was exogenous, as a Bernoulli random variable with an event probability of 0.2. When the outcome was an endogenous variable (i.e., with at least one parent node), each observation was drawn from a Bernoulli distribution with a defined probability parameter. This was set as the inverse-logit function evaluated at the linear combination of the outcome node’s parent variables, with randomly drawn coefficients. Specifically, the coefficients (including the intercept) for the outcome endogenous variable were drawn from a uniform distribution on (− 1,1). Similarly, the observations of the continuous endogenous variables were randomly drawn from a normal distribution with unit variance and with the mean equal to the linear combination of randomly drawn coefficients and the values of the node’s parent variables. Here, the coefficients (including the intercept) for each endogenous variable were drawn from a uniform distribution on (− 2,2). The choice of the regression coefficients was therefore not restricted in order to satisfy the faithfulness assumption by design. For each of the 100,000 datasets, eight prediction tools were developed to predict the probability that the binary outcome equals 1: a logistic regression model including only variables in the Markov Blanket of the outcome as predictors, a logistic regression model including all 24 variables as predictors, a logistic regression model including any variable with a path leading to the outcome node (regardless of arrow direction on the path) as predictors, a logistic regression model including only the outcome node’s parent variables as predictors, a logistic lasso regression model inputting all 24 variables, a logistic ridge regression model inputting all 24 variables, a logistic elastic net regression model with mixing parameter alpha of 0.5 inputting all 24 variables, and a random forest algorithm inputting all 24 variables. In all regression models, all included variables were modeled as being linearly related to the logit of the outcome. Lasso, ridge, and elastic net models were computed using the glmnet function in the glmnet R package with default settings. The regularization parameter, lambda, that minimized the 10-fold cross-validated error based on the deviance for logistic regression with the cv.glmnet function (glmnet package) was selected. Random forests were built using the randomForest function in the randomForest R package with 1000 trees and default settings. For each dataset, the calibration of each prediction tool was measured using the Integrated Calibration Index [19] (ICI) based on 10-fold cross-validation. Lower ICI indicates better model calibration. The ICI estimation relies on a non-parametric regression between the outcome variable and the predicted risk estimated by the prediction tool. Therefore, if the non-parametric regression fails in one or more of the 10 cross-validation sets, it is not possible to compute the ICI. This happens if an intercept-only model or a model with variables’ regression coefficients very close to 0 is evaluated. We also compared the variable sets included in the Markov Blanket-based logistic models with the ones selected by the lasso and elastic net regression models. We considered a variable to be selected by the model if the absolute value of its estimated regression coefficient was nonzero, which we operationalized as a value higher than 10− 10.

Results

The potential benefit gained from applying the principle of independent mechanisms to the assessment of transportability of clinical risk prediction models is presented using two simplified clinical examples from the field of neurodegenerative disease.

Example 1

Say that we are interested in building a diagnostic clinical risk prediction model for the presence of Alzheimer’s disease (Y = 1), using the APOE ε4 allele status (X = 1, presence; X = 0, absence) as the sole predictor of the outcome in the general population of older persons. Y = 0 indicates disease absence. Since APOE ε4 is a known cause for Alzheimer’s disease [20], we could draw the DAG shown in Fig. 2. Note that we are assuming a direct, unconfounded causal relationship (a strong assumption). By convention, each variable in the DAG is affected by a “noise” variable, which are assumed to be independent of other noise variables and modeled as random variables. These are usually not explicitly depicted because they are not of relevance to the causal relationship under study. However, it is worth noting that the noise variable affecting X determines the prevalence of the APOE ε4 allele, while the noise variable affecting Y contributes to the definition of the causal mechanism between the APOE ε4 allele status and Alzheimer’s disease [7].

Fig. 2

Directed Acyclic Graph (DAG), Example 1

Directed Acyclic Graph (DAG), Example 1 Assume we collect cross-sectional data about Alzheimer’s disease and APOE ε4 allele status in a population A. Using this data, we can develop a simple diagnostic clinical risk prediction model using logistic regression to predict the presence of Alzheimer’s disease. The regression equation would be: Using the logistic regression equation it’s possible to estimate the four conditional probabilities Pr(Y = 1|X = 0), Pr(Y = 1|X = 1), Pr(Y = 0|X = 0), and Pr(Y = 0|X = 1), which define the conditional distribution ℙ(Y|X). We will assume that the logistic regression is able to fully describe this conditional distribution, while the prevalence of the APOE ε4 allele (Pr(X = 1)) defines the marginal distribution ℙ(X) of this predictor. Next, say we want to use our newly developed risk prediction model as a diagnostic tool for Alzheimer’s disease in another population B in which we know there is a different prevalence of the APOE ε4 allele. The new distribution of the predictor X in population B can be denoted as ℙ(X). According to the principle of independent mechanisms, the fact that the original distribution of X, ℙ(X), has been changed to ℙ(X) does not give any information on the mechanism ℙ(Y|X) in population B [1, 7]. This is because X causes Y, and ℙ(Cause) is independent of ℙ(Effect|Cause). If the underlying causal mechanism is not altered (ℙ(Y|X) is the same in the two populations), the diagnostic clinical risk prediction model developed in population A will produce valid estimates also in population B. On the other hand, if the causal mechanism changed, knowing the predictor distribution ℙ(X) does not give us any information about how the mechanism changed [1, 7]. In this case, the logistic regression model developed in population A for modeling ℙ(Y|X) is still our best diagnostic tool candidate [1, 7]. In this example, knowledge of the underlying causal structure suggests that using the same diagnostic clinical risk prediction model in the new population is a reasonable choice [1, 7].

Example 2

Next, say we are still interested in building a diagnostic clinical risk prediction model for the presence of Alzheimer’s disease, but instead choose to use a different variable as the sole predictor, which indicates whether the concentration of tau protein in cerebrospinal fluid (CSF-tau) is above a predefined threshold. As before, Y = 1 and Y = 0 indicate presence and absence of Alzheimer’s disease. K = 1 indicates high tau protein concentration, and K = 0 indicates low tau protein concentration. It is known that high CSF-tau levels are associated with the presence of Alzheimer’s disease. Specifically, as a consequence of the deposition of proteins in the brain that characterizes Alzheimer’s disease, the concentration of tau protein is altered in the cerebrospinal fluid [21]. Therefore, the high level of tau protein in the cerebrospinal fluid can be interpreted as a consequence of Alzheimer’s disease, leading to the DAG shown in Fig. 3.

Fig. 3

Directed Acyclic Graph (DAG), Example 2

Directed Acyclic Graph (DAG), Example 2 In this example, we define Alzheimer’s disease by its underlying pathological process instead of based on diagnostic criteria. However, in the real world, direct effects are usually incorporated as part of the diagnostic criteria of the disease for practical clinical purposes. We further assume a direct effect of Y on K without confounding, even though we acknowledge direct effects of a disease are typically also caused by risk factors for the disease (introducing confounding in the Y → K causal relationship depicted in Fig. 3). These strong assumptions are needed to create a simplified, illustrative example. As before, assume we have collected cross-sectional data about Alzheimer’s disease and CSF-tau concentration in a new population C. Using population C data, we can develop another simple diagnostic clinical risk prediction model to predict Alzheimer’s disease using logistic regression. The estimated regression equation would be: Assuming that logistic regression is suitable, its equation fully describes the underlying conditional distribution ℙ(Y|K), while the prevalence of the high CSF-tau (Pr(K = 1)) defines the marginal distribution ℙ(K) of the predictor. Say that we now want to apply this diagnostic clinical risk prediction model developed in population C to detect the presence of Alzheimer’s disease in a population D with a different prevalence of high CSF-tau concentration. However, we are now in an anticausal scenario in which we are trying to use the effect, CSF-tau concentration, to detect the cause, Alzheimer’s disease. Therefore, ℙ(Y|K) does not represent a causal mechanism and is not independent of ℙ(K). Since the marginal distribution of CSF-tau levels changes from ℙ(K) in population C to ℙ (K) in population D, a change in the conditional distribution, ℙ(Y|K), is likely to occur because we are in an anticausal direction [1, 7]. The model developed in population C to describe ℙ(Y|K) will probably not be well calibrated for use in the population D because the underlying conditional distribution of Y on K is different in the two populations. This would also hold if the causal mechanism that leads from Alzheimer’s disease to the high CSF-tau concentration was the same in the two populations, as the equation describing the conditional distribution of Y on K is purely a mathematical artefact and does not describe the causal process. The results of the simulation study investigating whether a strong knowledge of the causal structure underlying the data generation process improves predictor selection compared to other commonly implemented methods are shown in Table 1.

Table 1

Simulation Results: Prediction Tools’ Performance Metrics

	Logistic, Markov Blanket set (Nsim=100,000)	Logistic, all 24 variables (Nsim=100,000)	Logistic, any variables with a path to the outcome (Nsim=100,000)	Logistic, node’s parent variables (Nsim=100,000)	Lasso, all 24 variables (Nsim=100,000)	Ridge, all 24 variables (Nsim=100,000)	Elastic net, all 24 variables (Nsim=100,000)	Random forest, all 24 variables (Nsim=100,000)
FULL RESULTS: Including all simulated datasets
ICI
N Missing	8032	0	8032	37,272	8597	0	8612	1
Mean (SD)	0.01882 (0.00445)	0.01964 (0.00495)	0.01900 (0.00461)	0.02215 (0.00421)	0.01912 (0.00451)	0.03807 (0.02058)	0.01907 (0.00456)	0.04133 (0.01779)
Median	0.01857	0.01925	0.01867	0.02242	0.01888	0.02895	0.01881	0.03636
Range	0.00290–0.03834	0.00289–0.04330	0.00287–0.04330	0.00290–0.03826	0.00287–0.03919	0.00710–0.18537	0.00340–0.04283	0.00704–0.16493
Number of input variables
N Missing	0	0	0	0	0	0	0	0
Mean (SD)	4.0 (2.8)	24.0 (0.0)	18.9 (7.0)	1.2 (1.3)	24.0 (0.0)	24.0 (0.0)	24.0 (0.0)	24.0 (0.0)
Median	3.0	24.0	22.0	1.0	24.0	24.0	24.0	24.0
Range	0.0–19.0	24.0–24.0	0.0–24.0	0.0–9.0	24.0–24.0	24.0–24.0	24.0–24.0	24.0–24.0
Direct comparison: ICI of various methods compared to Markov Blanket-based logistic tool
N Missing	8032	8032	8032	37,272	9140	8032	9147	8033
< ICI logistic MB, N (%)		39,354 (42.79%)	39,540 (42.99%)	4864 (7.75%)	26,514 (29.18%)	8871 (9.65%)	31,089 (34.22%)	1650 (1.79%)
≥ ICI logistic MB, N (%)		52,614 (57.21%)	52,428 (57.01%)	57,864 (92.25%)	64,346 (70.82%)	83,097 (90.35%)	59,764 (65.78%)	90,317 (98.21%)
COMPLETE CASE RESULTS: only including datasets for which ICI could be estimated for all tools
ICI
N Missing	37,841	37,841	37,841	37,841	37,841	37,841	37,841	37,841
Mean (SD)	0.01956 (0.00463)	0.01975 (0.00477)	0.01970 (0.00473)	0.02211 (0.00421)	0.01995 (0.00471)	0.03886 (0.02177)	0.01990 (0.00476)	0.04049 (0.02011)
Median	0.01953	0.01962	0.01960	0.02238	0.01993	0.02883	0.01987	0.03283
Range	0.00290–0.03834	0.00289–0.04330	0.00287–0.04330	0.00290–0.03826	0.00287–0.03919	0.00710–0.18537	0.00340–0.04283	0.00704–0.16493
Number of input variables
N Missing	37,841	37,841	37,841	37,841	37,841	37,841	37,841	37,841
Mean (SD)	4.1 (2.7)	24.0 (0.0)	20.8 (3.9)	1.9 (1.1)	24.0 (0.0)	24.0 (0.0)	24.0 (0.0)	24.0 (0.0)
Median	4.0	24.0	22.0	2.0	24.0	24.0	24.0	24.0
Range	1.0–19.0	24.0–24.0	1.0–24.0	1.0–9.0	24.0–24.0	24.0–24.0	24.0–24.0	24.0–24.0
Direct comparison: ICI of various methods compared to Markov Blanket-based logistic tool
N Missing	37,841	37,841	37,841	37,841	37,841	37,841	37,841	37,841
< ICI logistic MB, N (%)		26,872 (43.23%)	27,124 (43.64%)	4850 (7.80%)	16,887 (27.17%)	6508 (10.47%)	19,959 (32.11%)	1636 (2.63%)
≥ ICI logistic MB, N (%)		35,287 (56.77%)	35,035 (56.36%)	57,309 (92.20%)	45,272 (72.83%)	55,651 (89.53%)	42,200 (67.89%)	60,523 (97.37%)

In a series of 100,000 simulated datasets, we obtained these results for ICI and number of input variables for the eight investigated prediction tools. Full results and complete case results, including only datasets for which ICI could be estimated for all tools are presented

Abbreviations: ICI integrated calibration index, MB Markov Blanket, Nsim number of simulations, SD standard deviation

Simulation Results: Prediction Tools’ Performance Metrics In a series of 100,000 simulated datasets, we obtained these results for ICI and number of input variables for the eight investigated prediction tools. Full results and complete case results, including only datasets for which ICI could be estimated for all tools are presented Abbreviations: ICI integrated calibration index, MB Markov Blanket, Nsim number of simulations, SD standard deviation In 37,272 of the 100,000 simulated datasets, the outcome variable node did not have any parents, therefore it was not possible to assess the performance of logistic regression including only the outcome node’s parent variables as predictors in these cases (Table 1). In 8032 simulated datasets, the outcome variable node did not have any parents or children, therefore it was not possible to assess the performance of the Markov Blanket-based logistic model and the logistic regression including all the variables with a path to the outcome as predictors (Table 1). When the Markov Blanket set was empty, both the lasso and elastic net regression models correctly shrunk all regression coefficients to zero or very close to zero approximately 93.3% of the time, leading to an uncomputable ICI. Overall, the lasso regression selected exactly the Markov Blanket set of variables in at least one of the ten cross-validations in 14,936 (14.9%) simulated datasets. The percentage was higher when the Markov Blanket was empty (93.3%) or included only one variable (46.8%) compared to when it contained two (7.6%) or more variables. This finding supports the idea proposed by Li et al. that there is a link between the lasso regularization and selection algorithm and the identification of the Markov Blanket [22]. Overall, the average ICI of the Markov Blanket-based logistic model (0.01882) was lower compared with all other investigated prediction tools. This model also yielded the lowest average ICI (0.01956) when considering only those datasets in which all prediction tools had computable ICI values (Table 1). In head-to-head comparisons, the ICI of the various prediction tools were greater than or equal to the ICI of the Markov Blanket-based logistic model in the majority of the simulated datasets (range: 57.0 to 98.2%).

Discussion

Through the two simple examples presented, we provide a theoretical basis for the intuition that a diagnostic clinical risk prediction model including causes as predictors may be more transportable [23]. As illustrated in Example 2, transportability in terms of calibration is likely to be lower in anticausal scenarios, in which the predicted outcome is the disease and the predictor is an effect of the outcome [1, 7]. No common causes of Y and K were included in the simplified Example 2, and we note that the transportability of the diagnostic clinical risk prediction model to different populations in similar anticausal scenarios could be higher if the predictor and the disease share one or more common cause(s). The idea that risk prediction models including the direct causes of an outcome of interest as predictors will be more transportable to different settings is also exploited in the causal learning “invariant causal prediction” method [1] and in the machine learning practice of “covariate shift” [1, 7]. In general, we think the field of diagnostic clinical risk prediction modeling could greatly benefit from the practice of incorporating knowledge of the underlying causal structure in modelling strategies. The integration of such information could provide insights into the transportability of a given diagnostic risk prediction model in different settings [7]. Our results empirically demonstrated equal or superior performance of the Markov Blanket-based logistic model, corroborating the theories presented earlier. In the head-to-head comparisons with each of the other approaches, the Markov Blanket-based logistic model yielded an equal or better calibration in more than 57% of all generated datasets (range 57 to 98% across the compared prediction tools). Not only did the Markov Blanket-based logistic model show good performance in terms of calibration but also required considerably fewer input variables than the number of available variables. Moreover, this approach relies explicitly on summarizing causal knowledge, which provides a high degree of interpretability in contrast to commonly encountered causally agnostic approaches. We acknowledge that in real-world settings, it is unlikely to encounter ideal situations in which there is perfect knowledge of the underlying causal structure, all requisite variables are available and complete, and non-linear relationships and interactions are absent. Further research on deviations from these ideal conditions is needed, in particular to understand consequences of model misspecification when statistical interactions or non-linear relationships are present as well as measurement error. Nevertheless, we believe our results provide an important contribution as a theoretical basis for using a DAG that summarizes a priori knowledge of the causal structure to identify predictors in a simple and structured way in an ideal setting.

Conclusions

Through a series of theoretical examples and simulation results, we have shown that strong knowledge of the underlying causal structure can be useful for understanding potential transportability and optimizing predictor selection for a given clinical risk prediction model. In the field of clinical risk prediction model development and application, we think that a priori causal information is often ignored or used intuitively without a structured framework. We are eager to see first applications of the framework we have outlined, further theoretical development, and scientific discussion of this concept. Additional file 1.

7 in total

1. Causal knowledge as a prerequisite for confounding evaluation: an application to birth defects epidemiology.

Authors: Miguel A Hernán; Sonia Hernández-Díaz; Martha M Werler; Allen A Mitchell
Journal: Am J Epidemiol Date: 2002-01-15 Impact factor: 4.897

2. Causal diagrams for epidemiologic research.

Authors: S Greenland; J Pearl; J M Robins
Journal: Epidemiology Date: 1999-01 Impact factor: 4.822

Review 3. APOE and Alzheimer's Disease: Evidence Mounts that Targeting APOE4 may Combat Alzheimer's Pathogenesis.

Authors: Md Sahab Uddin; Md Tanvir Kabir; Abdullah Al Mamun; Mohamed M Abdel-Daim; George E Barreto; Ghulam Md Ashraf
Journal: Mol Neurobiol Date: 2018-07-21 Impact factor: 5.590

Review 4. Diagnosis of Alzheimer's disease utilizing amyloid and tau as fluid biomarkers.

Authors: Jinny Claire Lee; Soo Jung Kim; Seungpyo Hong; YoungSoo Kim
Journal: Exp Mol Med Date: 2019-05-09 Impact factor: 8.718

5. On the interpretability of machine learning-based model for predicting hypertension.

Authors: Radwa Elshawi; Mouaz H Al-Mallah; Sherif Sakr
Journal: BMC Med Inform Decis Mak Date: 2019-07-29 Impact factor: 2.796

6. The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models.

Authors: Peter C Austin; Ewout W Steyerberg
Journal: Stat Med Date: 2019-07-03 Impact factor: 2.373

7. Using marginal structural models to adjust for treatment drop-in when developing clinical prediction models.

Authors: Matthew Sperrin; Glen P Martin; Alexander Pate; Tjeerd Van Staa; Niels Peek; Iain Buchan
Journal: Stat Med Date: 2018-08-02 Impact factor: 2.373

7 in total

14 in total

1. When will individuals meet their personalized probabilities? A philosophical note on risk prediction.

Authors: Olaf M Dekkers; Jesse M Mulder
Journal: Eur J Epidemiol Date: 2020-11-28 Impact factor: 8.082

2. Circulating Palmitoyl Sphingomyelin Is Associated With Cardiovascular Disease in Individuals With Type 2 Diabetes: Findings From the China Da Qing Diabetes Study.

Authors: Yanyan Chen; Hongmei Jia; Xin Qian; Jinping Wang; Meng Yu; Qiuhong Gong; Yali An; Hui Li; Sidong Li; Na Shi; Zhongmei Zou; Guangwei Li
Journal: Diabetes Care Date: 2022-03-01 Impact factor: 19.112

Review 3. DNA methylation-based predictors of health: applications and statistical considerations.

Authors: Paul D Yousefi; Matthew Suderman; Ryan Langdon; Oliver Whitehurst; George Davey Smith; Caroline L Relton
Journal: Nat Rev Genet Date: 2022-03-18 Impact factor: 53.242

4. English Bulldogs in the UK: a VetCompass study of their disorder predispositions and protections.

Authors: Dan G O'Neill; Alison Skipper; Rowena M A Packer; Caitriona Lacey; Dave C Brodbelt; David B Church; Camilla Pegram
Journal: Canine Med Genet Date: 2022-06-15

5. Health of Pug dogs in the UK: disorder predispositions and protections.

Authors: Dan G O'Neill; Jaya Sahota; Dave C Brodbelt; David B Church; Rowena M A Packer; Camilla Pegram
Journal: Canine Med Genet Date: 2022-05-18

Review 6. A scoping review of causal methods enabling predictions under hypothetical interventions.

Authors: Lijing Lin; Matthew Sperrin; David A Jenkins; Glen P Martin; Niels Peek
Journal: Diagn Progn Res Date: 2021-02-04

7. Developing a prediction model to estimate the true burden of respiratory syncytial virus (RSV) in hospitalised children in Western Australia.

Authors: Amanuel Tesfay Gebremedhin; Alexandra B Hogan; Christopher C Blyth; Kathryn Glass; Hannah C Moore
Journal: Sci Rep Date: 2022-01-10 Impact factor: 4.379

8. French Bulldogs differ to other dogs in the UK in propensity for many common disorders: a VetCompass study.

Authors: Dan G O'Neill; Rowena M A Packer; Peter Francis; David B Church; Dave C Brodbelt; Camilla Pegram
Journal: Canine Med Genet Date: 2021-12-16

9. Systematic review and meta-analysis to examine intrapartum interventions, and maternal and neonatal outcomes following immersion in water during labour and waterbirth.

Authors: Ethel Burns; Claire Feeley; Priscilla J Hall; Jennifer Vanderlaan
Journal: BMJ Open Date: 2022-07-05 Impact factor: 3.006

10. Differences in estimates for 10-year risk of cardiovascular disease in Black versus White individuals with identical risk factor profiles using pooled cohort equations: an in silico cohort study.

Authors: Ramachandran S Vasan; Edwin van den Heuvel
Journal: Lancet Digit Health Date: 2022-01