Robust inference of bi-directional causal relationships in presence of correlated pleiotropy with GWAS summary data.

Abstract

To infer a causal relationship between two traits, several correlation-based causal direction (CD) methods have been proposed with the use of SNPs as instrumental variables (IVs) based on GWAS summary data for the two traits; however, none of the existing CD methods can deal with SNPs with correlated pleiotropy. Alternatively, reciprocal Mendelian randomization (MR) can be applied, which however may perform poorly in the presence of (unknown) invalid IVs, especially for bi-directional causal relationships. In this paper, first, we propose a CD method that performs better than existing CD methods regardless of the presence of correlated pleiotropy. Second, along with a simple yet effective IV screening rule, we propose applying a closely related, state-of-the-art MR method in reciprocal MR, showing that it performs almost identically to the new CD method when their modeling assumptions hold; however, if the modeling assumptions are violated, the new CD method is expected to control type I errors better. Notably, bi-directional causal relationships impose some unique challenges beyond those for uni-directional ones, and thus require special treatment. For example, we point out for the first time several scenarios where a bi-directional relationship, but not a uni-directional one, can unexpectedly cause the violation of some weak modeling assumptions commonly required by many robust MR methods. We also offer some numerical support and a modeling justification for the application of our new methods (and more generally MR) to binary traits. Finally, we applied the proposed methods to 12 risk factors and 4 common diseases, confirming mostly well-known uni-directional causal relationships, while identifying some novel and plausible bi-directional ones, such as between body mass index and type 2 diabetes (T2D), and between diastolic blood pressure and stroke.

Year: 2022 PMID: 35576237 PMCID: PMC9135345 DOI: 10.1371/journal.pgen.1010205

Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 6.020

1 Introduction

It is of great interest to infer causal relationships between pairs of complex traits or diseases, for example for treatment/intervention development and drug repurposing [1, 2]; this, however, is quite challenging and had barely been touched until recently. The availability of large-scale GWAS summary data and the use of SNPs as instrumental variables (IVs) in Mendelian randomization (MR) have made such inference possible [3-5]. However, most MR methods and analyses rest on a critical and strong assumption that there is only a uni-directional relationship between two traits and that the direction is known; that is, by treating one trait as the exposure and the other as the outcome, one assumes that the causal relationship, if it exists, can only be from the exposure to the outcome. To infer the causal direction between two traits (under the uni-directional assumption), several methods based on comparing correlations between SNPs/IVs and each trait have recently been proposed, including Steiger’s method based on a single SNP (that is assumed to be a valid IV) [6], and CD-Ratio and CD-Egger based on multiple SNPs, which can be more powerful than Steiger’s method [7]. CD-Egger, similar to Egger regression in MR [8], is also more robust than the other two methods by allowing invalid IVs under the InSIDE assumption; that is, CD-Egger allows invalid IVs with uncorrelated pleiotropy, but not correlated pleiotropy [9]. The first goal here is to develop a correlation-based causal direction (CD) inference method based on constrained maximum likelihood, called CD-cML, and to show that it is more powerful and more robust than the above methods, especially in the presence of correlated pleiotropy. Given widespread pleiotropy [10, 11], it is of utmost importance for any method to be robust to pleiotropy, especially correlated pleiotropy, which is more challenging to deal with.
However, the above CD methods are applicable to inferring only uni-directional, but not bi-directional, causal relationships. In a bi-directional relationship, each of the two traits may be causal to the other at the same time. In practice, there may be bi-directional causal relationships between some traits (e.g. between insomnia and some major psychiatric disorders [12]), or at least we may not be able to exclude a priori the possibility of such bi-directional relationships. Alternatively, reciprocal (also called bidirectional) MR can be applied by treating each of the two traits in turn as the exposure and the other as the outcome [13, 14]. However, as shown in [7, 15], bidirectional MR (with the use of many MR methods) does not perform well for several reasons, including the following: assuming the true causal direction is from X to Y for two traits X and Y, if an SNP is causal to X (and the sample sizes are large enough), the SNP is associated with both X and Y, and thus may be considered a candidate IV for both traits; when the SNP is used as an IV for direction X to Y, it will confirm the causal association; however, if it is used as an IV for direction Y to X, it will also yield a non-zero estimate of the causal effect of Y on X, leading to an incorrect conclusion. A naive remedy is to remove any SNP associated with both traits, but this leads not only to loss of power (with fewer SNPs as IVs), but also to biased inference (e.g., towards the causal direction X to Y if the truth is Y to X and if the GWAS sample size or power for X is much larger than for Y) [16]. Here we adopt a simple but effective screening/filtering rule based on a simple heuristic: no SNP will be used as an IV for both traits, because no SNP can be valid for both traits.
If an SNP is associated with both traits, then, following Steiger’s method, we use it only for the trait with which its absolute correlation is larger (because it is more likely to be a valid IV for that trait) [16]. Furthermore, there are some new MR methods, such as constrained maximum likelihood (MR-cML) [17], that are more robust to both uncorrelated and correlated pleiotropy. Our second goal here is to show that, by incorporating MR-cML and the IV screening rule in reciprocal MR, the resulting method, still called MR-cML for simplicity, performs well, in fact almost identically to CD-cML if their modeling assumptions hold; otherwise, CD-cML controls type I error better and is more conservative. With the two robust and powerful methods, we show their application to inferring bi-directional relationships, which has been largely neglected in the literature. Notably, inferring bi-directional causal relationships is far more challenging than inferring uni-directional ones: for example, we point out for the first time that a bi-directional causal relationship generates a few new scenarios in which either the InSIDE assumption or the plurality condition required by many existing robust MR methods will be violated (e.g. IVW (random effect), Egger regression [8] and RAPS [18] for the former; our cML methods, MR-ContMix [19], MR-Mix [20], MR-Lasso [21] and MR-Weighted Mode [22] for the latter). We applied the methods to 48 risk factor-complex disease pairs formed from 12 cardiometabolic risk factors, 3 cardiometabolic diseases (T2D, stroke and CAD) and asthma (more as a negative control), identifying some interesting bi-directional causal relationships, such as between diastolic blood pressure and stroke, and between body mass index and T2D [23].
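
The screening rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `screen_ivs`, its arguments and the tie-breaking choice are hypothetical, and the per-SNP sample correlations and marginal association p-values for both traits are assumed to be already available.

```python
import numpy as np

def screen_ivs(r_x, r_y, p_x, p_y, alpha):
    """Assign each SNP to at most one trait (hypothetical helper).

    r_x, r_y : sample correlations of each SNP with traits X and Y
    p_x, p_y : marginal association p-values of each SNP with X and Y
    alpha    : significance cutoff used to call a SNP "associated"
    Returns two boolean masks: SNPs kept as IVs for X, and for Y.
    """
    r_x, r_y, p_x, p_y = (np.asarray(v, dtype=float) for v in (r_x, r_y, p_x, p_y))
    sig_x = p_x < alpha                      # associated with X
    sig_y = p_y < alpha                      # associated with Y
    both = sig_x & sig_y                     # candidate IV for both traits
    x_stronger = np.abs(r_x) > np.abs(r_y)   # Steiger-style comparison
    # A SNP associated with both traits is kept only for the trait with the
    # larger absolute correlation; exact ties go to Y (an arbitrary choice).
    use_for_x = sig_x & (~both | x_stronger)
    use_for_y = sig_y & (~both | ~x_stronger)
    return use_for_x, use_for_y
```

By construction the two masks never overlap, so no SNP is used as an IV in both directions.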

2 Results

2.1 Overview of methods

Given two traits, X and Y, and two independent GWAS datasets for the two traits, our goal is to infer their possibly bi-directional causal relationship. One of the most challenging issues is the presence of a hidden (i.e. unobserved) confounder (or equivalently, an aggregate of many hidden confounders), denoted by U, which is associated with both X and Y with effect sizes θ_XU and θ_YU respectively. There are three possible sets of candidate SNPs to be used as IVs: (1) {g_X} is the set of valid IVs for X, having direct effects α’s only on X; (2) {g_Y} is the set of valid IVs for Y, having direct effects β’s only on Y; (3) {g_B} is the set of invalid IVs directly influencing both X and Y, and possibly U, with direct effects γ’s, η’s and ξ’s respectively. Fig 1 illustrates the true causal model, in which θ_XY and θ_YX, the causal effects between the two traits, are the unknown parameters of interest.

Fig 1

The true causal model with two traits X and Y of interest.

U is a hidden confounder (or an aggregate of hidden confounders). The IVs in {g_X} and {g_Y} are valid for X and Y respectively, and the IVs in {g_B} are invalid ones. The true classification of {g_X}, {g_Y} and {g_B} is unknown and needs to be estimated. The arrows give the directions of the direct causal effects (with the effect sizes shown). In particular, θ_XY and θ_YX are the causal effects from X to Y and from Y to X respectively, and are the parameters of interest.

Define the (population) Pearson correlations between each candidate SNP/IV g and the two traits as ρ_Xg = corr(X, g) and ρ_Yg = corr(Y, g). It is shown in Methods (Eq (1)) that the two correlations are linked through two parameters, K_XY and K_YX. If there is a causal direction from X to Y, we have |K_XY| < 1 (under a suitable condition shown in Methods) and K_XY ≠ 0, which can be used to infer the causal direction of X to Y. Similarly, we use |K_YX| < 1 and K_YX ≠ 0 to infer the causal direction of Y to X. The idea is similar to that used by other CD methods, i.e. Steiger’s method based on a single valid IV, CD-Ratio on multiple valid IVs and CD-Egger on multiple possibly invalid IVs without correlated pleiotropy (i.e. when the InSIDE assumption holds) [6, 7]. Since K_XY and K_YX are unknown, we propose a constrained maximum likelihood method, called CD-cML, to infer the two parameters and thus the causal direction. Briefly, based on the given GWAS (summary) data, we calculate the sample (Pearson) correlations between each candidate SNP/IV and each trait, say r_Xg and r_Yg, which are asymptotically normal and consistent for the (population) correlations ρ_Xg and ρ_Yg; with Eq (1), we can write down the normal-based log-likelihood under the constraint that the number of invalid IVs is equal to a given integer, say m ≥ 0. We try each possible value of m, then consistently select the best one based on the Bayesian Information Criterion (BIC). The resulting constrained maximum likelihood estimates (cMLEs), say K̂_XY and K̂_YX, are consistent for K_XY and K_YX respectively, and are asymptotically normal.
Hence we can construct a normal-based confidence interval for each of K_XY and K_YX, thus drawing inference on the two possible causal directions from X to Y and from Y to X.

A similar method, called MR-cML, has been used to estimate θ_XY or θ_YX in the framework of MR [17]. MR-cML performs well under correlated pleiotropy. Here we also propose applying MR-cML to reciprocal MR to infer both θ_XY and θ_YX, and thus a possibly bi-directional causal relationship between X and Y; for simplicity, we still call such a reciprocal MR approach MR-cML. MR and CD methods are related but different: for example, for the direction of X to Y, MR methods are based on inferring whether θ_XY = 0; in contrast, CD methods are based on whether both K_XY ≠ 0 and |K_XY| < 1. Accordingly, because of the second constraint, we expect that CD-cML will sometimes be more conservative than MR-cML, yielding smaller type I error and lower power.

We also propose a simple yet effective method for SNP/IV screening based on a simple heuristic: no SNP can be a valid IV for both X and Y. Thus, if an SNP is (marginally) associated with both traits, we will use it only for the trait with which it is more strongly correlated. This screening rule is just a simple application of Steiger’s method [6], and was mentioned in [16]. It is especially useful in the presence of bi-directional causal relationships. For example, when inferring whether there is a causal direction from X to Y in Fig 1, if θ_YX ≠ 0, all IVs in set {g_Y} are associated with trait X and thus are candidate IVs, though they are all invalid IVs; the screening rule will eliminate them as IVs (because they will be more highly correlated with trait Y than with trait X; see Methods). Now we consider what happens if they are indeed used as IVs. Assuming the set size |{g_Y}| is larger than that of the valid IV set, |{g_X}|, the invalid IV set {g_Y} forms the largest (i.e. plurality) group, which (incorrectly) estimates θ_XY as 1/θ_YX (asymptotically); in other words, these IVs lead to the violation of the plurality condition required by cML (and several other MR methods, such as MR-ContMix [19], MR-Mix [20], MR-Lasso [21] and MR-Weighted Mode [22]). In addition, they will also lead to the violation of the InSIDE assumption: it is easy to verify that for any g ∈ {g_Y}, its effect size on trait X is β_Xi = θ_YX β_i, which is clearly correlated with β_i, its direct effect on Y.

There is another source leading to the violation of the InSIDE assumption, in addition to the more widely recognized one with ξ_i ≠ 0 (i.e. some IVs are correlated with the hidden confounder) and the one pointed out above. It is due to bi-directional causal relationships, again demonstrating that it is more challenging to deal with bi-directional relationships than with uni-directional ones. Consider the causal direction of X to Y: even if ξ_i = 0, if θ_YX ≠ 0, any SNP g ∈ {g_B} would lead to the violation of InSIDE, because its association strength with X and its direct effect size on Y would be γ_i + η_i θ_YX and η_i respectively, which are clearly correlated. The violation of the InSIDE assumption will lead to biased inference by several popular random-effects model-based methods (that treat direct effects as random effects), such as IVW (random effect), Egger regression [8] and RAPS [18].

Finally, we apply the data perturbation (DP) scheme of [17] for better finite-sample inference: it accounts for uncertainty in selecting invalid IVs in CD- and MR-cML, leading to better control of type I errors. We suffix a method with “-DP” and “-S” to indicate its use of data perturbation and IV screening respectively. See Section 4.6 for a summary of different methods.
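
A short derivation may help connect the correlation-based quantities to the causal effects. It is a sketch under an assumed linear structural model with additive effects, in which each SNP is independent of the other SNPs and of the error terms, and valid IVs are independent of U. The structural equations, the trait standard deviations σ_X and σ_Y, the confounder effects θ_XU and θ_YU, and the reading of K_XY as θ_XY σ_X/σ_Y are assumptions made here for illustration and may differ in detail from Eq (1) in Methods.

```latex
% Assumed structural equations (illustrative; not quoted from Methods)
\begin{align*}
X &= \theta_{YX}\, Y + \sum_i \alpha_i\, g_{X,i} + \sum_i \gamma_i\, g_{B,i} + \theta_{XU}\, U + \epsilon_X, \\
Y &= \theta_{XY}\, X + \sum_i \beta_i\, g_{Y,i} + \sum_i \eta_i\, g_{B,i} + \theta_{YU}\, U + \epsilon_Y .
\end{align*}

% A valid IV g for X is independent of every term in the Y-equation except X:
\begin{align*}
\operatorname{cov}(Y, g) = \theta_{XY}\, \operatorname{cov}(X, g)
\quad\Longrightarrow\quad
\rho_{Yg} = \theta_{XY}\, \frac{\sigma_X}{\sigma_Y}\, \rho_{Xg} .
\end{align*}

% A valid IV g for Y is independent of every term in the X-equation except Y:
\begin{align*}
\operatorname{cov}(X, g) = \theta_{YX}\, \operatorname{cov}(Y, g)
\quad\Longrightarrow\quad
\frac{\operatorname{cov}(Y, g)}{\operatorname{cov}(X, g)} = \frac{1}{\theta_{YX}} .
\end{align*}
```

The last line is the ratio that an MR analysis of X to Y would compute from an SNP in the set of valid IVs for Y, which is why such SNPs, if not screened out, point to 1/θ_YX rather than to θ_XY.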

2.2 Simulations

2.2.1 Main simulations

We generated simulated data following the true causal model in Fig 1 for two continuous traits X and Y. There were 15, 10 and 10 SNPs/IVs in sets {g_X}, {g_Y} and {g_B}, respectively, with effect sizes α1 to α15, β1 to β10, γ1 to γ10, and η1 to η10 drawn from uniform distributions on (−0.3, −0.2) and (0.2, 0.3). To introduce correlated pleiotropy, we generated ξ’s from a uniform distribution in the range of (−0.2, 0.2); otherwise, we set ξ’s at 0. The MAFs of the SNPs were 0.3. The random errors ϵ_X and ϵ_Y were independently drawn from N(0, 1), and ϵ_U was from N(0, 2). We considered various combinations of the true causal effect sizes θ_XY and θ_YX in the range from 0 to 0.3. For each set-up, we generated 500 pairs of two independent GWAS samples, one for each of the two traits and each of sample size n = N1 = N2 = 50000. We also studied the scenarios with at least one of X and Y being binary. To generate binary traits, we generated the continuous X and Y first, then dichotomized one or both of them by setting the largest 30% of their values to 1 and the other 70% to 0. For each dataset, we first generated individual-level data, then calculated the summary statistics with marginal linear regression for a continuous trait and marginal logistic regression for a binary trait. We set the significance cutoff at 0.05/35 to select relevant IVs for both traits before applying any CD and MR methods. We summarize the main results in terms of (empirical) type I error and power in Figs 2–5. Figs 2 and 4 show the results for both X and Y being continuous, while Figs 3 and 5 show those for both X and Y being binary; the results were similar regardless of the traits being continuous or binary, though it was slightly more powerful to use the continuous traits than the binary traits (due to the loss of information from dichotomizing a continuous trait).
The top-left panels (for “θ_XY = 0, X to Y”) show (empirical) type I error for the direction X → Y; in the right panels, the results at θ_YX = 0 are type I error for the direction Y → X; all other results are (empirical) power. In general, MR-cML-DP-S and CD-cML-DP-S performed almost identically: they could control type I error satisfactorily while having high power. In contrast, all other methods, namely CD-Ratio, CD-Egger and Steiger’s (single-SNP-based) method combined over multiple IVs (by majority voting, “-MV”), could not control type I error, and might have low power. Interestingly, CD-Egger had largely inflated type I error rates even when none of the IVs were correlated with the hidden confounder (i.e. ξ = 0), unless θ_XY = θ_YX = 0, as shown in Figs 2 and 3. This might sound surprising, but convincingly showed that detecting bi-directional causal relationships is much more challenging than detecting uni-directional ones. As explained in Methods, even if ξ = 0, in the presence of a bi-directional causal relationship the InSIDE assumption required by Egger regression could be violated. On the other hand, when the SNPs from {g_B} were correlated with the hidden confounder (with ξ ≠ 0), the InSIDE assumption would always be violated, leading to inflated type I error of CD-Egger as shown in Figs 4 and 5. Notably, neither MR-cML-DP-S nor CD-cML-DP-S suffered from any of these problems.
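
The data-generating process for two continuous traits can be sketched as follows. This is a minimal illustration, not the authors' simulation code. The structural equations, the reduced-form solution, the confounder effects `theta_xu` and `theta_yu` (set to 1 here), the reading of N(0, 2) as variance 2, Hardy-Weinberg genotype sampling, and all function names are assumptions. The sketch draws one sample for each trait and computes marginal regression slopes only; dichotomization, logistic regression, IV selection and the CD/MR methods themselves are not covered.

```python
import numpy as np

N_X, N_Y, N_B = 15, 10, 10  # sizes of the three SNP sets given in the text

def draw_effects(rng, xi_max=0.0):
    """Draw direct SNP effects; xi_max > 0 adds SNP-confounder effects."""
    def two_sided(k):
        # uniform on (-0.3, -0.2) or (0.2, 0.3), with a random sign
        return rng.uniform(0.2, 0.3, k) * rng.choice([-1.0, 1.0], k)
    xi = rng.uniform(-xi_max, xi_max, N_B) if xi_max > 0 else np.zeros(N_B)
    return {"alpha": two_sided(N_X), "beta": two_sided(N_Y),
            "gamma": two_sided(N_B), "eta": two_sided(N_B), "xi": xi}

def simulate_sample(eff, theta_xy, theta_yx, n, rng, theta_xu=1.0, theta_yu=1.0):
    """Generate genotypes and both traits for one sample of size n."""
    # Genotypes with MAF 0.3 (assumes Hardy-Weinberg equilibrium)
    g = rng.binomial(2, 0.3, size=(n, N_X + N_Y + N_B)).astype(float)
    g_x, g_y, g_b = np.split(g, [N_X, N_X + N_Y], axis=1)
    u = g_b @ eff["xi"] + rng.normal(0.0, np.sqrt(2.0), n)
    # Everything entering each trait other than the other trait
    a = g_x @ eff["alpha"] + g_b @ eff["gamma"] + theta_xu * u + rng.normal(0.0, 1.0, n)
    b = g_y @ eff["beta"] + g_b @ eff["eta"] + theta_yu * u + rng.normal(0.0, 1.0, n)
    # Solve X = theta_yx*Y + a and Y = theta_xy*X + b simultaneously
    det = 1.0 - theta_xy * theta_yx
    return g, (a + theta_yx * b) / det, (b + theta_xy * a) / det

def marginal_slopes(g, trait):
    """Per-SNP slope from a marginal linear regression of the trait on each SNP."""
    gc = g - g.mean(axis=0)
    return gc.T @ (trait - trait.mean()) / (gc ** 2).sum(axis=0)

def demo(theta_xy=0.3, theta_yx=0.2, n=50_000, seed=1):
    rng = np.random.default_rng(seed)
    eff = draw_effects(rng)
    g1, x1, _ = simulate_sample(eff, theta_xy, theta_yx, n, rng)  # sample for X
    g2, _, y2 = simulate_sample(eff, theta_xy, theta_yx, n, rng)  # sample for Y
    bx, by = marginal_slopes(g1, x1), marginal_slopes(g2, y2)
    sl_y = slice(N_X, N_X + N_Y)
    # Valid IVs for X: slope ratio by/bx estimates theta_xy.
    # Valid IVs for Y: slope ratio bx/by estimates theta_yx.
    return np.median(by[:N_X] / bx[:N_X]), np.median(bx[sl_y] / by[sl_y])
```

Calling `demo()` returns two median ratio estimates, one from each set of valid IVs. The second ratio is the reciprocal of what an X-to-Y analysis would compute from the same SNPs, in line with the 1/θ_YX behaviour described in Section 2.1.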

Fig 2

Empirical type I error and power (y-axis) for both X and Y continuous, ξ = 0 (i.e. no correlated pleiotropy), θ_XY = 0 (top panels) and θ_XY = 0.3 (bottom panels) for various values of θ_YX (x-axis).