| Literature DB >> 28129774 |
Maarten van Iterson1, Erik W van Zwet2, Bastiaan T Heijmans3.
Abstract
We show that epigenome- and transcriptome-wide association studies (EWAS and TWAS) are prone to significant inflation and bias of test statistics, an unrecognized phenomenon introducing spurious findings if left unaddressed. Neither GWAS-based methodology nor state-of-the-art confounder adjustment methods completely remove bias and inflation. We propose a Bayesian method to control bias and inflation in EWAS and TWAS based on estimation of the empirical null distribution. Using simulations and real data, we demonstrate that our method maximizes power while properly controlling the false positive rate. We illustrate the utility of our method in large-scale EWAS and TWAS meta-analyses of age and smoking.Entities:
Keywords: Bias; Empirical null distribution; Epigenome- and transcriptome-wide association studies; Gibbs sampler; Inflation; Meta-analysis
Mesh:
Year: 2017 PMID: 28129774 PMCID: PMC5273857 DOI: 10.1186/s13059-016-1131-9
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1Inflated epigenome- and transcriptome-wide association studies. Quantile-quantile (QQ) plots for EWAS (panels a and b) and TWAS (panels c and d) performed on the LifeLines (LL) and Leiden Longevity Study (LLS) cohorts for the phenotypes age and smoking status. Results for LL are indicated in green and LLS in orange. QQ-plots show the observed minus log10-transformed P values obtained from a linear model corrected for known biological and technical covariates against quantiles from the theoretical null distribution. Strong inflation, as estimated according to [9], was observed for both EWAS and TWAS of age, while for the EWAS and TWAS of smoking the amount of inflation is smaller (notice different y-axis scales)
Fig. 2The genomic inflation factor overestimates inflation in the presence of a moderate proportion of true associations. The box-plot summarizes the estimated inflation for simulated data with different amounts of true associations. One hundred sets of test statistics were generated with different amounts of true associations (20%, 10% and 5%) but without any true inflation; i.e., the inflation factor should be equal to one. The genomic inflation factor was calculated using [9]. A clear dependence on the number of true associations is seen for the genomic inflation factor
Fig. 3Bias in transcriptome-wide association studies. Histogram of test statistics from the TWAS of age in the LifeLines (LL) cohort. Each panel shows a different null distribution. a Theoretical null (green): normal distribution with mean and variance (0.0, 1.02), b empirical null (brown): normal distribution with estimated mean and variance using our Bayesian method (0.23, 1.52), c inflated null (purple): normal distribution with zero mean and variance equal to the estimated inflation estimated using the genomic control method (0.0, 1.52), and d permutation null (pink): normal distribution with permutation-based estimates of mean and variance (−0.006, 1.12). For comparison the theoretical null (green) is shown in each panel
Fig. 4Histogram of test statistics for TWAS on age (a and b) and smoking status (c and d) performed on two cohorts: LifeLines (LL) and Leiden Longevity Study (LLS). The lines represent the three-component normal mixture fitted as estimated using our Bayesian method. The black line represents the fit of the mixture, the red line the fit of the null component (the empirical null distribution with estimated mean and variance reported). The blue and green lines represent the estimated fits of the alternative components (proportion of positively and negatively associated features)
Correction for unobserved covariates reduces test-statistic bias and inflation
| Method | Genomic infl. factor | Bayesian infl. factor (bias) |
|---|---|---|
|
| ||
| 1. No | 1.322 | 1.229 (0.000) |
| 2. Known | 1.237 | 1.169 (0.080) |
| 3. PC (1) | 1.257 | 1.183 (0.048) |
| 4. PC (2) | 1.222 | 1.147 (-0.002) |
| 5. PC (3) | 1.160 | 1.090 (-0.139) |
| 6. SVA (3) | 1.181 | 1.116 (0.022) |
| 7. RUV-Res (3) | 1.332 | 1.166 (0.086) |
| 8. RUV-Emp (3) | 1.197 | 1.130 (0.021) |
| 9. CATE (2) | 1.161 | 1.077 (0.053) |
Genomic inflation factor estimates (, square root since the test statistics follow a normal distribution and not a χ 2) and inflation factor (and bias) estimates obtained using the Bayesian estimation of the empirical null distribution from test statistics obtained by fitting linear models for a TWAS of age in the Leiden Longevity Study (LLS) cohort subset of 500 individuals. Nine different models were fitted using different approaches to estimate and correct for unobserved covariates: (1) only known covariates, (2) including known covariates, (3), (4), and (5) known covariates plus one, two, or three principal component(s), respectively, (6) known covariates plus three optimal surrogate variables estimated using SVA [26], (7) known covariates plus three unobserved covariates estimated using RUV [32] with the residual method, (8) known covariates plus three unobserved covariates estimated using RUV [32] with the empirical method, (9) known covariates plus two optimal latent variables estimated using CATE [17] (within parentheses the number of principal components, optimal number of surrogate variables, or optimal number of latent factors)
Bias and inflation correction after adjustment for confounding factors yields optimal power
| Method | False positive rate | Power |
|---|---|---|
| mean (stdev) | mean (stdev) | |
| No confounding adjustment | ||
| No correction | 0.720 (0.0360) | 0.720 (0.049) |
| Genomic control | 0.001 (0.0020) | 0.005 (0.007) |
| Bayesian control | 0.029 (0.0076) | 0.050 (0.018) |
| Confounding adjustment | ||
| No correction | 0.060 (0.0056) | 0.860 (0.037) |
| Calibration | 0.030 (0.0042) | 0.770 (0.053) |
| Bayesian control | 0.058 (0.0052) | 0.860 (0.041) |
| oracle | 0.052 (0.0052) | 0.850 (0.039) |
Mean and standard deviation of the number of false positives and true positives (power) for a simulation study repeated 100×. Data were generated according to the simulation setup of Wang et al. [17]. The table summarizes the results for the naive approach of no adjustment for confounding factors and adjusting for confounding factors using CATE. Both in combination with different approaches are used to control for inflation (and bias): no correction, correction using genomic control, correction using the median and median absolute deviation (MAD), calibration [17], and using our Bayesian method. As a comparison the oracle method is shown where the simulated confounding factors have been added to the linear model
Empirical null estimates from correlated test statistics yield proper control of the false positives rate without any reduction in power
| Method | False positive rate | Power |
|---|---|---|
| mean (stdev) | mean (stdev) | |
| Uncorrelated | ||
| No correction | 0.050 (0.003) | 0.770 (0.020) |
| Genomic control | 0.028 (0.003) | 0.710 (0.020) |
| Bayesian control | 0.052 (0.003) | 0.770 (0.020) |
| Correlated | ||
| No correction | 0.040 (0.030) | 0.770 (0.020) |
| Genomic control | 0.023 (0.006) | 0.730 (0.090) |
| Bayesian control | 0.054 (0.020) | 0.800 (0.060) |
Mean and standard deviation of the number of false positives and true positives (power) for a simulation study repeated 100×. Correlated test statistics were generated according to the simulation setup of Efron [51]. The table summarizes the results for uncorrelated test statistics and correlated test statistics, without any correction for inflation or bias, using genomic control and using our Bayesian method
Bias and inflation of test statistics for EWAS and TWAS across four cohorts on age and smoking status
| EWAS | TWAS | ||||
|---|---|---|---|---|---|
| Age | Smoking | Age | Smoking | ||
| infl. bias | infl. bias | infl. bias | infl. bias | ||
| Uncorrected | CODAM | 1.17 0.100 (1.19) | 1.02 0.040 (1.03) | 1.13 -0.030 (1.20) | 1.05 0.100 (1.06) |
| LL | 1.45 -0.500 (1.94) | 1.07 0.009 (1.08) | 1.17 0.040 (1.39) | 1.15 0.080 (1.22) | |
| LLS | 1.30 0.100 (1.36) | 1.05 -0.200 (1.08) | 1.18 0.050 (1.26) | 1.15 -0.010 (1.17) | |
| RS | 1.34 0.700 (1.57) | 0.99 -0.100 (1.01) | 1.11 -0.005 (1.12) | 1.10 -0.010 (1.12) | |
| Corrected | CODAM | 1.01 -0.000 (1.01) | 1.00 0.000 (1.01) | 1.02 -0.010 (1.06) | 1.00 0.000 (1.00) |
| LL | 1.00 -0.000 (1.27) | 1.00 0.000 (1.01) | 1.02 0.010 (1.19) | 1.02 0.010 (1.06) | |
| LLS | 1.02 0.007 (1.05) | 1.00 -0.003 (1.01) | 1.03 0.001 (1.07) | 1.02 -0.010 (1.02) | |
| RS | 1.00 0.000 (1.02) | 0.99 0.000 (1.01) | 1.02 -0.006 (1.01) | 1.01 0.001 (1.02) | |
| 1.19 -0.030 (1.47) | 1.05 0.020 (1.10) | 1.04 0.030 (1.28) | 1.06 -0.002 (1.14) | ||
| meta-analysis | |||||
The table shows the bias and inflation as obtained using Bayesian method to estimate the empirical null and (within parentheses) using the genomic inflation factor both before correction and after correction for inflation (and bias in case of empirical control). The estimated inflation for the meta-analysis results are after control for inflation and bias in the individual cohorts and (within parentheses) inflation after applying genomic control. Sample sizes of the cohorts for EWAS/TWAS were n =164/181 (CODAM), n =744/605 (LL), n =683/589 (LLS), and n =612/535 (RS)
Fig. 5Manhattan plots meta-analyses across four cohorts of EWAS and TWAS of age and smoking status. Panel a shows the meta-analysis results of the EWAS of age as − log10P values and with reverse sign for the TWAS of age as log10P values. Panel b shows the same figure for smoking. The black line indicates 0.05 Bonferroni thresholds. Red gene names highlight the top 10 (nearest) genes resulting from the EWAS and TWAS. Black gene names denote genes that were identified in both the EWAS and TWAS (genes for EWAS are the genes closest to the significant CpG)