| Literature DB >> 27142380 |
Kevin McGregor1,2, Sasha Bernatsky2, Ines Colmegna3, Marie Hudson2,4,5, Tomi Pastinen6,7, Aurélie Labbe1,8,9, Celia M T Greenwood10.
Abstract
BACKGROUND: Many different methods exist to adjust for variability in cell-type mixture proportions when analyzing DNA methylation studies. Here we present the result of an extensive simulation study, built on cell-separated DNA methylation profiles from Illumina Infinium 450K methylation data, to compare the performance of eight methods including the most commonly used approaches.Entities:
Keywords: Cell-type mixture; DNA methylation; Deconvolution; Matrix decomposition
Mesh:
Year: 2016 PMID: 27142380 PMCID: PMC4855979 DOI: 10.1186/s13059-016-0935-y
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1Clustered heat map showing patterns of methylation in 46 SARD samples (columns) and 200 CpG sites (rows). The sites were selected to highlight the methylation differences between cell types. Consequently, the samples cluster by cell type: monocytes, B cells, and then T cells
Fixed parameters in the simulation design
| Parameter | Description |
|---|---|
|
| Number of CpGs chosen to be associated with phenotype in simulation |
|
| Mean of the cell-type-specific DMS effects for cell type |
|
| Standard deviation of cell-type-specific DMS effects for cell type |
|
| Variability of individual deviations at probe j in cell type k |
|
| Expected proportion of the mixture from cell types 1 and 2 when the phenotype |
|
| Average cell-type mixture proportions for cell types 1 and 2 for subjects with phenotype level |
|
| Precision of simulated cell mixture distributions. A greater value corresponds to more clearly defined differences in cell-type proportions with respect to the phenotype |
Parameter choices for the simulation scenarios
| Simulation scenario | ( | ( |
|
|
|
| |
|---|---|---|---|---|---|---|---|
| 1 | Distinct differences | (−0.05,0.5) | (0.05, 0.75) | Unif(0.1, 2) | 100 |
|
|
| 2 | No confounding | (0.25, 0.25) | (0.5, 0.5) | 0.1 | 100 |
|
|
| 3 | Opposite effects | (−0.75,0.75) | (0.1, 0.1) | 0.1 | 100 |
|
|
| 4 | High precision | (0.3, 0.1) | (0.1, 0.1) | 0.1 | 200 |
|
|
| 5 | Low precision | (0.3, 0.1) | (0.1, 0.1) | 0.1 | 10 |
|
|
| 6 | Continuous phenotype | (−0.05,0.25) | (0.05, 0.15) | 0.1 | 100 |
|
|
| 7 | Few assoc. sites | (1, 0.95) | (0.05, 0.05) | Unif(0.1, 2) | 100 |
|
|
| 8 | Many assoc. sites block 1 b | (0.1,0.4) | (0.01, 0.01) | Unif(0.1, 2) | 100 |
|
|
| Many assoc. sites block 2 c | (0.2,0.7) | (0.01, 0.01) | Unif(0.1, 2) | 100 |
|
|
k=1 corresponds to monocytes, and k=2 corresponds to CD4 + T cells
aAverage change in cell-type proportion for unit increase in phenotype
bBackground correlation for block 1 was 0.4
cBackground correlation for block 2 was 0.5
Fig. 2QQ plots showing distributions of p values in simulation scenario 1 where the true effects in the different cell types have very distinct distributions. Results are shown with no adjustment for cell-type mixture as well as with eight other methods; these are split across two panels, (a) and (b), to clarify the display
Performance metrics under simulation scenario 1 (distinct associations between cell types)
| Method | Number of false positives | Power | KS | GIF |
|
|---|---|---|---|---|---|
| Unadjusted | 248 | 0.208 | 0.168 | 1.60 | – |
| Ref-based | 14 | 0.146 | 0.026 | 0.92 | – |
| Ref-free | 375 | 0.248 | 0.097 | 1.33 | 13 |
| SVA | 32 | 0.124 | 0.021 | 1.05 | 10 |
| ISVA | 14 | 0.134 | 0.004 | 0.97 | 12 |
| EWASher | 3 | 0.036 | 0.044 | 0.92 | – |
| CellCDec | 7 | 0.094 | 0.008 | 0.98 | – |
| Deconf | 13 | 0.178 | 0.023 | 0.93 | – |
| RUV | 36 | 0.170 | 0.018 | 1.04 | 43 |
KS Kolmogorov–Smirnov statistic, GIF genomic inflation factor
aEstimated latent dimension
Performance metrics under simulation scenario 2 (no confounding)
| Method | Number of false positives | Power | KS | GIF |
|
|---|---|---|---|---|---|
| Unadjusted | 38 | 0.642 | 0.059 | 1.09 | – |
| Ref-based | 13 | 0.594 | 0.030 | 0.90 | – |
| Ref-free | 366 | 0.704 | 0.058 | 1.25 | 14 |
| SVA | 79 | 0.646 | 0.015 | 1.06 | 10 |
| ISVA | 408 | 0.654 | 0.061 | 1.27 | 15 |
| EWASher | 0 | 0.126 | 0.066 | 0.83 | – |
| CellCDec | 14 | 0.650 | 0.021 | 0.93 | – |
| Deconf | 50 | 0.652 | 0.031 | 0.92 | – |
| RUV | 134 | 0.640 | 0.007 | 1.00 | 38 |
KS Kolmogorov–Smirnov statistic, GIF genomic inflation factor
Performance metrics under simulation scenario 3 (opposite effects)
| Method | Number of false positives | Power | KS | GIF |
|
|---|---|---|---|---|---|
| Unadjusted | 5 | 0.490 | 0.041 | 0.97 | – |
| Ref-based | 1 | 0.506 | 0.036 | 0.87 | – |
| Ref-free | 176 | 0.608 | 0.053 | 1.19 | 14 |
| SVA | 37 | 0.562 | 0.003 | 1.00 | 10 |
| ISVA | 49 | 0.558 | 0.006 | 1.01 | 14 |
| EWASher | 0 | 0.128 | 0.092 | 0.74 | – |
| CellCDec | 3 | 0.502 | 0.033 | 0.88 | – |
| Deconf | 2 | 0.522 | 0.035 | 0.89 | – |
| RUV | 23 | 0.532 | 0.012 | 1.00 | 37 |
KS Kolmogorov–Smirnov statistic, GIF genomic inflation factor
Performance metrics under simulation scenario 4 (high precision)
| Method | Number of false positives | Power | KS | GIF |
|
|---|---|---|---|---|---|
| Unadjusted | 368 | 0.600 | 0.142 | 1.32 | – |
| Ref-based | 64 | 0.460 | 0.043 | 1.04 | – |
| Ref-free | 437 | 0.694 | 0.101 | 1.36 | 13 |
| SVA | 79 | 0.534 | 0.032 | 1.09 | 11 |
| ISVA | 198 | 0.522 | 0.062 | 1.20 | 14 |
| EWASher | 16 | 0.072 | 0.038 | 0.94 | – |
| CellCDec | 54 | 0.546 | 0.031 | 1.05 | – |
| Deconf | 58 | 0.520 | 0.014 | 1.02 | – |
| RUV | 155 | 0.568 | 0.051 | 1.15 | 32 |
KS Kolmogorov–Smirnov statistic, GIF genomic inflation factor
Performance metrics under simulation scenario 5 (low precision)
| Method | Number of false positives | Power | KS | GIF |
|
|---|---|---|---|---|---|
| Unadjusted | 127 | 0.622 | 0.241 | 1.47 | – |
| Ref-based | 354 | 0.654 | 0.198 | 1.49 | – |
| Ref-free | 596 | 0.670 | 0.145 | 1.48 | 12 |
| SVA | 123 | 0.644 | 0.091 | 1.23 | 6 |
| ISVA | 72 | 0.482 | 0.040 | 1.12 | 11 |
| EWASher | 1 | 0.052 | 0.030 | 0.92 | – |
| CellCDec | 169 | 0.598 | 0.008 | 1.27 | – |
| Deconf | 184 | 0.644 | 0.204 | 1.41 | – |
| RUV | 2943 | 0.654 | 0.228 | 1.93 | 33 |
KS Kolmogorov–Smirnov statistic, GIF genomic inflation factor
Performance metrics under simulation scenario 6 (continuous)
| Method | Number of false positives | Power | KS | GIF |
|
|---|---|---|---|---|---|
| Unadjusted | 135 | 0.532 | 0.081 | 1.37 | – |
| Ref-based | 13 | 0.498 | 0.031 | 0.99 | – |
| Ref-free | 125 | 0.618 | 0.061 | 1.17 | 14 |
| SVA | 29 | 0.502 | 0.006 | 0.99 | 10 |
| ISVA | 114 | 0.520 | 0.076 | 1.21 | 14 |
| CellCDec | 422 | 0.420 | 0.079 | 1.44 | – |
| Deconf a | – | – | – | – | – |
| RUV | 16 | 0.550 | 0.006 | 0.966 | 39 |
KS Kolmogorov–Smirnov statistic, GIF genomic inflation factor
aNo results were obtained with K=3 in the allowable time on the computational cluster Mammouth
Fig. 3Comparison of performance metrics over ten replications in Scenario 7, few associated DMS. a The number of non-differentially methylated sites with a raw p value less than 10−4. b Power at significance level 10−4. c Kolmogorov–Smirnov statistic. d Genomic inflation factor
Fig. 4Comparison of performance metrics over ten replications in the many associated DMSs scenario. a Number of non-differentially methylated sites with a raw p value less than 10−4. b Power at significance level 10−4. c Kolmogorov–Smirnov statistic. d Genomic inflation factor
Fig. 5QQ plots of −log10 p values from the ARCTIC study with different adjustment methods. These are split across two panels, (a) and (b), to clarify the display
Performance metrics for the ARCTIC data with the most significant probes removed (top 5 %)
| Method | KS statistic | GIF |
|
|---|---|---|---|
| Unadjusted | 0.0758 | 0.9142 | – |
| Ref-based | 0.0508 | 0.9907 | – |
| Ref-free | 0.0296 | 1.1499 | 32 |
| SVA | 0.0362 | 1.0962 | 15 |
| ISVA | 0.1208 | 1.6820 | 39 |
| EWASher | 0.7291 | 1.0164 | – |
| RUV with three components | 0.0906 | 1.2423 | – |
It was not possible to obtain results for the CellCDec and Deconf methods
KS Kolmogorov–Smirnov statistic, GIF genomic inflation factor
P values for sites previously found to be associated with smoking [33]
| Site | Unadj | Ref-Based | Ref-free | SVA | ISVA | EWASher a | RUV |
|---|---|---|---|---|---|---|---|
| cg06644428 | 2.83E-20 | 2.04E-21 | 1.96E-27 | 9.19E-27 | 4.41E-20 | – | 1.64E-22 |
| cg05951221 | 2.72E-43 | 1.21E-45 | 5.41E-56 | 3.96E-58 | 3.20E-45 | 5.77E-01 | 1.05E-53 |
| cg21566642 | 1.72E-49 | 6.93E-50 | 4.20E-54 | 6.59E-58 | 8.09E-54 | 6.45E-01 | 5.94E-58 |
| cg01940273 | 2.09E-35 | 3.30E-38 | 4.07E-43 | 7.07E-46 | 6.93E-42 | 3.07E-01 | 1.45E-44 |
| cg19859270 | 4.91E-13 | 2.52E-22 | 4.76E-35 | 2.21E-35 | 5.15E-28 | – | 2.16E-35 |
| cg05575921 | 1.58E-35 | 7.79E-39 | 7.89E-52 | 5.65E-41 | 1.34E-40 | – | 3.43E-41 |
| cg21161138 | 2.43E-07 | 2.99E-10 | 1.24E-25 | 1.68E-25 | 2.84E-21 | 9.05E-01 | 6.07E-21 |
| cg06126421 | 1.78E-23 | 1.06E-33 | 7.38E-34 | 7.37E-33 | 9.69E-34 | 7.05E-01 | 2.55E-36 |
| cg03636183 | 9.27E-21 | 6.74E-24 | 6.10E-37 | 1.51E-34 | 4.77E-30 | 3.09E-01 | 1.60E-31 |
aSeveral sites were filtered out by EWASher
Fig. 6Agreement of the proportion of top d CpGs declared significant in the different cell-type adjustment methods with the top d CpGs in the originally reported list of significant CpGs from the rheumatoid arthritis study
Fig. 7Computational time comparison. a Sample size. The latent dimension was estimated by the algorithms as needed. b Latent dimension. The sample size is fixed at 50
Comparison of some features of the methods for cell-type mixture adjustment
| Method | Phen. allowed a | Input values b |
| Link |
|---|---|---|---|---|
| Ref-based | Any | Beta | N/A |
|
|
| ||||
| Ref-free | Any | Beta | Estimated |
|
| SVA | Any | Beta or logit(beta) | Estimated |
|
| ISVA | Continuous | Beta or logit(beta) | Estimated |
|
| EWASher | Binary | Beta | Estimated |
|
| CellCDec | Not used | Beta | Input |
|
| Deconf | Not used | Beta or logit(beta) | Input |
|
| RUV | Any | Beta or logit(beta) | Estimated |
|
aWhat kinds of phenotype are allowed?
bDoes the method use methylation proportions (beta values)? Or logit transformed beta values?
cDoes the method estimate the number of latent cell types K, or is K input into the algorithm?