Vincent Botta, Gilles Louppe, Pierre Geurts, Louis Wehenkel.
Abstract
The primary goal of genome-wide association studies (GWAS) is to discover variants that could lead, in isolation or in combination, to a particular trait or disease. Standard approaches to GWAS, however, are usually based on univariate hypothesis tests and therefore can account neither for correlations due to linkage disequilibrium nor for combinations of several markers. To discover and leverage such potential multivariate interactions, we propose in this work an extension of the Random Forest algorithm tailored for structured GWAS data. In terms of risk prediction, we show empirically on several GWAS datasets that the proposed T-Trees method significantly outperforms both the original Random Forest algorithm and standard linear models, thereby suggesting the actual existence of multivariate non-linear effects due to the combinations of several SNPs. We also demonstrate that variable importances as derived from our method can help identify relevant loci. Finally, we highlight the strong impact that quality control procedures may have, both in terms of predictive power and loci identification. Variable importance results and T-Trees source code are available at www.montefiore.ulg.ac.be/~botta/ttrees/ and github.com/0asa/TTree-source, respectively.
Year: 2014 PMID: 24695491 PMCID: PMC3973686 DOI: 10.1371/journal.pone.0093379
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. A closer look into a T-Tree test-node.
Group 1 is tested. Of this group, three SNPs are exploited by the weak learner. In red (resp. green): the probability of being a case (resp. control) as estimated by the weak learner.
Comparison between the two methods.
| | RF | TT | RF | TT |
| 100 | 0.683 | 0.921 | 0.628 | 0.719 |
| 500 | 0.799 | 0.942 | 0.675 | 0.747 |
| 1000 | 0.845 | | 0.684 | |
| 2500 | 0.888 | 0.944 | 0.700 | 0.746 |
| 5000 | 0.909 | 0.938 | 0.698 | 0.740 |
| 10000 | | – | 0.697 | – |
| 25000 | 0.907* | – | | – |
| 50000 | 0.898* | – | 0.698* | – |
Predictive performance of RF and TT for different values of , tuned value for and . Best AUC values for each column are underlined; best AUC values for each dataset variant are shown in bold. For TT, and . (* corresponds to RF with and ; – indicates that TT was not applied for values of ; both for computational efficiency reasons.)
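The AUC values in this table summarize how well predicted risk scores rank cases above controls. As a minimal sketch of that metric (not taken from the T-Trees source code; the function name `auc` and the toy labels and scores are hypothetical), the rank-based Mann-Whitney formulation can be written as:

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: 1 for a case, 0 for a control.
    scores: predicted risk of being a case.
    A tie between a case and a control counts as half a concordant pair.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    concordant = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                concordant += 1.0
            elif c == d:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))


# Toy data (hypothetical): three cases followed by three controls.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
toy_auc = auc(toy_labels, toy_scores)  # 8 of the 9 case/control pairs are concordant
```

The quadratic loop is kept for readability; on real GWAS cohorts a sort-based implementation would be preferable.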
T-Trees: block map and internal complexity influence.
| Block size | Internal complexity | | |
| 10 | 1 | 0.906 | 0.717 |
| 10 | 5 | | |
| 10 | 10 | 0.945 | 0.749 |
| 20 | 5 | | |
| 20 | 10 | 0.945 | 0.740 |
| 20 | 20 | 0.931 | 0.706 |
| 50 | 5 | | 0.742 |
| 50 | 10 | 0.937 | |
| 50 | 25 | 0.913 | 0.700 |
Effect of block size and internal complexity , for , , and . Maxima for each block size are highlighted in bold.
T-Trees: contiguous versus randomized blocks.
| | contig. | rand. | contig. | rand. |
| 100 | | 0.753 | | 0.600 |
| 500 | | 0.835 | | 0.625 |
| 1000 | | 0.853 | | 0.627 |
Predictive performance of TT with , and , using contiguous blocks of SNPs versus random blocks of 10 SNPs. Breaking the structure using randomized block maps drastically deteriorates the results.
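The difference between the two block maps compared here can be illustrated with a short sketch. It assumes SNPs are indexed in genomic order and that blocks are fixed-size chunks of that ordering; the function names and the 25-SNP toy example are hypothetical and do not come from the T-Trees source code.

```python
import random


def contiguous_block_map(n_snps, block_size):
    """Partition SNP indices 0..n_snps-1 into consecutive blocks along the genome."""
    return [list(range(start, min(start + block_size, n_snps)))
            for start in range(0, n_snps, block_size)]


def randomized_block_map(n_snps, block_size, seed=0):
    """Same block sizes, but SNPs are shuffled first, which breaks local structure."""
    rng = random.Random(seed)
    order = list(range(n_snps))
    rng.shuffle(order)
    return [order[start:start + block_size]
            for start in range(0, n_snps, block_size)]


# Toy example: 25 SNPs grouped into blocks of 10 (the last block is shorter).
contig = contiguous_block_map(25, 10)
rand = randomized_block_map(25, 10, seed=42)
```

In the contiguous map, SNPs in a block are neighbours and therefore likely to be in linkage disequilibrium; in the randomized map, a block mixes SNPs from unrelated positions.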
Predictive power: AUC comparisons on the six other WTCCC datasets.
| | RF | TT | RF | TT |
| | 0.743 | | 0.918 | |
| | 0.756 | | 0.998 | |
| | 0.807 | | 0.938 | |
| | 0.806 | | 0.993 | |
| | 0.860 | | 0.900 | |
| | 0.758 | | 0.959 | |
Predictive power of RF and TT on two variants of the other WTCCC datasets. The columns correspond to the ''''-like filtered variant and the to the weakly filtered variant. (Parameter settings: RF: , , ; TT: , , , ).
Comparison of the predictive power of tree-based methods and linear models.
| | | |
| RF | | |
| TT | | |
| | 0.661 | 0.648 |
| | 0.739 | 0.729 |
| SGD-L1 | 0.623 | 0.613 |
| SGD-L2 | 0.643 | 0.635 |
| Logit | 0.648 | 0.638 |
| Poly | 0.716 | – |
| LassoSVM | 0.762 | – |
Regions highlighted from the top 200 SNPs according to SNP importances with RF and T-Trees on CD.
| Random Forests | ||||
| chr | size | rsid | trend p-value | importance |
| 1 | 10 | rs11209026^1,4,5 | 8.24 · 10^-18 | 1.40 · 10^-2 (1) |
| 2 | 2 | rs37550762 | 5.18 · 10^-1 | 5.30 · 10^-4 (48) |
| 2 | 17 | rs11887827^3,4 | 2.42 · 10^-8 | 1.27 · 10^-3 (20) |
| 2 | 5 | rs10210302^1,4,5 | 2.22 · 10^-13 | 2.79 · 10^-3 (6) |
| 3 | 6 | rs11718165^1,5 | 1.70 · 10^-6 | 1.19 · 10^-3 (24) |
| 4 | 2 | rs17045935^4 | 5.28 · 10^-2 | 6.45 · 10^-4 (39) |
| 5 | 3 | rs16893874 | 3.18 · 10^-5 | 3.32 · 10^-4 (80) |
| 5 | 12 | rs17234657^1,4,5 | 1.72 · 10^-13 | 2.26 · 10^-3 (10) |
| 5 | 2 | rs17149128 | 4.10 · 10^-1 | 1.97 · 10^-4 (166) |
| 5 | 4 | rs9310585 | 1.53 · 10^-8 | 5.83 · 10^-4 (44) |
| 6 | 2 | rs600382 | 2.38 · 10^-5 | 2.67 · 10^-4 (95) |
| 8 | 4 | rs10216909^4 | 7.76 · 10^-5 | 3.04 · 10^-4 (87) |
| 10 | 2 | rs16919914 | 2.22 · 10^-1 | 5.20 · 10^-4 (49) |
| 11 | 2 | rs1533339 | 2.78 · 10^-4 | 2.15 · 10^-4 (145) |
| 16 | 4 | rs2076756^1,4,5 | 3.95 · 10^-15 | 3.88 · 10^-3 (4) |
| 23 | 2 | rs6522332 | 3.23 · 10^-1 | 2.08 · 10^-4 (155) |
| 7 | 1 | rs834771^2,6 | 1.25 · 10^-3 | 1.91 · 10^-4 (177) |
| 8 | 1 | rs10957818^2,6 | 2.62 · 10^-5 | 2.13 · 10^-4 (151) |
| 14 | 1 | rs4903604^2,6 | 2.48 · 10^-3 | 2.89 · 10^-4 (89) |
| 18 | 1 | rs2542151^1,5,6 | 7.21 · 10^-8 | 2.07 · 10^-4 (156) |
| T-Trees | ||||
| chr | size | rsid | trend p-value | importance |
| 1 | 2 | rs12409315 | 2.54 · 10^-3 | 4.36 · 10^-4 (32) |
| 1 | 10 | rs11209026^1,4,5 | 8.24 · 10^-18 | 5.23 · 10^-3 (5) |
| 1 | 2 | rs11162341 | 8.99 · 10^-1 | 2.28 · 10^-4 (57) |
| 1 | 5 | rs6677092 | 1.77 · 10^-4 | 4.15 · 10^-4 (33) |
| 2 | 35 | rs11887827^3,4 | 2.42 · 10^-8 | 1.03 · 10^-2 (1) |
| 2 | 2 | SNP_A-2293058 | 1.79 · 10^-5 | 1.81 · 10^-4 (78) |
| 2 | 5 | rs10210302^1,4,5 | 2.22 · 10^-13 | 3.07 · 10^-4 (48) |
| 3 | 2 | rs17047422 | 3.45 · 10^-4 | 1.91 · 10^-4 (73) |
| 3 | 2 | rs6774 | 1.39 · 10^-2 | 3.41 · 10^-4 (43) |
| 3 | 2 | rs4686733 | 3.65 · 10^-1 | 1.39 · 10^-4 (93) |
| 4 | 2 | rs1872321 | 6.88 · 10^-9 | 1.19 · 10^-3 (17) |
| 4 | 2 | rs17045935^4 | 5.28 · 10^-2 | 2.57 · 10^-4 (53) |
| 4 | 3 | rs1595154 | 1.08 · 10^-7 | 5.70 · 10^-4 (28) |
| 5 | 10 | rs17234657^1,4,5 | 1.72 · 10^-13 | 4.55 · 10^-4 (30) |
| 6 | 2 | rs168846935 | 1.21 · 10^-3 | 9.36 · 10^-5 (145) |
| 6 | 3 | rs2784899 | 6.48 · 10^-2 | 1.26 · 10^-4 (106) |
| 7 | 2 | rs10270692 | 9.31 · 10^-2 | 1.99 · 10^-4 (68) |
| 7 | 9 | rs6947579^3 | 8.54 · 10^-1 | 7.55 · 10^-3 (3) |
| 8 | 2 | rs10216909^4 | 7.76 · 10^-5 | 1.03 · 10^-4 (131) |
| 10 | 2 | rs11011417 | 1.85 · 10^-5 | 1.31 · 10^-4 (100) |
| 11 | 2 | rs9804490 | 2.41 · 10^-5 | 1.16 · 10^-4 (117) |
| 12 | 2 | rs11613902 | 9.43 · 10^-1 | 3.46 · 10^-4 (41) |
| 14 | 4 | rs10144260 | 1.18 · 10^-9 | 1.07 · 10^-3 (18) |
| 14 | 2 | rs2819467 | 1.51 · 10^-3 | 1.23 · 10^-4 (110) |
| 16 | 3 | rs2076756^1,4,5 | 3.95 · 10^-15 | 6.43 · 10^-4 (25) |
| 23 | 8 | rs5904497 | 4.41 · 10^-2 | 3.26 · 10^-3 (9) |
| 23 | 2 | rs6624585 | 2.69 · 10^-2 | 2.24 · 10^-4 (58) |
| 3 | 1 | rs11718165^1,5,6 | 1.70 · 10^-6 | 7.93 · 10^-5 (159) |
| 5 | 1 | rs2279980^2,6 | 6.19 · 10^-5 | 7.03 · 10^-5 (182) |
| 8 | 1 | rs10957818^2,6 | 2.62 · 10^-5 | 1.06 · 10^-4 (126) |
| 18 | 1 | rs2542151^1,5,6 | 7.21 · 10^-8 | 9.35 · 10^-5 (146) |
Regions highlighted from the top 200 SNPs according to SNP importances with RF (top) and T-Trees (bottom) on CD. Each row corresponds to a set of SNPs obtained by merging contiguous SNPs in the rankings that are not separated by more than 20 SNPs. For readability, only groups of more than 2 SNPs appear in the tables. Markers that are isolated but reported as associated in [25] are nevertheless compiled at the bottom of both tables (6). For each region, the columns provide the chromosome number, the number of important SNPs in the region, the most important SNP in the region (and its gene name if provided by PheGenI [40]), the p-value of this SNP, and its importance (with its rank in parentheses). (1) and (2): the regions reported as strongly (with a trend or a genotypic p-value < 10^-5) and moderately (with a trend or a genotypic p-value between 10^-5 and 10^-4) associated in [25]. (5): also reported by [37]. (4): regions identified by both RF and T-Trees. (3): the two novel regions mainly spotted by T-Trees.
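The region-building rule described in this caption (merge top-ranked SNPs that are not separated by more than 20 SNPs, then report the most important SNP per region) can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes each SNP carries its index along its chromosome, interprets "not separated by more than 20 SNPs" as an index difference of at most 20, and uses hypothetical names (`merge_regions`, `snp_a`, ...) and toy values.

```python
def merge_regions(top_snps, max_gap=20):
    """Group top-ranked SNPs into regions.

    top_snps: list of (chromosome, position_index, snp_id, importance) tuples,
    where position_index is the SNP's index along its chromosome (an assumption).
    Two consecutive selected SNPs on the same chromosome fall in the same region
    when their indices differ by at most max_gap.
    Returns one (chromosome, size, best_snp_id, best_importance) tuple per region.
    """
    regions = []
    for snp in sorted(top_snps, key=lambda s: (s[0], s[1])):
        chrom, idx = snp[0], snp[1]
        last = regions[-1] if regions else None
        if last and last[-1][0] == chrom and idx - last[-1][1] <= max_gap:
            last.append(snp)  # close enough to the previous SNP: extend the region
        else:
            regions.append([snp])  # otherwise start a new region
    summary = []
    for region in regions:
        best = max(region, key=lambda s: s[3])  # most important SNP in the region
        summary.append((region[0][0], len(region), best[2], best[3]))
    return summary


# Toy input (hypothetical identifiers and importances).
toy = [
    (1, 100, "snp_a", 0.004),
    (1, 110, "snp_b", 0.010),
    (1, 500, "snp_c", 0.002),
    (2, 105, "snp_d", 0.003),
]
toy_regions = merge_regions(toy)
```

Here `snp_a` and `snp_b` merge into one region of size 2 led by `snp_b`, while `snp_c` and `snp_d` remain isolated markers of size 1.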
Figure 2. Group and variable importances for the two novel candidate regions for Crohn's disease.
Regions 2p12 (top) and 7q31 (bottom), as found by T-Trees on . First row: SNP and block importances. Second row: univariate (Fisher) p-values and haplotype p-values as derived from the case/control omnibus test with degrees of freedom where corresponds to the number of common haplotypes (a haplotype is said to be common if its frequency is greater than in the population under study). Third row: number of haplotypes in each block. Bottom plot: LD pattern () in the regions.