Bettina Mieth, Marius Kloft, Juan Antonio Rodríguez, Sören Sonnenburg, Robin Vobruba, Carlos Morcillo-Suárez, Xavier Farré, Urko M Marigorta, Ernst Fehr, Thorsten Dickhaus, Gilles Blanchard, Daniel Schunk, Arcadi Navarro, Klaus-Robert Müller.
Abstract
The standard approach to the analysis of genome-wide association studies (GWAS) is based on testing each position in the genome individually for statistical significance of its association with the phenotype under investigation. To improve the analysis of GWAS, we propose a combination of machine learning and statistical testing that takes correlation structures within the set of SNPs under investigation into account in a mathematically well-controlled manner. The novel two-step algorithm, COMBI, first trains a support vector machine to determine a subset of candidate SNPs and then performs hypothesis tests for these SNPs together with an adequate threshold correction. Applying COMBI to data from a WTCCC study (2007) and measuring performance as replication by independent GWAS published within the 2008-2015 period, we show that our method outperforms ordinary raw p-value thresholding as well as other state-of-the-art methods. COMBI presents higher power and precision than the examined alternatives while yielding fewer false (i.e. non-replicated) and more true (i.e. replicated) discoveries when its results are validated on later GWAS. More than 80% of the discoveries made by COMBI upon WTCCC data have been validated by independent studies. Implementations of the COMBI method are available as a part of the GWASpi toolbox 2.0.
Year: 2016 PMID: 27892471 PMCID: PMC5125008 DOI: 10.1038/srep36671
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. The COMBI method: summary and illustration of the methodology.
Receiving genotypes and corresponding phenotypes of a GWAS as input, the COMBI method first applies a machine learning step to select a set of candidate SNPs and then calculates p-values and corresponding significance thresholds in a statistical testing step.
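The two steps in Figure 1 can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes that SVM weights and per-SNP p-values have already been computed, and the function name `combi_screen` and the parameters `k` (number of candidate SNPs kept) and `threshold` are hypothetical. SVM training and the calibration of the significance threshold are omitted.

```python
def combi_screen(svm_weights, p_values, k, threshold):
    """Sketch of the screening-then-testing idea (hypothetical helper).

    svm_weights: one SVM weight per SNP (assumed precomputed)
    p_values:    one raw test p-value per SNP (assumed precomputed)
    k:           number of candidate SNPs to keep (assumed parameter)
    threshold:   significance threshold t* (assumed already calibrated)
    """
    if len(svm_weights) != len(p_values):
        raise ValueError("need one weight and one p-value per SNP")

    # Step 1: keep the k SNPs with the largest absolute SVM weight.
    order = sorted(range(len(svm_weights)),
                   key=lambda j: abs(svm_weights[j]),
                   reverse=True)
    candidates = set(order[:k])

    # Step 2: SNPs outside the candidate set get a p-value of 1,
    # so only candidates can pass the threshold.
    screened = [p if j in candidates else 1.0
                for j, p in enumerate(p_values)]
    hits = [j for j in sorted(candidates) if screened[j] < threshold]
    return screened, hits


weights = [0.01, -0.03, 0.002, 0.02]
pvals = [1e-6, 1e-7, 1e-9, 0.2]
screened, hits = combi_screen(weights, pvals, k=2, threshold=1e-5)
```

In this toy input the third SNP has the smallest raw p-value but a small SVM weight, so it is screened out before testing; only SNPs that pass both steps are reported.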
Tabular representation of single SNP data.
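| | A1A1 | A1A2 | A2A2 | ∑ |
|---|---|---|---|---|
| Cases (Y = +1) | n11 | n12 | n13 | n1· |
| Controls (Y = −1) | n21 | n22 | n23 | n2· |
| ∑ | n·1 | n·2 | n·3 | n |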
Single SNP data are summarized in categories according to phenotypes (cases, Y = +1, and controls, Y = −1) and genotypes (A1A1, A1A2 and A2A2). The numbers nik denote the numbers of individuals within the corresponding groups. n is the total number of subjects in the study.
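A table of this shape is the input to a per-SNP association test. As an illustration, the sketch below computes one common form of the χ² trend test (the Cochran-Armitage test) from the genotype counts of cases and controls. The genotype scores (0, 1, 2) and the function name `trend_test` are assumptions made here for illustration; the exact variant used in the study may differ.

```python
import math


def trend_test(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test for a single SNP (illustrative sketch).

    cases, controls: genotype counts in the order (A1A1, A1A2, A2A2)
    scores:          assumed additive genotype scores
    Returns (chi2, p_value) with one degree of freedom.
    """
    col_totals = [r + s for r, s in zip(cases, controls)]
    n_cases = sum(cases)
    n_controls = sum(controls)
    n = n_cases + n_controls

    sum_tr = sum(t * r for t, r in zip(scores, cases))
    sum_tn = sum(t * c for t, c in zip(scores, col_totals))
    sum_t2n = sum(t * t * c for t, c in zip(scores, col_totals))

    numerator = n * (n * sum_tr - n_cases * sum_tn) ** 2
    denominator = n_cases * n_controls * (n * sum_t2n - sum_tn ** 2)
    if denominator == 0:
        # Degenerate table (e.g. only one genotype observed): no evidence.
        return 0.0, 1.0

    chi2 = numerator / denominator
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


chi2, p = trend_test(cases=[10, 20, 30], controls=[30, 20, 10])
```

The counts in the usage line are made up; they are not taken from the WTCCC data.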
Figure 2. Illustration of validation methodology.
After an appropriate inference method (e.g. COMBI or RPVT) has produced a list of associated SNPs, the GWAS catalog is used in an independent validation step to confirm or refute those candidate SNPs, thereby assessing the predictive quality of the inference method.
Figure 3. Genome-wide scan for seven diseases.
Manhattan plots for all seven diseases resulting from the standard RPVT approach and the COMBI method as well as the SVM weights. We plot −log₁₀ of the χ² trend test p-values for both COMBI and RPVT and the corresponding SVM weights against position on each chromosome. Chromosomes are shown in alternating colours for clarity, with significant p-values highlighted in green. Please note that for the RPVT, the threshold indicated by the horizontal dashed line is fixed a priori genome-wide. For the COMBI method, it was determined chromosome-wise via the permutation-based threshold over the whole COMBI procedure. All panels are truncated at −log₁₀(p-value) = 15, although some markers exceed this value.
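The vertical axis of such a plot is a simple transform of the p-value. A minimal sketch of the transform with the truncation at 15 described in the caption follows; the function name `manhattan_height` and the handling of non-positive inputs are assumptions made for illustration.

```python
import math


def manhattan_height(p_value, cap=15.0):
    """Return -log10(p) truncated at `cap` (illustrative helper)."""
    if p_value <= 0.0:
        # A p-value reported as 0 cannot be log-transformed; plot it at the cap.
        return cap
    return min(-math.log10(p_value), cap)


heights = [manhattan_height(p) for p in (1e-5, 1e-20, 1.0)]
```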
Association analysis of the SNPs reaching genome-wide significance applying the COMBI method.
| Disease | Chromosome | Identifier | χ² trend test p-value | SVM weight | Validated | References (PMID) |
|---|---|---|---|---|---|---|
| Bipolar disorder (BD) | 1 | | 1.05e-05 | 0.0141 | YES | 19416921 |
| | 2 | | 1.26e-05 | 0.0146 | YES | 21254220 |
| | 2 | rs7570682 | 1.77e-06 | 0.0150 | YES | 21254220 |
| | 3 | | 1.18e-05 | 0.0150 | YES | 21254220 |
| | 14 | rs11622475 | 8.02e-06 | 0.0235 | YES | 21254220 |
| | 16 | | 1.10e-05 | 0.0245 | YES | 21254220 |
| | 9 | rs7860360 | 1.82e-06 | 0.0174 | | |
| | 20 | rs3761218 | 7.15e-06 | 0.0243 | YES | 21254220 |
| Coronary artery disease (CAD) | 5 | | 1.35e-05 | 0.0174 | YES | 21804106 |
| | 6 | | 1.22e-05 | 0.0145 | YES | 17634449 |
| | 9 | rs1333049 | 1.12e-13 | 0.0262 | YES | 21606135 |
| | 22 | rs688034 | 2.75e-06 | 0.0287 | | |
| Crohn’s disease (CD) | 1 | rs11805303 | 6.35e-12 | 0.0234 | | |
| | 1 | | 1.02e-05 | 0.0142 | YES | 17554261 |
| | 2 | rs10210302 | 4.52e-14 | 0.0224 | YES | 23128233 |
| | 3 | rs11718165 | 2.04e-08 | 0.0163 | YES | 21102463 |
| | 5 | rs6596075 | 3.11e-06 | 0.0168 | | |
| | 5 | rs17234657 | 2.42e-12 | 0.0305 | YES | 18587394 |
| | 7 | | 1.08e-05 | 0.0160 | | |
| | 9 | | 1.61e-05 | 0.0201 | YES | 21102463 |
| | 10 | rs10883371 | 5.23e-08 | 0.0227 | YES | 21102463 |
| | 16 | rs2076756 | 7.55e-15 | 0.0361 | YES | 21102463 |
| | 18 | rs2542151 | 1.93e-07 | 0.0246 | YES | 18587394 |
| Hypertension (HT) | 1 | rs2820037 | 7.41e-07 | 0.0155 | | |
| | 12 | | 1.58e-05 | 0.0197 | | |
| | 15 | rs2398162 | 6.01e-06 | 0.0230 | | |
| Rheumatoid arthritis (RA) | 1 | rs6679677 | <1.0e-15 | 0.0243 | YES | 20453842 |
| | 4 | rs3816587 | 7.28e-06 | 0.0163 | | |
| | 6 | rs9272346 | 7.38e-14 | 0.0239 | | |
| Type 1 diabetes (T1D) | 1 | rs6679677 | <1.0e-15 | 0.0247 | YES | 19430480 |
| | 2 | rs231726 | 1.43e-06 | 0.0129 | | |
| | 4 | rs17388568 | 3.07e-06 | 0.0175 | YES | 21829393 |
| | 5 | rs17166496 | 5.97e-06 | 0.0148 | | |
| | 6 | rs9272346 | <1.0e-15 | 0.0792 | YES | 18978792 |
| | 7 | | 1.03e-05 | 0.0172 | | |
| | 12 | rs17696736 | 1.55e-14 | 0.0223 | YES | 18978792 |
| | 12 | rs11171739 | 8.36e-11 | 0.0244 | YES | 19430480 |
| | 16 | rs12924729 | 7.86e-08 | 0.0285 | YES | 17554260 |
| Type 2 diabetes (T2D) | 2 | | 1.00e-05 | 0.0159 | YES | 20418489 |
| | 4 | rs1481279 | 9.44e-06 | 0.0173 | | |
| | 4 | rs7659604 | 9.61e-06 | 0.0175 | | |
| | 6 | rs9465871 | 3.38e-07 | 0.0162 | | |
| | 10 | rs4506565 | 5.01e-12 | 0.0267 | YES | 23300278 |
| | 12 | rs1495377 | 7.21e-06 | 0.0196 | | |
| | 16 | rs7193144 | 4.15e-08 | 0.0293 | YES | 22693455 |
| | 18 | rs1025450 | 1.98e-06 | 0.0271 | | |
For all seven diseases we present SNPs reaching genome-wide significance along with their rs-identifier, corresponding chromosome, χ² trend test p-value, SVM weight and the result of the validation pipeline, which indicates whether the SNP has been found significant with a p-value < 10⁻⁵ in at least one external GWAS or meta-analysis. PMID references of those studies are given in the last column. SNPs that do not show genome-wide significance in the case of RPVT are highlighted in bold.
Figure 4. ROC and PR curves of the RPVT approach, the COMBI method and raw SVM weights using the independent validation pipeline as an indicator of replicability.
The results of all seven diseases have been pooled. The curves have been generated based on the replication of SNPs according to the GWAS catalog. Replicated reported associations are counted as true positives, and non-replicated associations as false positives. Note that the COMBI lines end at some point and the RPVT and the raw SVM lines continue. At the endpoint of the COMBI curve all SNPs selected in the SVM step are also significant in the statistical testing step; i.e. if one wanted to add just one more SNP to the list of reported associations, all other SNPs would also become significant, as they have a p-value of 1. The points on the RPVT and COMBI lines represent the final results of the two methods when applying the corresponding significance thresholds and are described in more detail in Table 3.
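The counting rule behind such a curve can be made concrete with a short sketch: SNPs are ranked by p-value, and at each cutoff the replicated SNPs reported so far count as true positives and the non-replicated ones as false positives. This is an illustration under stated assumptions, not the evaluation code used for the figure; the function name `pr_points` is hypothetical, and ties (such as the many SNPs with a p-value of 1 after the COMBI screening step) are not treated specially.

```python
def pr_points(p_values, replicated):
    """Precision-recall points from ranked SNPs (illustrative sketch).

    p_values:   one p-value per SNP; smaller means reported earlier
    replicated: one bool per SNP; True if an external study replicated it
    Returns a list of (recall, precision) pairs, one per cutoff.
    """
    order = sorted(range(len(p_values)), key=lambda j: p_values[j])
    total_replicated = sum(1 for flag in replicated if flag)

    points = []
    true_positives = 0
    for rank, j in enumerate(order, start=1):
        if replicated[j]:
            true_positives += 1
        recall = true_positives / total_replicated if total_replicated else 0.0
        precision = true_positives / rank
        points.append((recall, precision))
    return points


curve = pr_points([1e-8, 1e-6, 1e-3, 0.5], [True, False, True, False])
```

The four p-values and replication flags in the usage line are invented for the example.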
Empirical evaluation of the performance of the COMBI method on the WTCCC data, relative to that of basic RPVT.
| | Number of SNPs reaching significance applying RPVT | Number of SNPs reaching significance applying the COMBI method |
|---|---|---|
| SNPs that have achieved p < 10⁻⁵ in at least one external study | 24 (32% precision) | 28 (61% precision) |
| SNPs that have not achieved p < 10⁻⁵ in an external study | 52 (68% error) | 18 (39% error) |
| Overall | 76 | 46 |
| 0.0014 | | |
The table presents the information given by the points on the RPVT and COMBI lines in Fig. 4, i.e. the final results of the two methods when applying the corresponding significance thresholds. At significance threshold t* = 10⁻⁵, COMBI recalls 28 replicated SNPs at a precision of 61%, while RPVT recalls only 24 replicated SNPs at a precision of 32%.
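The percentages in the table follow directly from the counts. The short check below reproduces them; the helper name `percent` is introduced here only for illustration.

```python
def percent(part, total):
    """Share of `part` in `total`, rounded to a whole percentage."""
    return round(100.0 * part / total)


# Counts taken from the table: replicated, non-replicated, overall.
rpvt_precision = percent(24, 76)    # replicated share for RPVT
rpvt_error = percent(52, 76)        # non-replicated share for RPVT
combi_precision = percent(28, 46)   # replicated share for COMBI
combi_error = percent(18, 46)       # non-replicated share for COMBI
```

COMBI reports fewer SNPs overall (46 versus 76) yet more replicated ones (28 versus 24), which is why its precision is roughly twice that of RPVT.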