Sarah A Gagliano, Reena Ravji, Michael R Barnes, Michael E Weale, Jo Knight.
Abstract
Although technology has triumphed in facilitating routine genome sequencing, new challenges have been created for the data analyst. Genome-scale surveys of human variation generate volumes of data that far exceed capabilities for laboratory characterization. By incorporating functional annotations as predictors, statistical learning has been widely investigated for prioritizing genetic variants likely to be associated with complex disease. We compared three published prioritization procedures, which use different statistical learning algorithms and different predictors with regard to the quantity, type and coding. We also explored different combinations of algorithm and annotation set. As an application, we tested which methodology performed best for prioritizing variants using data from a large schizophrenia meta-analysis by the Psychiatric Genomics Consortium. Results suggest that all methods have considerable (and similar) predictive accuracies (AUCs 0.64-0.71) in test set data, but there is more variability in the application to the schizophrenia GWAS. In conclusion, a variety of algorithms and annotations seem to have a similar potential to effectively enrich true risk variants in genome-scale datasets; however, none offers more than incremental improvement in prediction. We discuss how methods might be evolved for risk variant prediction to address the impending bottleneck of the new generation of genome re-sequencing studies.
Year: 2015 PMID: 26300220 PMCID: PMC4642511 DOI: 10.1038/srep13373
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Comparison of the three papers.
| | Gagliano | Ritchie | Kircher |
|---|---|---|---|
| Functional annotations | n = 14 (ENCODE, eQTLs, PhastCons, Genic context…) | n = 174 (ENCODE, GERP, Genic context…) | n = 63 (expanded to 949) (Ensembl VEP, ENCODE, PolyPhen…) |
| Risk variants (“Hits”) | NHGRI GWAS Catalogue (p-value ≤ 5 × 10⁻⁸) | HGMD (“regulatory”) | Simulated mutations under neutral model (“gap” sites) |
| Non-risk variants (“Non-hits”) | union of common Illumina and Affymetrix GWAS panels | other variants in 1000 Genomes Project (for example, within 1 kb of each HGMD variant) | high-frequency derived human alleles from 1000 Genomes |
| Classifier algorithm | Elastic net | Random forest | Support vector machine |
| Training protocol | 60% training, 40% reserved for testing | 100% training | 99% training, 1% reserved for testing |
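The 60%/40% training protocol in the first column can be sketched as follows. This is a minimal illustration, not the authors' code: the table gives only the proportions, so splitting hits and non-hits separately (stratification), the fixed seed, and the toy label vector are all assumptions made here.

```python
import random


def stratified_split(labels, train_fraction=0.6, seed=0):
    """Return (train_idx, test_idx), splitting each class separately.

    Stratifying by class is an assumption of this sketch; the source
    only states the 60%/40% proportions.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy labels: 50 hits (1) and 200 non-hits (0).
labels = [1] * 50 + [0] * 200
train_idx, test_idx = stratified_split(labels, train_fraction=0.6, seed=1)
print(len(train_idx), len(test_idx))
```

Splitting each class separately keeps the hit to non-hit ratio the same in both partitions, which matters when hits are much rarer than non-hits.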
The area under the curve (AUC) for the GWAS Catalogue comparisons, holding data and classifier constant, while varying algorithm and annotations.
| Annotations → | Gagliano | Ritchie | Kircher |
|---|---|---|---|
| Elastic Net | 0.67 [0.65–0.68] (0.67) | 0.65 [0.63–0.66] (0.67) | 0.71 [0.69–0.73] (0.74) |
| Random Forest (altered minimum node size) | 0.67 [0.65–0.68] (0.69) | 0.68 [0.66–0.69] (0.72) | 0.70 [0.68–0.72] (0.79) |
| Support Vector Machine (with prior feature selection) | 0.66 [0.65–0.68] (0.66) | 0.64 [0.63–0.66] (0.66) | 0.64 [0.61–0.66] (0.68) |
The 95% confidence interval based on 2000 bootstrap replicates (generated using the R package pROC) is shown in square brackets. The AUC in the training set is in parentheses.
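To make the bracketed intervals concrete, here is a minimal Python sketch of an AUC with a percentile bootstrap confidence interval. The authors used the R package pROC; this is not a port of that package. Resampling hits and non-hits separately, the simple percentile rule, the seed, and the simulated scores are assumptions chosen for illustration.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a random hit outscores a random non-hit.

    Ties between a hit and a non-hit count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling each class with replacement."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    replicates = []
    for _ in range(n_boot):
        boot_pos = rng.choices(pos, k=len(pos))
        boot_neg = rng.choices(neg, k=len(neg))
        replicates.append(
            auc(boot_pos + boot_neg, [1] * len(boot_pos) + [0] * len(boot_neg))
        )
    replicates.sort()
    low = replicates[round(alpha / 2 * n_boot)]
    high = replicates[round((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Hypothetical prediction scores: hits are shifted slightly upwards.
data_rng = random.Random(42)
scores = [data_rng.gauss(0.5, 1.0) for _ in range(20)] + [
    data_rng.gauss(0.0, 1.0) for _ in range(40)
]
labels = [1] * 20 + [0] * 40

point_estimate = auc(scores, labels)
ci_low, ci_high = bootstrap_auc_ci(scores, labels, n_boot=2000)
print(f"{point_estimate:.2f} [{ci_low:.2f}-{ci_high:.2f}]")
```

The pairwise AUC is quadratic in sample size, which is fine for a toy example; a rank-based formula would be the usual choice for genome-scale data.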
Figure 1. Violin plots showing class separation by prediction scores for the various comparisons using the GWAS Catalogue as the classifier.
Hits are variants in the GWAS Catalogue with a genome-wide significant p-value (p ≤ 5 × 10⁻⁸); non-hits are variants absent from the GWAS Catalogue but present on common GWAS arrays, included for comparison purposes. The non-scaled elastic net models are plotted. The adjusted minimum node size (10%) random forest models are plotted.
Figure 2. Quantile-quantile plots of PGC1 sub-genome-wide-significant variants (5 × 10⁻⁸ < p < 1 × 10⁻⁶) stratified by prediction score for the various models based on the GWAS Catalogue classifier, and plotted by PGC2 p-values.
PGC1 p-values are plotted on the x-axis and PGC2 p-values are plotted on the y-axis. Models are grouped by annotation set: Gagliano et al. (a), Ritchie et al. (b) and Kircher et al. (c). The lower-quartile variants are the PGC1 sub-genome-wide-significant variants that were assigned the lowest prediction scores (first quartile), and the top-quartile variants are those with the highest prediction scores (fourth quartile).
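The quartile stratification described in the Figure 2 legend can be sketched as below. This is an illustrative reading of the legend rather than the authors' pipeline: the record layout, the variant identifiers, the quantile method (the default of Python's `statistics.quantiles`), and the handling of scores exactly on a cut point are all assumptions.

```python
import statistics


def stratify_by_score(variants, p_low=5e-8, p_high=1e-6):
    """Return (lower_quartile_ids, top_quartile_ids).

    Keeps variants whose PGC1 p-value lies strictly inside the
    sub-genome-wide-significant window, then splits them by the
    first and third quartiles of the prediction score.
    """
    window = [v for v in variants if p_low < v["pgc1_p"] < p_high]
    q1, _, q3 = statistics.quantiles([v["score"] for v in window], n=4)
    lower = [v["id"] for v in window if v["score"] <= q1]
    top = [v["id"] for v in window if v["score"] >= q3]
    return lower, top


# Hypothetical variants: eight inside the window, two outside it.
variants = [
    {"id": f"v{i}", "pgc1_p": 5e-7, "score": i / 10} for i in range(1, 9)
]
variants.append({"id": "gws", "pgc1_p": 1e-9, "score": 0.95})  # genome-wide significant
variants.append({"id": "weak", "pgc1_p": 1e-3, "score": 0.05})  # too weak for the window

lower_ids, top_ids = stratify_by_score(variants)
print(lower_ids, top_ids)
```

The PGC2 p-values of the two groups could then be compared, for example on a quantile-quantile plot as in the figure.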
The area under the curve (AUC) for the HGMD comparisons, holding data and classifier constant, while varying algorithm and annotations.
| Annotations → | Gagliano | Ritchie | Kircher |
|---|---|---|---|
| Elastic Net | 0.66 [0.64–0.67] (0.65) | 0.87 [0.86–0.88] (0.88) | 0.88 [0.87–0.89] (0.88) |
| Random Forest (altered minimum node size) | 0.65 [0.64–0.66] (0.66) | 0.91 [0.90–0.92] (0.91) | 0.87 [0.86–0.88] (0.89) |
| Support Vector Machine (with prior feature selection) | 0.63 [0.62–0.64] (0.66) | 0.85 [0.83–0.86] (0.86) | 0.85 [0.84–0.86] (0.87) |
The 95% confidence interval based on 2000 bootstrap replicates (generated using the R package pROC) is shown in square brackets. The AUC in the training set is in parentheses.
The area under the curve (AUC) for the non-exonic HGMD comparisons, holding data and classifier constant, while varying algorithm and annotations.
| Annotations → | Gagliano | Ritchie | Kircher |
|---|---|---|---|
| Elastic Net | 0.65 [0.61–0.68] (0.66) | 0.77 [0.74–0.80] (0.78) | 0.79 [0.76–0.81] (0.80) |
| Random Forest (altered minimum node size) | 0.65 [0.61–0.68] (0.65) | 0.80 [0.77–0.82] (0.86) | 0.78 [0.75–0.80] (0.85) |
| Support Vector Machine (with prior feature selection) | 0.61 [0.58–0.65] (0.68) | 0.68 [0.65–0.72] (0.78) | 0.76 [0.73–0.78] (0.82) |
The 95% confidence interval based on 2000 bootstrap replicates (generated using the R package pROC) is shown in square brackets. The AUC in the training set is in parentheses.