Emidio Capriotti, Piero Fariselli.
Abstract
One of the major challenges in human genetics is to identify the functional effects of coding and non-coding single nucleotide variants (SNVs). Several methods have been developed to identify disease-related single amino acid changes, but only a few tools can score the impact of non-coding variants. Among the most popular algorithms, CADD and FATHMM predict the effect of SNVs in non-coding regions by combining sequence conservation with several functional features derived from ENCODE project data. Thus, running CADD or FATHMM locally requires downloading a large set of pre-calculated information. To facilitate variant annotation, we developed PhD-SNPg, a new easy-to-install and lightweight machine learning method that depends only on sequence-based features. Despite this, PhD-SNPg performs similarly to or better than more complex methods. This makes PhD-SNPg ideal for quick SNV interpretation and as a benchmark for tool development. AVAILABILITY: PhD-SNPg is accessible at http://snps.biofold.org/phd-snpg.
Year: 2017 PMID: 28482034 PMCID: PMC5570245 DOI: 10.1093/nar/gkx369
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. (A) Distribution of Pathogenic (red) and Benign (blue) single nucleotide variants (SNVs) along the chromosomes. *The size of the mitochondrial chromosome (M) in panel A is increased 2,500 times. (B) Schematic view of the PhD-SNPg algorithm and its input features. (C) Distribution of PhyloP100 scores at the loci where Pathogenic (red) and Benign (blue) SNVs are detected. (D) Performance of PhD-SNPg (red), CADD (black) and FATHMM-MKL (blue) on the testing set (NewClinvar032016).
Performance of PhD-SNPg, FATHMM-MKL and CADD on the Clinvar012016 dataset
| Method | Dataset | Q2 | TNR | NPV | TPR | PPV | MCC | F1 | AUC |
|---|---|---|---|---|---|---|---|---|---|
| PhD-SNPg | All | 0.88 | 0.81 | 0.82 | 0.91 | 0.91 | 0.72 | 0.91 | 0.93 |
| | Coding | 0.88 | 0.74 | 0.77 | 0.92 | 0.91 | 0.67 | 0.92 | 0.92 |
| | Non-coding | 0.90 | 0.92 | 0.91 | 0.86 | 0.88 | 0.78 | 0.87 | 0.94 |
| FATHMM-MKL (a) | All | 0.84 | 0.67 | 0.79 | 0.91 | 0.85 | 0.61 | 0.88 | 0.88 |
| | Coding | 0.83 | 0.58 | 0.70 | 0.91 | 0.86 | 0.53 | 0.89 | 0.86 |
| | Non-coding | 0.88 | 0.84 | 0.95 | 0.92 | 0.79 | 0.75 | 0.85 | 0.95 |
| CADD (a) | All | 0.90 | 0.90 | 0.81 | 0.90 | 0.95 | 0.78 | 0.93 | 0.95 |
| | Coding | 0.91 | 0.85 | 0.80 | 0.93 | 0.95 | 0.77 | 0.94 | 0.94 |
| | Non-coding | 0.88 | 0.99 | 0.83 | 0.71 | 0.99 | 0.76 | 0.82 | 0.94 |
Q2: overall accuracy, TNR: true negative rate, NPV: negative predictive value, TPR: true positive rate, PPV: positive predictive value, MCC: Matthews correlation coefficient, F1: F1 score (harmonic mean of PPV and TPR), AUC: area under the receiver operating characteristic curve. For PhD-SNPg, the performance measures (defined in Supplementary Materials) are averaged over five 10-fold cross-validation tests. The standard errors for all the performance measures are reported in Supplementary Table S6.
(a) FATHMM-MKL and CADD returned predictions on 99.3% and 99.9% of the total dataset, respectively.
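The threshold-based measures in the legend above can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions, which are assumed (not confirmed) to match those in the Supplementary Materials. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper. AUC is omitted because it requires the continuous prediction scores rather than the counts.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute threshold-based scores from confusion-matrix counts.

    Pathogenic is treated as the positive class and Benign as the
    negative class (an assumption made for this illustration).
    Assumes both classes and both predicted labels occur at least once.
    """
    total = tp + tn + fp + fn
    tpr = tp / (tp + fn)          # true positive rate (sensitivity)
    tnr = tn / (tn + fp)          # true negative rate (specificity)
    ppv = tp / (tp + fp)          # positive predictive value (precision)
    npv = tn / (tn + fn)          # negative predictive value
    q2 = (tp + tn) / total        # overall accuracy
    f1 = 2 * ppv * tpr / (ppv + tpr)  # harmonic mean of PPV and TPR
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Q2": q2, "TNR": tnr, "NPV": npv, "TPR": tpr,
            "PPV": ppv, "MCC": mcc, "F1": f1}


# Hypothetical counts, chosen only to exercise the formulas.
scores = binary_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Unlike Q2, MCC stays informative when the two classes are unbalanced, which may be why the tables report both.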
Performance of PhD-SNPg, FATHMM-MKL and CADD on the NewClinvar032016 dataset
| Method | Dataset | Q2 | TNR | NPV | TPR | PPV | MCC | F1 | AUC |
|---|---|---|---|---|---|---|---|---|---|
| PhD-SNPg | All | 0.86 | 0.77 | 0.88 | 0.93 | 0.85 | 0.72 | 0.88 | 0.92 |
| | Coding | 0.85 | 0.67 | 0.85 | 0.94 | 0.85 | 0.65 | 0.89 | 0.91 |
| | Non-coding | 0.88 | 0.86 | 0.91 | 0.90 | 0.84 | 0.75 | 0.87 | 0.93 |
| FATHMM-MKL (a) | All | 0.78 | 0.58 | 0.85 | 0.93 | 0.75 | 0.55 | 0.83 | 0.85 |
| | Coding | 0.81 | 0.58 | 0.82 | 0.94 | 0.81 | 0.57 | 0.87 | 0.86 |
| | Non-coding | 0.73 | 0.57 | 0.89 | 0.91 | 0.64 | 0.51 | 0.75 | 0.86 |
| CADD (a) | All | 0.86 | 0.82 | 0.85 | 0.89 | 0.87 | 0.72 | 0.88 | 0.92 |
| | Coding | 0.86 | 0.70 | 0.85 | 0.94 | 0.86 | 0.68 | 0.90 | 0.91 |
| | Non-coding | 0.87 | 0.92 | 0.85 | 0.81 | 0.90 | 0.74 | 0.85 | 0.92 |
Q2: overall accuracy, TNR: true negative rate, NPV: negative predictive value, TPR: true positive rate, PPV: positive predictive value, MCC: Matthews correlation coefficient, F1: F1 score (harmonic mean of PPV and TPR), AUC: area under the receiver operating characteristic curve. For PhD-SNPg, the performance measures (defined in Supplementary Materials) are averaged over five tests with the previous Clinvar012016 models. The standard error of every PhD-SNPg performance measure is below 1%.
(a) FATHMM-MKL and CADD returned predictions on 99.6% and 99.8% of the total dataset, respectively.