David Salgado1,2, Jean-Pierre Desvignes1,2, Ghadi Rai1,2, Arnaud Blanchard1,2, Morgane Miltgen1,2, Amélie Pinard1,2, Nicolas Lévy1,2,3, Gwenaëlle Collod-Béroud1,2, Christophe Béroud1,2,3.
Abstract
Whole-exome sequencing (WES) is increasingly applied to research and clinical diagnosis of human diseases. It typically yields a large number of genetic variants. Depending on the mode of inheritance, only one or two of these correspond to the pathogenic mutations that are responsible for the disease and present in affected individuals. It is therefore crucial to filter out nonpathogenic variants and limit downstream analysis to a handful of candidate mutations. We have developed a new computational combinatorial system, UMD-Predictor (http://umd-predictor.eu), to efficiently annotate cDNA substitutions of all human transcripts for their potential pathogenicity. It combines biochemical properties, impact on splicing signals, localization in protein domains, variation frequency in the global population, and conservation, assessed through both the BLOSUM62 global substitution matrix and a protein-specific conservation among 100 species. We compared its accuracy with that of the seven most used and reliable prediction tools, using the largest reference variation datasets, which include more than 140,000 annotated variations. The system consistently demonstrated better accuracy, specificity, Matthews correlation coefficient, diagnostic odds ratio, and speed, and provided the shortest list of candidate mutations for WES. Web services allow its implementation in any bioinformatics pipeline for next-generation sequencing analysis. It could benefit a wide range of users and applications, from gene discovery to clinical diagnosis.
Keywords: NGS; bioinformatics; mutation; nonsense; nonsynonymous; pathogenicity prediction; substitution; synonymous
Year: 2016 PMID: 26842889 PMCID: PMC5067603 DOI: 10.1002/humu.22965
Source DB: PubMed Journal: Hum Mutat ISSN: 1059-7794 Impact factor: 4.878
Comparison Between UMD‐Predictor and Other Predictors Using the Varibench–dbSNP [Sherry et al., 2001; Sasidharan Nair and Vihinen, 2013] Dataset (n = 17,329)
| | SIFT | PPH2 | Provean | Mutation assessor | CONDEL | MutationTaster | CADD | UMD‐Predictor |
|---|---|---|---|---|---|---|---|---|
| TP | 9,596 | 10,290 | 9,638 | 9,775 | 8,797 | 11,174 | 10,182 | 10,727 |
| TN | 2,805 | 3,045 | 3,088 | 3,162 | 3,287 | 2,937 | 3,214 | 4,024 |
| FP | 1,229 | 1,189 | 1,147 | 1,073 | 948 | 1,298 | 1,021 | 211 |
| FN | 3,498 | 2,803 | 3,456 | 3,319 | 4,297 | 1,920 | 2,912 | 2,367 |
| PPV | 0.89 | 0.90 | 0.89 | 0.90 | 0.90 | 0.90 | 0.91 | 0.98 |
| NPV | 0.45 | 0.52 | 0.47 | 0.49 | 0.43 | 0.60 | 0.52 | 0.63 |
| Sensitivity | 0.73 | 0.79 | 0.74 | 0.75 | 0.67 | 0.85 | 0.78 | 0.82 |
| Specificity | 0.70 | 0.72 | 0.73 | 0.75 | 0.78 | 0.69 | 0.76 | 0.95 |
| Accuracy | 0.72 | 0.77 | 0.73 | 0.75 | 0.70 | 0.81 | 0.77 | 0.85 |
| MCC | 0.38 | 0.46 | 0.41 | 0.44 | 0.39 | 0.52 | 0.48 | 0.69 |
| DOR | 6.3 | 9.7 | 7.7 | 9.0 | 7.2 | 12.6 | 11.2 | 86.6 |
| log(DOR) | 1.84 | 2.27 | 2.04 | 2.20 | 1.97 | 2.53 | 2.42 | 4.46 |
TP, true positives; TN, true negatives; FP, false positives; FN, false negatives; PPV, positive predictive value; NPV, negative predictive value; MCC, Matthews correlation coefficient; DOR, diagnostic odds ratio.
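The derived rows of the table follow from the four confusion-matrix counts by the standard definitions. The sketch below recomputes them for the UMD‐Predictor column. The function name `confusion_metrics` and the dictionary keys are illustrative choices, not part of the original work. The natural logarithm is assumed for log(DOR), which is consistent with the published pair 86.6 and 4.46 (ln 86.6 ≈ 4.46).

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Compute the table's summary statistics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)           # true-positive rate
    specificity = tn / (tn + fp)           # true-negative rate
    ppv = tp / (tp + fp)                   # positive predictive value
    npv = tn / (tn + fn)                   # negative predictive value
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Diagnostic odds ratio; undefined if fp or fn is zero
    dor = (tp * tn) / (fp * fn)
    return {
        "ppv": ppv,
        "npv": npv,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
        "dor": dor,
        "log_dor": math.log(dor),          # natural log (assumption, see text)
    }


# Counts taken from the UMD-Predictor column of the table above
umd = confusion_metrics(tp=10727, tn=4024, fp=211, fn=2367)
for name, value in umd.items():
    print(f"{name}: {value:.2f}")
```

Recomputed values agree with the published ones to two decimals; the DOR recomputed from the counts is about 86.4 against the published 86.6, a difference small enough to be explained by rounding in intermediate steps. Note that the counts for some predictors sum to fewer than 17,329, so their statistics are based only on the variants they scored.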
Figure 1. log(DOR) comparison between predictors using the Varibench–dbSNP [Sherry et al., 2001; Sasidharan Nair and Vihinen, 2013] dataset (n = 17,329). X‐axis, sensitivity; Y‐axis, specificity; color‐coded scale, log(DOR).
Figure 2. Sensitivity of methods in distinguishing pathogenic and nonpathogenic variants. Receiver operating characteristic (ROC) curves, including AUC, for seven predictors using the Varibench–dbSNP [Sherry et al., 2001; Sasidharan Nair and Vihinen, 2013] dataset (n = 17,329).
Receiver Operating Characteristics (ROCs) Area for Seven Predictors Using the Varibench–dbSNP [Sherry et al., 2001; Sasidharan Nair and Vihinen, 2013] Dataset (n = 17,329)
| | Confidence | ROC area | Standard error | Min ROC area | Max ROC area |
|---|---|---|---|---|---|
| UMD‐Predictor | 0.95 | 0.954 | 0.002 | 0.950 | 0.957 |
| SIFT | 0.95 | 0.784 | 0.005 | 0.774 | 0.794 |
| PPH2_VAR | 0.95 | 0.826 | 0.004 | 0.819 | 0.834 |
| PROVEAN | 0.95 | 0.789 | 0.004 | 0.781 | 0.797 |
| CONDEL | 0.95 | 0.778 | 0.004 | 0.771 | 0.786 |
| MUT‐ASS | 0.95 | 0.813 | 0.004 | 0.806 | 0.820 |
| CADD | 0.95 | 0.834 | 0.004 | 0.827 | 0.841 |
Min ROC area, lower bound of the confidence interval; Max ROC area, upper bound of the confidence interval (the two bounds are returned as a vector of length two). All data were generated using the “ci.cvAUC” function of the “cvAUC” package (https://github.com/ledell/cvAUC) for the ROCR R‐package [Sing et al., 2005].
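The Min and Max columns are consistent with a normal-approximation interval, ROC area ± z × standard error, at the stated confidence level. The sketch below reproduces only that final step from the published (rounded) values. It does not reimplement how “ci.cvAUC” estimates the standard error, and the function name `auc_confidence_interval` and the clamping to [0, 1] are choices made for this illustration.

```python
from statistics import NormalDist


def auc_confidence_interval(auc, se, confidence=0.95):
    """Two-sided normal-approximation interval for an AUC estimate."""
    # z is about 1.96 for a 95% two-sided interval
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    lower = max(0.0, auc - z * se)   # an AUC cannot fall below 0
    upper = min(1.0, auc + z * se)   # or exceed 1
    return lower, upper


# SIFT row of the table: ROC area 0.784, standard error 0.005
low, high = auc_confidence_interval(0.784, 0.005)
print(f"{low:.3f} to {high:.3f}")
```

Because the published standard errors are rounded to three decimals, bounds recomputed this way can differ from the table in the last digit for some rows.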
Figure 3. Sensitivity of methods in distinguishing pathogenic and nonpathogenic variants. Receiver operating characteristic (ROC) curves are shown for discriminating pathogenic from nonpathogenic mutations as defined by the Varibench–dbSNP [Sherry et al., 2001; Sasidharan Nair and Vihinen, 2013] dataset (n = 17,329). UMD‐Predictor values were obtained without the mutations’ frequency information.
Receiver Operating Characteristics (ROCs) Area for Seven Predictors Using the Varibench–dbSNP [Sherry et al., 2001; Sasidharan Nair and Vihinen, 2013] Dataset (n = 17,329)
| | Confidence | ROC area | Standard error | Min ROC area | Max ROC area |
|---|---|---|---|---|---|
| UMD‐Predictor | 0.95 | 0.828 | 0.004 | 0.821 | 0.836 |
| SIFT | 0.95 | 0.784 | 0.005 | 0.774 | 0.794 |
| PPH2_VAR | 0.95 | 0.826 | 0.004 | 0.819 | 0.834 |
| PROVEAN | 0.95 | 0.789 | 0.004 | 0.781 | 0.797 |
| CONDEL | 0.95 | 0.778 | 0.004 | 0.771 | 0.786 |
| MUT‐ASS | 0.95 | 0.813 | 0.004 | 0.806 | 0.820 |
| CADD | 0.95 | 0.834 | 0.004 | 0.827 | 0.841 |
Min ROC area, lower bound of the confidence interval; Max ROC area, upper bound of the confidence interval (the two bounds are returned as a vector of length two). All data were generated using the “ci.cvAUC” function of the “cvAUC” package (https://github.com/ledell/cvAUC) for the ROCR R‐package [Sing et al., 2005]. UMD‐Predictor values were obtained without the mutations’ frequency information.
Comparison Between UMD‐Predictor and Other Prediction Tools for VCF Processing Using Three Files That Contained 58,145, 54,006, and 57,936 Variants, Respectively
| | SIFT | PPH2 | Provean | Mutation assessor | CONDEL | MutationTaster | CADD | UMD‐Predictor |
|---|---|---|---|---|---|---|---|---|
| PT1 (s) | 1,200 | 420 | 3,240 | 540 | 3,000 | 2,100 | 8,700 | 93 |
| PT2 (s) | 240 | 420 | 8,100 | 960 | 1,500 | 2,340 | 9,360 | 206 |
| PT3 (s) | 540 | 420 | 4,140 | 600 | 1,500 | 2,340 | 11,160 | 240 |
| NV1 | 1,958 | 2,881 | 1,540 | 1,339 | 1,376 | 2,677 | 3,241 | 871 |
| NV2 | 1,341 | 2,350 | 1,332 | 1,049 | 1,111 | 2,437 | 2,555 | 540 |
| NV3 | 1,842 | 2,781 | 1,542 | 1,350 | 1,376 | 3,401 | 3,098 | 807 |
All tests have been performed using the Web interface of each system. PT1‐3, processing time in seconds for VCF files 1–3; NV1‐3, number of variants predicted to be pathogenic for VCF files 1–3.
A preprocessing of the VCF file is required before analysis (this time has not been included in the tests).
VCF analysis was not possible on‐line; therefore, data have been generated through downloaded data.
VCF files cannot include more than 1,000 rows; therefore, the original VCF files have been split (this time has not been included in the tests).
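Splitting a VCF file to respect a 1,000-row limit can be sketched as follows. This is a minimal illustration, not the procedure used in the tests: the function name `split_vcf_lines` is hypothetical, and it assumes that the limit applies to variant records rather than to header lines and that every chunk must repeat the full header to remain a valid VCF file.

```python
def split_vcf_lines(lines, max_records=1000):
    """Yield lists of lines, each a VCF with at most max_records records."""
    header = []   # lines starting with '#': meta-information and column names
    chunk = []    # variant records collected for the current output file
    for line in lines:
        if line.startswith("#"):
            header.append(line)
            continue
        if not line.strip():
            continue                      # ignore blank lines
        chunk.append(line)
        if len(chunk) == max_records:
            yield header + chunk          # header is repeated in every chunk
            chunk = []
    if chunk:
        yield header + chunk              # remaining records, if any


# Synthetic example: 2 header lines followed by 2,500 records
demo = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
]
demo += [f"1\t{pos}\t.\tA\tG\t.\t.\t.\n" for pos in range(1, 2501)]
parts = list(split_vcf_lines(demo))
print([len(part) - 2 for part in parts])  # records per chunk
```

For a real file, the same generator can be fed an open file handle, with each yielded list written to its own output file using `writelines`.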