Hongyi Zhou, Mu Gao, Jeffrey Skolnick.
Abstract
To exploit the wealth of information provided by next-generation sequencing, it is crucial to identify the genetic mutations responsible for disease in general, or cancer in particular, among the thousands of neutral germline or somatic variations. Genome-wide association studies for detecting disease-associated genes or cancer drivers can only identify variations or driver genes that are common across a cohort of patients. Thus, they cannot discover unique disease-associated mutations or cancer driver genes on a personal basis. Moreover, even when such common variations exist, their significance is unknown. Here, we extend ENTPRISE, a machine-learning-based approach developed for predicting the disease association of missense mutations, to frameshift and nonsense mutations. The new approach, ENTPRISE-X, is shown to outperform the state-of-the-art methods VEST-indel and DDIG-in in predicting the disease association of germline frameshift mutations as measured by the Matthews correlation coefficient (MCC), a balanced measure: ENTPRISE-X achieves an MCC of 0.586, versus 0.412 for VEST-indel and 0.321 for DDIG-in. Large-scale testing on the ExAC dataset shows that ENTPRISE-X classifies a much lower fraction of variations as disease-causing (16%) than VEST-indel (26%) or DDIG-in (65%). A web server for ENTPRISE-X is freely available for academic users at http://cssb2.biology.gatech.edu/entprise-x.
Year: 2018 PMID: 29723276 PMCID: PMC5933770 DOI: 10.1371/journal.pone.0196849
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Table 1. Summary of variations in the ENTPRISE-X training and testing data sets.
| Data set and purpose | Frameshift: pathogenic | Frameshift: neutral | Nonsense: pathogenic | Nonsense: neutral |
|---|---|---|---|---|
| Training set: 1. For training a model in future applications. | ClinVar: 6,513 | ESP6500: 1,604; 1000 GP: 366 | ClinVar: 5,023 | ESP6500: 181; 1000 GP: 3,171 |
| Total numbers (sum of each column) | 6,513 | 1,970 | 5,023 | 32,51 |
| VEST-indel test set: For test on frameshift variations in comparison to VEST-indel & DDIG-in methods (see | ClinVar: 82 | Inter-species: 1,025 | ─ | ─ |
| ExAC set: For large scale false positive rate test on frameshift & nonsense variations in comparison to the VEST-indel & DDIG-in methods | ─ | ExAC: 56,917 | ─ | ExAC: 45,131 |
Table 2. Performance on the VEST-indel test set for frameshift variations.
| Method | MCC | Sensitivity | Specificity | F-score | False positive rate | False discovery rate |
|---|---|---|---|---|---|---|
| ENTPRISE-X | 0.626 | 0.943 | 0.916 | 0.620 | 8.4% | 54% |
| VEST-indel | 0.440 | 0.914 | 0.814 | 0.421 | 18.6% | 73% |
| DDIG-in | 0.321 | 0.943 | 0.663 | 0.297 | 33.7% | 82% |
| ENTPRISE-X | 0.586 | 0.878 | 0.912 | 0.590 | 8.8% | 55% |
| Baseline | 0.323 | 0.988 | 0.621 | 0.294 | 37.9% | 83% |
| Baseline | 0.224 | 0.598 | 0.775 | 0.271 | 22.5% | 83% |
| ENTPRISE-X_1 | 0.570 | 0.878 | 0.905 | 0.574 | 9.5% | 57% |
| ENTPRISE-X_2 | 0.555 | 0.854 | 0.905 | 0.562 | 9.5% | 58% |
| ENTPRISE-X_10alt | 0.587±0.006 | 0.887±0.006 | 0.910±0.003 | 0.590±0.006 | 9.0%±0.3% | 55.8%±0.7% |
| ENTPRISE-X-nolocal | 0.481 | 0.707 | 0.914 | 0.509 | 8.6% | 60% |
| ENTPRISE-X-nonew | 0.099 | 0.793 | 0.390 | 0.168 | 61.0% | 90% |
| ENTPRISE-X-noratio | 0.513 | 0.890 | 0.871 | 0.509 | 12.9% | 64% |
| ENTPRISE-X-noessential | 0.574 | 0.890 | 0.903 | 0.575 | 9.7% | 58% |
| ENTPRISE-X-nopathogen | 0.543 | 0.866 | 0.896 | 0.546 | 10.4% | 60% |
| ENTPRISE-X-nodisease | 0.368 | 0.683 | 0.859 | 0.396 | 14.1% | 72% |
| ENTPRISE-X-nointeract | 0.586 | 0.890 | 0.909 | 0.588 | 9.1% | 56% |
a To be fair to all methods, only the consensus mutations of three methods are evaluated in comparison to the other methods.
b Matthews correlation coefficient. The numbers in parentheses are the maximal possible values.
c 2(precision×recall)/(precision+recall), where precision = (true positive)/(true positive + false positive), recall = (true positive)/(true positive + false negative). Numbers in parentheses are the maximal possible values.
d When only the feature representing if the gene is disease-associated or not is used.
e When only the feature representing if the gene is essential or not is used.
f Results when using one of the two models trained on each half of the pathogenic data (ENTPRISE-X_1, ENTPRISE-X_2), and when training ENTPRISE-X on 10 different random partitions of the pathogenic part of the training set (ENTPRISE-X_10alt).
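The measures reported in the table follow directly from the four confusion-matrix counts. The Python sketch below computes them for a single method. The counts are not given in the source: they are an assumption, back-calculated from the 82 pathogenic and 1,025 neutral variations of the VEST-indel test set and the reported sensitivity (0.878) and specificity (0.912) of ENTPRISE-X. The function name `classification_metrics` is hypothetical.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Compute the table's measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # recall: fraction of pathogenic variations found
    specificity = tn / (tn + fp)  # fraction of neutral variations correctly rejected
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "mcc": mcc,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": f_score,
        "false_positive_rate": fp / (fp + tn),   # equals 1 - specificity
        "false_discovery_rate": fp / (tp + fp),  # equals 1 - precision
    }


# Assumed counts (not from the source): 82 pathogenic, 1,025 neutral variations.
metrics = classification_metrics(tp=72, fn=10, tn=935, fp=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts the sketch gives an MCC of about 0.586 and an F-score of about 0.590, in line with the second ENTPRISE-X row, and a false discovery rate near 0.556, close to the 55% listed. The false discovery rate is high despite the low false positive rate because neutral variations heavily outnumber pathogenic ones in this test set, so 90 false positives sit against only 72 true positives.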
Fig 1. Receiver operating characteristic curves of ENTPRISE-X, VEST-indel and DDIG-in.
Fig 2. False discovery rate by the ENTPRISE-X and VEST-indel methods at various cutoffs on the VEST-indel test set.
Table 3. Ten-fold cross-validation of ENTPRISE-X on the whole training set.
| Variation type | MCC | Sensitivity | Specificity | F-score | False positive rate | False discovery rate |
|---|---|---|---|---|---|---|
| Frameshift & nonsense | 0.655 | 0.851 | 0.815 | 0.871 | 18.5% | 10.8% |
| Frameshift | 0.616 | 0.871 | 0.815 | 0.909 | 18.5% | 4.9% |
| Nonsense | 0.619 | 0.806 | 0.815 | 0.789 | 18.5% | 22.8% |
| Frameshift & nonsense | 0.156 | 0.755 | 0.393 | 0.721 | 60.7% | 31.0% |
| Frameshift | 0.059 | 0.727 | 0.341 | 0.771 | 65.9% | 17.8% |
| Nonsense | 0.253 | 0.821 | 0.415 | 0.638 | 58.5% | 47.9% |
| Frameshift | 0.201 | 0.837 | 0.367 | 0.842 | 63.3% | 15.3% |
a To be fair to all methods, only the consensus mutations of the compared methods are evaluated.
Table 4. Comparison of the percentage of disease causing variations in the ExAC set.
| Method | Frameshift | Nonsense |
|---|---|---|
| | Evaluated #: 48,123 | Evaluated #: 40,482 |
| ENTPRISE-X | 16.5%(0.73:6.0%) | 15.7%(0.73:5.6%) |
| VEST-indel | 26.2%(0.82:9.5%) | - |
| DDIG-in | 64.4%(0.81:33.3%) | 65.4%(0.81:35.1%) |
a To be fair to all methods, only the consensus mutations of the three methods are reported. Numbers in parentheses are cutoff:false positive rate, using the cutoff that maximizes the MCC in Table 2.
Fig 3. Distribution of ENTPRISE-X scores for the neutral variations in the VEST-indel test set.
The area under the curve is normalized to one.
Fig 4. P-value derived from the VEST-indel test set by fitting to an extreme value distribution versus ENTPRISE-X score cutoffs.
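One way to turn a neutral score distribution into a P-value per cutoff is to fit an extreme value distribution to the neutral scores and read off its upper tail. The Python sketch below assumes a Gumbel (type I) form fitted by the method of moments. The source names neither the distribution family nor the fitting procedure, so both are assumptions, as are the function names and the made-up scores, which only stand in for the real neutral scores.

```python
from math import exp, pi, sqrt
from statistics import mean, stdev

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(scores):
    """Method-of-moments fit of a Gumbel (type I extreme value) distribution."""
    beta = stdev(scores) * sqrt(6) / pi     # scale parameter
    mu = mean(scores) - EULER_GAMMA * beta  # location parameter
    return mu, beta


def p_value(cutoff, mu, beta):
    """Upper-tail probability: chance that a neutral score reaches the cutoff."""
    return 1.0 - exp(-exp(-(cutoff - mu) / beta))


# Made-up neutral scores, for illustration only (not data from the study).
neutral_scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.22, 0.27, 0.35, 0.48, 0.62]
mu, beta = fit_gumbel(neutral_scores)
for cutoff in (0.30, 0.50, 0.73):
    print(f"cutoff {cutoff:.2f}: p = {p_value(cutoff, mu, beta):.4f}")
```

The P-value falls as the cutoff rises, which is the qualitative behavior such a curve should show. A maximum-likelihood fit would be a reasonable alternative to the method of moments used here.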
Table 5. Summary of patient annotations using ENTPRISE-X.
| patient | # of annotated variations | # of annotated genes | # of disease associated genes | false discovery rate |
|---|---|---|---|---|
| 1 | 396/337 | 274/201 | 40/77 | 0.267/0.880 |
| 2 | 473/437 | 313/245 | 53/119 | 0.230/0.694 |
| 3 | 561/629 | 341/297 | 72/182 | 0.185/0.550 |
| 4 | 417/395 | 270/207 | 40/94 | 0.263/0.742 |
| 5 | 397/393 | 262/190 | 38/86 | 0.269/0.745 |
| 6 | 410/377 | 271/226 | 34/105 | 0.311/0.725 |
| 7 | 454/340 | 284/198 | 34/79 | 0.326/0.845 |
| 8 | 380/352 | 261/205 | 39/97 | 0.261/0.712 |
| 9 | 431/414 | 278/227 | 48/96 | 0.226/0.797 |
| 10 | 402/372 | 255/206 | 47/94 | 0.212/0.739 |
| 11 | 494/451 | 319/256 | 51/123 | 0.244/0.701 |
| 12 | 572/588 | 353/299 | 69/184 | 0.200/0.548 |
| 13 | 547/580 | 342/301 | 67/166 | 0.199/0.611 |
a In each cell, the first number is from ENTPRISE-X and the second is from the DDIG-in method.
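The per-patient cells can be split into their two method values and summarized. The Python sketch below parses the false discovery rate column of the patient table and takes an unweighted mean over the 13 patients for each method. The unweighted mean and the helper name `split_cell` are assumptions, not the authors' stated procedure.

```python
from statistics import mean

# False discovery rate column of the patient table,
# one "ENTPRISE-X/DDIG-in" cell per patient (patients 1 to 13).
fdr_cells = [
    "0.267/0.880", "0.230/0.694", "0.185/0.550", "0.263/0.742", "0.269/0.745",
    "0.311/0.725", "0.326/0.845", "0.261/0.712", "0.226/0.797", "0.212/0.739",
    "0.244/0.701", "0.200/0.548", "0.199/0.611",
]


def split_cell(cell):
    """Split an 'a/b' table cell into (ENTPRISE-X value, DDIG-in value)."""
    first, second = (float(part) for part in cell.split("/"))
    return first, second


# Transpose the list of pairs into one tuple of values per method.
entprise_x, ddig_in = zip(*(split_cell(cell) for cell in fdr_cells))
print(f"ENTPRISE-X mean FDR: {mean(entprise_x):.3f}")
print(f"DDIG-in mean FDR:    {mean(ddig_in):.3f}")
```

For the values in the table, the mean false discovery rate is about 0.246 for ENTPRISE-X and about 0.715 for DDIG-in.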
Fig 5. Dependence of average false discovery rate of ENTPRISE-X for 13 human exomes on score cutoffs.