Hongyi Zhou, Mu Gao, Jeffrey Skolnick.
Abstract
The advance of next-generation sequencing technologies has made exome sequencing rapid and relatively inexpensive. A major application of exome sequencing is the identification of genetic variations likely to cause Mendelian diseases. This requires processing large amounts of sequence information, so computational approaches that can accurately and efficiently identify the subset of disease-associated variations are needed. The limited accuracy and high false positive rates of existing computational tools leave much room for improvement. Here, we develop a boosted tree regression machine-learning approach to predict human disease-associated amino acid variations by utilizing a comprehensive combination of protein sequence and structure features. Compared with the state-of-the-art methods SIFT, PolyPhen-2, MUTATIONASSESSOR, MUTATIONTASTER, and FATHMM, our method, ENTPRISE, exhibits significant improvement. In particular, on a testing dataset consisting only of proteins with balanced disease-associated and neutral variations (defined as having a ratio of neutral to disease-associated variations between 0.3 and 3), the Matthews Correlation Coefficient of ENTPRISE is 0.493, compared with 0.432 for PPH2-HumVar, 0.406 for SIFT, 0.403 for MUTATIONASSESSOR, 0.402 for PPH2-HumDiv, 0.305 for MUTATIONTASTER, and 0.181 for FATHMM. ENTPRISE is then applied to nucleic acid binding proteins in the human proteome. Disease-associated predictions are shown to be highly correlated with the number of protein-protein interactions. Both these predictions and the ENTPRISE server are freely available for academic users as a web service at http://cssb.biology.gatech.edu/entprise/.
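The abstract names boosted tree regression but this record gives no implementation details. As a rough illustration of the general technique only, the sketch below fits gradient-boosted regression stumps (depth-1 trees) under squared-error loss. The function names, the number of rounds, the learning rate, and the use of stumps are all assumptions made for the example, not ENTPRISE's actual settings.

```python
def fit_stump(X, residuals):
    """Find the single-feature threshold split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Midpoints between adjacent distinct values keep both sides non-empty.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    if best is None:  # no feature varies: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, threshold, left_mean, right_mean = stump
    return left_mean if row[j] <= threshold else right_mean


def fit_boosted_stumps(X, y, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the current residuals (hypothetical settings)."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, predictions)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, row)
                       for p, row in zip(predictions, X)]
    return base, learning_rate, stumps


def predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)
```

With labels coded as 1 for disease-associated and 0 for neutral, the regression output can be thresholded by a score cutoff, which is how the cutoffs reported in the performance tables would be applied.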
Year: 2016 PMID: 26982818 PMCID: PMC4794227 DOI: 10.1371/journal.pone.0150965
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Protein sequence parsing procedure.
The lower (upper) bound is defined as the aligned position of the template’s N-terminal (C-terminal) in the query sequence.
Summary of features.
| Number of features | Description |
|---|---|
| 1 | Entropy derived from multiple sequence alignment |
| 20 | Wild-type amino acid type |
| 20 | Mutant amino acid type |
| 20 | Domain composition of 20 amino acid types |
| 20 | Contacting composition—composition of 20 amino acid types whose Cα atoms are within a 12 Å distance of the mutated Cα atom position |
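The feature groups in this table add up to 1 + 20 + 20 + 20 + 20 = 81 values per variation. A minimal sketch of how such a vector could be assembled is shown below. The function names, the natural-log Shannon entropy, the use of fractions rather than raw counts for the compositions, the exclusion of the mutated residue from its own contact set, and the ordering of the groups are all assumptions for illustration; the record does not specify them.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types


def column_entropy(column):
    """Shannon entropy (natural log) of one alignment column, ignoring gaps."""
    residues = [r for r in column if r in AMINO_ACIDS]
    if not residues:
        return 0.0
    total = len(residues)
    return -sum((n / total) * math.log(n / total)
                for n in Counter(residues).values())


def one_hot(residue):
    """20-element indicator vector for a single amino acid type."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def composition(residues):
    """Fraction of each of the 20 amino acid types among the given residues."""
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    total = sum(counts.values())
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]


def contacting_residues(sequence, ca_coords, position, cutoff=12.0):
    """Residues whose C-alpha atom is within `cutoff` angstroms of the mutated one."""
    centre = ca_coords[position]
    return [aa for i, (aa, xyz) in enumerate(zip(sequence, ca_coords))
            if i != position and math.dist(centre, xyz) <= cutoff]


def feature_vector(msa_column, wild_type, mutant, domain_seq, ca_coords,
                   position, cutoff=12.0):
    """Concatenate entropy, wild-type, mutant, domain and contact features."""
    contacts = contacting_residues(domain_seq, ca_coords, position, cutoff)
    return ([column_entropy(msa_column)]
            + one_hot(wild_type)
            + one_hot(mutant)
            + composition(domain_seq)
            + composition(contacts))
```

Passing `cutoff=8.0` would correspond to the 8 Å contact definition used as an alternative to the default 12 Å.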
Summary of various implementations of ENTPRISE.
| Method | Amino acid type | Domain composition | Contacting composition (default cutoff = 12 Å) | Entropy |
|---|---|---|---|---|
| ENTPRISE | √ | √ | √ | √ |
| ENTPRISE_CUT8 | √ | √ | √ (cutoff = 8 Å) | √ |
| ENTPRISE_NOTYP | | √ | √ | √ |
| ENTPRISE_NODOM | √ | | √ | √ |
| ENTPRISE_NOCNT | √ | √ | | √ |
| ENTPRISE_NOENT | √ | √ | √ | |
| ENTPRISE_ONLYDOM | | √ | | |
a Symbol √ means feature type is used.
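The variants above differ only in which feature groups they use, so they can be described as selections over one full feature vector. The sketch below is one hypothetical way to express that. The omitted group for each variant is inferred from its name, and the 81-element layout (entropy, then wild-type and mutant types, then domain and contacting compositions) is an assumption.

```python
# Assumed positions of each feature group in an 81-element vector.
GROUP_SLICES = {
    "entropy": slice(0, 1),
    "amino_acid_type": slice(1, 41),  # wild-type and mutant blocks together
    "domain_composition": slice(41, 61),
    "contacting_composition": slice(61, 81),
}
ALL_GROUPS = tuple(GROUP_SLICES)


def without(group):
    """All feature groups except the named one."""
    return tuple(g for g in ALL_GROUPS if g != group)


# Which groups each variant keeps (inferred from the variant names).
VARIANT_GROUPS = {
    "ENTPRISE": ALL_GROUPS,
    # Same groups as ENTPRISE; the 8 A cutoff changes how contacts are computed.
    "ENTPRISE_CUT8": ALL_GROUPS,
    "ENTPRISE_NOTYP": without("amino_acid_type"),
    "ENTPRISE_NODOM": without("domain_composition"),
    "ENTPRISE_NOCNT": without("contacting_composition"),
    "ENTPRISE_NOENT": without("entropy"),
    "ENTPRISE_ONLYDOM": ("domain_composition",),
}


def select_features(vector, variant):
    """Keep only the feature groups used by the given variant."""
    selected = []
    for group in VARIANT_GROUPS[variant]:
        selected.extend(vector[GROUP_SLICES[group]])
    return selected
```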
Summary of variations in ENTPRISE datasets.
| Set | Disease-associated: in balanced proteins | Disease-associated: in unbalanced proteins | Disease-associated: sub-total | Neutral: in balanced proteins | Neutral: in unbalanced proteins | Neutral: sub-total | Total |
|---|---|---|---|---|---|---|---|
| ENTPRISE-TR | 4,727 (751) | 9,770 (707) | 14,497 | 4,489 (751) | 27,674 (8,866) | 32,163 | 46,659 (9,728) |
| ENTPRISE-TE | 4,501 (761) | 9,942 (700) | 14,443 | 4,406 (761) | 27,725 (8,847) | 32,131 | 46,574 (9,725) |
| ENTPRISE-balance | 4,501 (761) | - | 4,501 | 4,406 (761) | - | 4,406 | 8,907 (761) |
| Duplicate sampled ENTPRISE-TR | 28,362 | 9,770 | 38,132 | 26,934 | 27,673 | 54,607 | 92,739 |
a A balanced protein is defined as a protein whose ratio of neutral to disease-associated variations is between 0.3 and 3.0. An unbalanced protein has a ratio outside this range. Numbers in parentheses are numbers of proteins.
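The balanced/unbalanced split in the footnote can be sketched as follows. The helper names are hypothetical, the bounds are assumed to be inclusive, and a protein with no disease-associated variations (undefined ratio) is treated as unbalanced here, which is an assumption rather than something the record states.

```python
from collections import defaultdict


def is_balanced(n_disease, n_neutral, low=0.3, high=3.0):
    """True if the neutral/disease ratio lies in [low, high] (bounds assumed inclusive)."""
    if n_disease == 0:
        return False  # ratio undefined; treated as unbalanced in this sketch
    return low <= n_neutral / n_disease <= high


def split_proteins(variations):
    """Partition protein IDs given (protein_id, is_disease) pairs."""
    counts = defaultdict(lambda: [0, 0])  # protein -> [disease, neutral]
    for protein_id, is_disease in variations:
        counts[protein_id][0 if is_disease else 1] += 1
    balanced = {p for p, (d, n) in counts.items() if is_balanced(d, n)}
    return balanced, set(counts) - balanced
```

For example, a protein with 10 disease-associated and 5 neutral variations has a ratio of 0.5 and counts as balanced, while one with 10 disease-associated and 1 neutral variation (ratio 0.1) does not.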
Performance of different methods on the ENTPRISE-TE set.
| Method | Evaluated variations | MCC | ACC | Sen | Spe | PPV | NPV | AUC |
|---|---|---|---|---|---|---|---|---|
| SIFT | 40,120 | 0.395 | 0.674 | 0.815 | 0.598 | 0.520 | 0.858 | 0.786 |
| PPH2-HumDiv | 40,317 | 0.374 | 0.646 | 0.846 | 0.539 | 0.496 | 0.867 | 0.771 |
| PPH2-HumVar | 40,317 | 0.423 | 0.700 | 0.796 | 0.648 | 0.548 | 0.855 | 0.793 |
| MUTATIONASSESSOR | 39,758 | 0.417 | 0.710 | 0.744 | 0.692 | 0.564 | 0.834 | 0.788 |
| MUTATIONTASTER | 40,286 | 0.337 | 0.600 | | 0.447 | 0.462 | 0.879 | 0.695 |
| FATHMM | 39,741 | 0.538 | 0.764 | 0.838 | 0.724 | 0.623 | 0.891 | 0.866 |
| ENTPRISE(0.45) | 46,574 | | | 0.768 | | | | 0.907 |
| ENTPRISE(0.55) | 46,574 | 0.646 | 0.854 | 0.668 | 0.937 | 0.827 | 0.863 | 0.907 |
| ENTPRISE(0.20) | 46,574 | 0.488 | 0.688 | 0.947 | 0.572 | 0.499 | 0.960 | 0.907 |
| ENTPRISE_CUT8 | 46,574 | 0.651 | 0.850 | 0.764 | 0.889 | 0.755 | 0.893 | 0.908 |
| ENTPRISE_NOTYP | 46,574 | 0.645 | 0.850 | 0.738 | 0.900 | 0.768 | 0.884 | 0.906 |
| ENTPRISE_NODOM | 46,574 | 0.464 | 0.763 | 0.674 | 0.803 | 0.606 | 0.846 | 0.812 |
| ENTPRISE_NOCNT | 46,574 | 0.680 | 0.865 | 0.747 | 0.918 | 0.805 | 0.890 | 0.919 |
| ENTPRISE_NOENT | 46,574 | 0.645 | 0.851 | 0.716 | 0.912 | 0.786 | 0.877 | 0.903 |
| ENTPRISE_ONLYDOM | 46,574 | 0.691 | 0.860 | 0.855 | 0.862 | 0.736 | 0.930 | 0.913 |
a To be fair to other methods, only their overlapped variations with those of ENTPRISE are evaluated.
b ENTPRISE with alternative score cutoffs: 0.20, 0.45 (default), and 0.55.
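The threshold-dependent columns in this table (MCC, ACC, Sen, Spe, PPV, NPV) all follow from the confusion matrix at a given score cutoff. The sketch below uses the standard definitions. Treating scores at or above the cutoff as disease-associated is an assumption (the table only shows that raising the cutoff lowers sensitivity and raises specificity), and the function names are illustrative.

```python
import math


def confusion_counts(scores, labels, cutoff=0.45):
    """Count TP, FP, TN, FN; label 1 = disease-associated, 0 = neutral."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_disease = score >= cutoff  # assumed direction and inclusivity
        if predicted_disease and label:
            tp += 1
        elif predicted_disease:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Division that returns NaN instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "MCC": _ratio(tp * tn - fp * fn, mcc_den),
        "ACC": _ratio(tp + tn, tp + fp + tn + fn),
        "Sen": _ratio(tp, tp + fn),
        "Spe": _ratio(tn, tn + fp),
        "PPV": _ratio(tp, tp + fp),
        "NPV": _ratio(tn, tn + fn),
    }
```

Calling `metrics(*confusion_counts(scores, labels, cutoff))` with cutoffs 0.20, 0.45, and 0.55 would reproduce the kind of sensitivity/specificity trade-off seen across the three ENTPRISE rows.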
Performance of different methods on the ENTPRISE-balance set.
| Method | Evaluated variations | MCC | ACC | Sen | Spe | PPV | NPV | AUC |
|---|---|---|---|---|---|---|---|---|
| SIFT | 8,438 | 0.406 | 0.699 | 0.826 | 0.565 | 0.668 | 0.754 | 0.775 |
| PPH2-HumDiv | 8,522 | 0.402 | 0.693 | 0.863 | 0.511 | 0.652 | | 0.771 |
| PPH2-HumVar | 8,522 | 0.432 | 0.714 | 0.807 | 0.616 | 0.691 | 0.750 | 0.786 |
| MUTATIONASSESSOR | 8,455 | 0.403 | 0.702 | 0.748 | 0.653 | 0.696 | 0.709 | 0.765 |
| MUTATIONTASTER | 8,517 | 0.305 | 0.638 | | 0.375 | 0.601 | 0.756 | 0.662 |
| FATHMM | 8,459 | 0.181 | 0.592 | 0.693 | 0.484 | 0.588 | 0.598 | 0.637 |
| ENTPRISE(0.45) | 8,907 | 0.493 | | 0.669 | | | 0.708 | 0.818 |
| ENTPRISE(0.55) | 8,907 | 0.472 | 0.721 | 0.551 | 0.893 | 0.841 | 0.661 | 0.818 |
| ENTPRISE(0.20) | 8,907 | 0.394 | 0.677 | 0.906 | 0.443 | 0.624 | 0.821 | 0.818 |
| ENTPRISE_CUT8 | 8,907 | 0.493 | 0.742 | 0.660 | 0.827 | 0.795 | 0.704 | 0.815 |
| ENTPRISE_NOTYP | 8,907 | 0.437 | 0.713 | 0.611 | 0.817 | 0.773 | 0.673 | 0.788 |
| ENTPRISE_NODOM | 8,907 | 0.453 | 0.725 | 0.678 | 0.773 | 0.753 | 0.702 | 0.792 |
| ENTPRISE_NOCNT | 8,907 | 0.469 | 0.727 | 0.607 | 0.848 | 0.803 | 0.679 | 0.811 |
| ENTPRISE_NOENT | 8,907 | 0.401 | 0.691 | 0.554 | 0.832 | 0.771 | 0.646 | 0.779 |
| ENTPRISE_ONLYDOM | 8,907 | 0.254 | 0.626 | 0.709 | 0.542 | 0.612 | 0.645 | 0.670 |
a To be fair to other methods, only their overlapped variations with those of ENTPRISE are evaluated.
b ENTPRISE with alternative score cutoffs: 0.20, 0.45 (default), and 0.55.
Fig 2. Receiver operating characteristic curves of ENTPRISE, SIFT, PPH2-HumDiv, PPH2-HumVar, MUTATIONASSESSOR, MUTATIONTASTER, and FATHMM for the ENTPRISE-balance set.
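The area under a receiver operating characteristic curve does not depend on any single score cutoff, which is why the ENTPRISE(0.55) and ENTPRISE(0.20) rows report the same AUC. One standard way to compute it, sketched below, is as the probability that a randomly chosen disease-associated variation scores higher than a randomly chosen neutral one, with ties counted as one half. The function name is illustrative, and the pairwise loop is written for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))
```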
Performance of different methods on the 1000 Genome & VariSNP sets.
| Method | 1k-Genome: evaluated variations | 1k-Genome: false positive rate | VariSNP: evaluated variations | VariSNP: false positive rate |
|---|---|---|---|---|
| SIFT | 151,182 | 42.6% | 70,430 | 38.7% |
| PPH2-HumDiv | 151,981 | 48.2% | 70,758 | 44.2% |
| PPH2-HumVar | 151,981 | 36.4% | 70,758 | 31.6% |
| MUTATIONASSESSOR | 149,248 | 29.3% | 69,520 | 26.1% |
| MUTATIONTASTER | 151,830 | 51.5% | 70,645 | 46.4% |
| FATHMM | 143,032 | 12.9% | 67,433 | 14.2% |
| ENTPRISE(0.45) | 162,249 | | 61,215 | |
| ENTPRISE(0.55) | 162,249 | 5.4% | 61,215 | 4.4% |
| ENTPRISE(0.20) | 162,249 | 42.0% | 61,215 | 39.4% |
| ENTPRISE_CUT8 | 162,249 | 10.1% | 61,215 | 8.5% |
| ENTPRISE_NOTYP | 162,249 | 8.1% | 61,215 | 6.9% |
| ENTPRISE_NODOM | 162,249 | 20.5% | 61,215 | 18.5% |
| ENTPRISE_NOCNT | 162,249 | 6.4% | 61,215 | 5.3% |
| ENTPRISE_NOENT | 162,249 | 6.7% | 61,215 | 5.7% |
| ENTPRISE_ONLYDOM | 162,249 | 2.6% | 61,215 | 3.0% |
a To be fair to other methods, only overlapping variations with ENTPRISE are evaluated.
b Number of input variations: 74,837.
c ENTPRISE with alternative score cutoffs: 0.20, 0.45 (default), and 0.55.
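Because only false positive rates are reported for these two sets, every variation in them is presumably treated as neutral, so the rate reduces to the fraction of variations scored at or above the cutoff. The sketch below assumes exactly that; the function name and the inclusive comparison are illustrative choices.

```python
def false_positive_rate(neutral_scores, cutoff=0.45):
    """Fraction of presumed-neutral variations predicted as disease-associated."""
    if not neutral_scores:
        raise ValueError("at least one score is required")
    flagged = sum(1 for score in neutral_scores if score >= cutoff)
    return flagged / len(neutral_scores)
```

Lowering the cutoff can only keep or increase the number of flagged variations, in line with the much higher rates reported for the 0.20 cutoff than for the 0.55 cutoff.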
Performance of ENTPRISE for predicting cancer driver missense mutations in the COSMIC set.
| Method | MCC | ACC | Sen | Spe | AUC | False positive rate = 1-Spe |
|---|---|---|---|---|---|---|
| ENTPRISE | 0.724 | 0.853 | 0.745 | 0.961 | 0.914 | 3.9% |
| CHASM | 0.79 | 0.89 | 0.79 | 0.99 | 0.92 | 1.0% |
| MUTATIONASSESSOR | 0.62 | 0.81 | 0.76 | 0.86 | 0.89 | 14% |
| Condel | 0.58 | 0.78 | 0.75 | 0.82 | 0.85 | 18% |
| PolyPhen-2 | 0.54 | 0.77 | 0.79 | 0.75 | 0.82 | 25% |
| SIFT | 0.52 | 0.76 | 0.70 | 0.82 | 0.80 | 18% |
| SNAP | 0.37 | 0.68 | 0.55 | 0.81 | 0.67 | 19% |
| mCluster | 0.35 | 0.65 | 0.40 | 0.90 | 0.64 | 10% |
| logRE | 0.22 | 0.61 | 0.65 | 0.57 | 0.60 | 43% |
a Results of other methods are taken from Ref.[53] Table 1.
Performance of ENTPRISE for predicting cancer driver missense mutations in the TCGA set.
| Method | MCC | ACC | Sen | Spe | AUC | False positive rate = 1-Spe |
|---|---|---|---|---|---|---|
| ENTPRISE | 0.098 | 0.530 | 0.133 | 0.927 | 0.588 | 7.3% |
| CHASM | 0.05 | 0.50 | 0.0 | 1.00 | 0.34 | 0% |
| MUTATIONASSESSOR | 0.49 | 0.74 | 0.86 | 0.62 | 0.79 | 38% |
| Condel | 0.37 | 0.68 | 0.66 | 0.66 | 0.72 | 34% |
| PolyPhen-2 | 0.34 | 0.66 | 0.76 | 0.56 | 0.68 | 44% |
| SIFT | 0.30 | 0.65 | 0.74 | 0.56 | 0.66 | 44% |
| SNAP | 0.26 | 0.62 | 0.43 | 0.79 | 0.59 | 21% |
| mCluster | 0.17 | 0.54 | 0.08 | 0.99 | 0.50 | 1% |
| logRE | 0.07 | 0.52 | 0.39 | 0.64 | 0.50 | 36% |
a Results of other methods are taken from Ref.[53] Table 2.
Performance of ENTPRISE for predicting cancer driver missense mutations in the COBR set.
| Method | MCC | ACC | Sen | Spe | AUC | False positive rate = 1-Spe |
|---|---|---|---|---|---|---|
| ENTPRISE | 0.198 | 0.562 | 0.171 | 0.952 | 0.585 | 4.8% |
| CHASM | 0.08 | 0.50 | 0.0 | 1.0 | 0.36 | 0% |
| MUTATIONASSESSOR | 0.46 | 0.70 | 0.91 | 0.50 | 0.74 | 50% |
| Condel | 0.33 | 0.66 | 0.66 | 0.66 | 0.68 | 34% |
| PolyPhen-2 | 0.30 | 0.65 | 0.63 | 0.66 | 0.63 | 34% |
| SIFT | 0.29 | 0.64 | 0.73 | 0.55 | 0.63 | 45% |
| SNAP | 0.26 | 0.62 | 0.45 | 0.78 | 0.59 | 22% |
| mCluster | 0.0 | 0.50 | 0.0 | 1.0 | 0.46 | 0% |
| logRE | 0.08 | 0.54 | 0.44 | 0.64 | 0.53 | 36% |
a Results of other methods are taken from Ref.[53] Table 3.
Fig 3. (a) Normalized distribution of proteins versus the fraction of positions having all variations predicted to be disease-associated; (b) Average number of protein-protein interactions versus the fraction of positions having all variations predicted to be disease-associated ≥ the given threshold value.
Fig 4. Cumulative fraction of proteins predicted to be disease-associated versus the fraction of positions having all variations predicted to be disease-associated ≥ the given threshold value.
Fig 5. Number of disease-associated proteins predicted by ENTPRISE, SIFT, and PPH2-HumDiv for variations from 10 patients.