Abstract
Identifying cancer-associated mutations (driver mutations) is critical for understanding the cellular function of the cancer genome that leads to activation of oncogenes or inactivation of tumor suppressor genes. Many proposed approaches use supervised machine learning techniques for prediction, with features obtained from various databases. However, it is often unclear which features are important for predicting driver mutations. In this study, we propose a novel feature selection method (called DX) applied to a set of 126 candidate features. To obtain the best performance, the rotation forest algorithm was adopted for the experiments. On the training dataset, which was collected from the COSMIC and Swiss-Prot databases, we obtain high prediction performance, with 88.03% accuracy, 93.9% precision, and 81.35% recall, when the 11 top-ranked features are used. Comparison with various other techniques on the TP53, EGFR, and Cosmic2plus datasets shows the generality of our method.
Year: 2014 PMID: 25250338 PMCID: PMC4163459 DOI: 10.1155/2014/905951
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Six groups of 20 amino acids.
| Group 1 | Group 2 | Group 3 | Group 4 | Group 5 | Group 6 |
|---|---|---|---|---|---|
| D, E, N, Q | H, R, K | C | S, T, P, A, G | M, I, L, V | F, Y, W |
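The grouping above can be held as a simple lookup table. The following is a minimal sketch, assuming the groups are used to describe which group a wild-type residue and its substituted residue belong to; the names `AMINO_ACID_GROUPS`, `GROUP_OF`, `group_change`, and `crosses_groups` are hypothetical, and this record does not state how the authors actually encode the groups as features.

```python
from typing import Tuple

# Six-group partition of the 20 standard amino acids, copied from the table above.
AMINO_ACID_GROUPS = {
    1: "DENQ",
    2: "HRK",
    3: "C",
    4: "STPAG",
    5: "MILV",
    6: "FYW",
}

# Invert the table into a residue -> group number lookup.
GROUP_OF = {aa: group for group, members in AMINO_ACID_GROUPS.items() for aa in members}


def group_change(wild_type: str, mutant: str) -> Tuple[int, int]:
    """Return (group of the wild-type residue, group of the mutant residue)."""
    return GROUP_OF[wild_type.upper()], GROUP_OF[mutant.upper()]


def crosses_groups(wild_type: str, mutant: str) -> bool:
    """True when a substitution moves the residue into a different group."""
    before, after = group_change(wild_type, mutant)
    return before != after
```

For example, `group_change("R", "H")` stays within Group 2, while `crosses_groups("R", "W")` is true because W belongs to Group 6.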
Figure 1: The accuracy of two classifiers as features are added sequentially, using 5-fold cross-validation.
Figure 2: Bar plots showing the feature distribution for the optimal features. Blue denotes the distribution for DX-RF: 0 features derived from amino acid residue change features (AARC), 0 from substitution scoring matrix features (SSM), 7 from protein sequence-specific features (PSS), and 4 from annotated features (AF).
The performance of two classifiers on the training dataset.
| Method | Precision | Recall | F-measure | Accuracy | MCC | ROC area |
|---|---|---|---|---|---|---|
| DX-RF | 0.939 | 0.8135 | 0.8717 | 0.88028 | 0.7674 | 0.9353 |
| Variance | 0.003 | 0.0022 | 0.0015 | 0.0014 | 0.003 | 0.0014 |
| mRMR-RF | 0.9277 | 0.8294 | 0.8758 | 0.8824 | 0.7691 | 0.9429 |
| Variance | 0.0026 | 0.0044 | 0.0022 | 0.0018 | 0.0034 | 0.0013 |
Performance of predicting on three test datasets (TP53, EGFR, and Cosmic2plus).
| Method | Test set | Accuracy (%) | Recall (%) | Precision (%) | F-measure (%) | MCC |
|---|---|---|---|---|---|---|
| mRMR-RF | TP53 + neutral | 88.86 | 100 | 62.4 | 76.85 | 0.734 |
| | EGFR + neutral | 86.68 | 100 | 15.88 | 27.41 | 0.3702 |
| | Cosmic2plus + neutral | 85.3 | 81.04 | 59.26 | 68.46 | 0.6041 |
| DX-LibSVM | TP53 + neutral | 83.93 | 100 | 53.48 | 69.69 | 0.6553 |
| | EGFR + neutral | 80.78 | 100 | 11.56 | 20.73 | 0.3047 |
| | Cosmic2plus + neutral | 81.51 | 86.52 | 51.83 | 64.83 | 0.5655 |
| DX-SVMLight | TP53 + neutral | 88.31 | 100 | 61.25 | 75.97 | 0.7243 |
| | EGFR + neutral | 86.02 | 100 | 15.23 | 26.44 | 0.3612 |
| | Cosmic2plus + neutral | 85.42 | 84.46 | 59.08 | 69.53 | 0.6199 |
| DX-RF | TP53 + neutral | | | | | |
| | EGFR + neutral | | | | | |
| | Cosmic2plus + neutral | | | | | |
The detailed information of the four classifiers.
| Method | Dataset | TP | FP | TN | FN |
|---|---|---|---|---|---|
| mRMR-RF | TP53 | 1029 | 620 | 3919 | 0 |
| | EGFR | 117 | 620 | 3919 | 0 |
| | Cosmic2plus | 902 | 620 | 3919 | 211 |
| DX-SVMLight | TP53 | 1029 | 651 | 3888 | 0 |
| | EGFR | 117 | 651 | 3888 | 0 |
| | Cosmic2plus | 940 | 651 | 3888 | 173 |
| DX-LibSVM | TP53 | 1029 | 895 | 3644 | 0 |
| | EGFR | 117 | 895 | 3644 | 0 |
| | Cosmic2plus | 963 | 895 | 3644 | 150 |
| DX-RF | TP53 | 1029 | 597 | 3942 | 0 |
| | EGFR | 117 | 597 | 3942 | 0 |
| | Cosmic2plus | 892 | 597 | 3942 | 221 |
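The test-set performance figures follow from the TP, FP, TN, and FN counts above through the standard definitions of accuracy, recall, precision, F-measure, and the Matthews correlation coefficient (MCC). The sketch below uses those textbook formulas; the function name `confusion_metrics` is hypothetical, and it is an assumption that the authors computed their figures the same way, although the mRMR-RF counts for TP53 do reproduce the reported 88.86% accuracy, 62.4% precision, 76.85% F-measure, and 0.734 MCC.

```python
import math


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts.

    Ratios are returned as fractions in [0, 1]; MCC lies in [-1, 1].
    A metric whose denominator is zero is reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Counts for mRMR-RF on TP53, taken from the table above.
mrmr_tp53 = confusion_metrics(tp=1029, fp=620, tn=3919, fn=0)
```

The same call can be applied to any other row of counts, for instance `confusion_metrics(1029, 597, 3942, 0)` for DX-RF on TP53.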