Ayşegül Özen, Mehmet Gönen, Ethem Alpaydın, Türkan Haliloğlu.
Abstract
BACKGROUND: Computational prediction of protein stability change due to single-site amino acid substitutions is of interest in protein design and analysis. We consider the following four ways to improve the performance of the currently available predictors: (1) We include additional sequence- and structure-based features, namely, the amino acid substitution likelihoods, the equilibrium fluctuations of the alpha- and beta-carbon atoms, and the packing density. (2) By implementing different machine learning integration approaches, we combine information from different features or representations. (3) We compare classification vs. regression methods to predict the sign vs. the magnitude of the stability change. (4) We allow a reject option for doubtful cases where the risk of misclassification is high.
Year: 2009 PMID: 19840377 PMCID: PMC2777163 DOI: 10.1186/1472-6807-9-66
Source DB: PubMed Journal: BMC Struct Biol ISSN: 1472-6807
Representations, original features, and the new features.
[Table body not fully recoverable from the source: for each of the three representations, the listed features include the mutated residue (M) and its ± 3 sequence neighbors (± 3 N), plus the structure-based additions described below.]
In all three representations, the amino acid substitution likelihood is used as a feature. The B-factors of the Cα and Cβ atoms, and the spatial neighbors determined using both the Cα and Cβ atoms, are features introduced into TO and ST. Abbreviations are given only for the features that we add.
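The sequence part of these representations, the mutated residue plus its ± 3 neighbors, can be sketched as below. This is our own illustration: the actual per-residue encoding (one-hot, physicochemical properties, etc.) is not recoverable from this excerpt, so the function simply returns the residue window, padded at the chain termini.

```python
def window_features(sequence, pos, k=3, pad="X"):
    """Return the mutated residue at `pos` together with its +/- k
    sequence neighbours, padding with `pad` near the termini.
    Illustrative sketch only; the paper's residue encoding is not
    recoverable from this excerpt."""
    padded = pad * k + sequence + pad * k
    i = pos + k  # index of the mutated residue in the padded string
    return padded[i - k : i + k + 1]
```

For example, a mutation at the N-terminus of a chain yields a window padded on the left: `window_features("MKTAYIA", 0)` returns `"XXXMKTA"`.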
The list of 34 possible input feature sets.
| Set | Rep. | PAM | BFA | BFB | CB |
| 1 | SO | - | - | - | - |
| 2 | SO | + | - | - | - |
| 3 | TO | - | - | - | - |
| 4 | TO | + | - | - | - |
| 5 | TO | - | + | - | - |
| 6 | TO | - | - | + | - |
| 7 | TO | - | - | - | + |
| 8 | TO | + | + | - | - |
| 9 | TO | + | - | + | - |
| 10 | TO | + | - | - | + |
| 11 | TO | - | + | + | - |
| 12 | TO | - | + | - | + |
| 13 | TO | - | - | + | + |
| 14 | TO | + | + | + | - |
| 15 | TO | + | + | - | + |
| 16 | TO | + | - | + | + |
| 17 | TO | - | + | + | + |
| 18 | TO | + | + | + | + |
| 19 | ST | - | - | - | - |
| 20 | ST | + | - | - | - |
| 21 | ST | - | + | - | - |
| 22 | ST | - | - | + | - |
| 23 | ST | - | - | - | + |
| 24 | ST | + | + | - | - |
| 25 | ST | + | - | + | - |
| 26 | ST | + | - | - | + |
| 27 | ST | - | + | + | - |
| 28 | ST | - | + | - | + |
| 29 | ST | - | - | + | + |
| 30 | ST | + | + | + | - |
| 31 | ST | + | + | - | + |
| 32 | ST | + | - | + | + |
| 33 | ST | - | + | + | + |
| 34 | ST | + | + | + | + |
The new features are added to each of the three representations (SO, TO, or ST) one at a time and in combinations of two or more. The original features are given in Table 1 and are not repeated here.
Figure 1. Distribution of the S2783 data over the free energy change due to single-site mutation, ΔΔG. The regions separated by dashed lines are used to obtain similar training and test splits: a random one-third of the instances in each region is reserved for testing and the remaining two-thirds are used for training.
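The region-based split described in the caption can be sketched as below. The `region_split` helper and the boundary values in the usage example are our own illustration, not taken from the paper.

```python
import random

def region_split(ddg_values, boundaries, test_frac=1/3, seed=0):
    """Split instance indices into train/test so that each DDG region
    (delimited by the given boundary values) contributes the same
    fraction of test instances. Illustrative sketch; the paper's
    region boundaries are not recoverable from this excerpt."""
    rng = random.Random(seed)
    # assign each instance to the region its DDG value falls into
    regions = {}
    for i, v in enumerate(ddg_values):
        r = sum(v >= b for b in boundaries)
        regions.setdefault(r, []).append(i)
    train, test = [], []
    for idx in regions.values():
        rng.shuffle(idx)
        k = round(len(idx) * test_frac)
        test.extend(idx[:k])
        train.extend(idx[k:])
    return sorted(train), sorted(test)
```

With three equally populated regions, one-third of each region ends up in the test split, mirroring the caption's procedure.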
Performance evaluation measures.
| Accuracy | (TP + TN) / (TP + TN + FP + FN) |
| Error Rate | (FP + FN) / (TP + TN + FP + FN) |
| Precision | TP / (TP + FP) |
| Recall | TP / (TP + FN) |
| FP Rate | FP / (FP + TN) |
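These measures follow directly from the confusion-matrix counts; a minimal sketch (the function name is ours):

```python
def classification_measures(tp, fp, tn, fn):
    """Standard evaluation measures computed from confusion-matrix
    counts: true/false positives (tp, fp) and true/false negatives
    (tn, fn)."""
    total = tp + fp + tn + fn
    return {
        "accuracy":   (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision":  tp / (tp + fp),
        "recall":     tp / (tp + fn),
        "fp_rate":    fp / (fp + tn),
    }
```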
Risk matrix.
| | Correct | Wrong | Reject |
| + | 0 | 1 | λ |
| - | 0 | 1 | λ |
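Under a 0/1/λ risk matrix of this kind, the standard Bayes rule with a reject option is to reject whenever the expected risk of the best class, 1 − max posterior, exceeds the reject loss λ. A sketch for the two-class case (the threshold logic is the textbook rule; the function name and interface are ours):

```python
def decide_with_reject(p_positive, reject_loss):
    """Bayes decision with a reject option: choosing a class costs 0
    if correct and 1 if wrong, rejecting always costs reject_loss
    (lambda). Reject when 1 - max posterior exceeds the reject loss;
    otherwise pick the more probable class."""
    p_max = max(p_positive, 1.0 - p_positive)
    if 1.0 - p_max > reject_loss:
        return "reject"
    return "+" if p_positive >= 0.5 else "-"
```

Doubtful cases near a posterior of 0.5 are rejected, which is exactly how accuracy improves at the price of a nonzero reject rate in the reject-option tables below.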
The algorithm to select the classifiers to be combined.
| 1: | Initialize the subset |
| 2: | Initialize the subset |
| 3: | Remove the most accurate ( |
| 4: | Perform McNemar's test for all pairs between |
| 5: | Decrease the degree of confidence, |
| 6: | |
| 7: | Select the most accurate and most diverse ( |
| 8: | Go to |
| 9: | |
| 10: | Use the ( |
| 11: |
The aim is to select the most accurate and at the same time the most diverse classifiers.
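Step 4 of the algorithm relies on McNemar's test to decide whether two classifiers differ significantly on the same test set. A sketch of the statistic, computed from per-instance correctness, with the usual continuity correction (whether the paper applies the correction is not stated in this excerpt):

```python
def mcnemar_statistic(correct_a, correct_b):
    """McNemar's chi-square statistic (with continuity correction)
    for comparing two classifiers on the same test instances.
    Inputs are per-instance booleans: whether classifiers A and B
    got each instance right."""
    # counts of instances on which exactly one classifier is correct
    n_a_only = sum(a and not b for a, b in zip(correct_a, correct_b))
    n_b_only = sum(b and not a for a, b in zip(correct_a, correct_b))
    if n_a_only + n_b_only == 0:
        return 0.0
    return (abs(n_a_only - n_b_only) - 1) ** 2 / (n_a_only + n_b_only)
```

A value above the chi-square critical value (3.84 at 95% confidence, 1 degree of freedom) indicates the two classifiers' error patterns differ significantly, i.e. they are diverse in the sense the selection algorithm exploits.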
Figure 2. Accuracy of the best (R.D.B) triplets in early integration for each representation of the S1615 data set. The effect of each extra feature is assessed by adding the features to the set of original features one at a time and in combinations of two or more. SVM is the best classifier for all representations.
Early integration results for S1615 data set.
| 0.814 | 0.778 | 0.752 | 0.703 | 0.838 | 0.904 | |
| 0.812 | 0.781 | 0.766 | 0.702 | 0.839 | 0.904 | |
| 0.812 | 0.819 | 0.770 | 0.739 | 0.822 | 0.905 | |
| 0.817 | 0.844 | 0.788 | 0.756 | 0.825 | 0.909 | |
| 0.814 | 0.777 | 0.771 | 0.734 | 0.838 | 0.904 | |
| 0.817 | 0.775 | 0.800 | 0.729 | 0.842 | 0.904 | |
The accuracy of each base-learner trained with original data and with extra features added in SO/SO*, TO/TO* or ST/ST*. The values reported for each classifier are respectively the validation and test accuracies of the original representation and the new representation.
The precision, recall, and FP rates of the most accurate classifiers on the test set in early integration for S1615 data set.
| 0.711 | 0.800 | 0.702 | |
| 0.284 | 0.282 | 0.284 | |
| 0.015 | 0.009 | 0.016 | |
Performance of late integration of the four triplets (ST.PAMCB.SVM), (TO.BFA.SVM), (ST.CBBFB.DT), and (TO.CBBFABFB.k-NN) for S1615 data set.
| 0.847 | 0.903 | |
| 0.819 | 0.694 | |
| 0.677 | 0.284 | |
| 0.071 | 0.017 |
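Late integration combines the outputs of already-trained base classifiers rather than their features. A common form is (weighted) voting over the predicted signs; the sketch below is illustrative, and whether the paper uses plain or weighted voting is not recoverable from this excerpt:

```python
def late_integration(predictions, weights=None):
    """Combine the +/- outputs of several base classifiers by
    (weighted) voting. With no weights given, this is plain
    majority voting; ties break toward '+'."""
    if weights is None:
        weights = [1.0] * len(predictions)
    score = sum(w * (1 if p == "+" else -1)
                for p, w in zip(predictions, weights))
    return "+" if score >= 0 else "-"
```

For the four triplets above, each base classifier would contribute one vote per test instance.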
McNemar's test results for the triplets (ST.PAMCB.SVM), (TO.BFA.SVM), (ST.CBBFB.DT), and (TO.CBBFABFB.k-NN) for S1615 data set.
| 11.72 | 66.61 | 154.10 | |
| 42.12 | 135.64 | ||
| 41.32 | |||
Multikernel SVM test results as intermediate integration for S1615 data set.
| 0.872 | 0.381 | 0.176 | 0.038 | |
| 0.872 | 0.381 | 0.176 | 0.038 | |
| 0.833 | 0.343 | 0.485 | 0.122 | |
| 0.879 | 0.459 | 0.258 | 0.040 | |
| 0.818 | 0.311 | 0.470 | 0.137 | |
| 0.878 | 0.448 | 0.252 | 0.041 | |
The combination weights obtained for the original and modified features for S1615 data set.
| (0.19)1 | |
| (0.19)1 | |
| (0.36)M | |
| (0.21)M | |
| (0.04)1 | |
| (0.03)1 | |
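In intermediate integration, the multikernel SVM operates on a weighted sum of per-representation kernels, with weights like those tabulated above. A sketch of the fixed-weight combination (the paper learns the weights jointly with the SVM; here they are simply given, and the function names are ours):

```python
def combined_kernel(kernels, weights):
    """Return a kernel that is the weighted sum of the given base
    kernels, k(x, y) = sum_m eta_m * k_m(x, y) - the combination
    underlying multikernel (intermediate) integration."""
    def k(x, y):
        return sum(w * base(x, y) for base, w in zip(kernels, weights))
    return k
```

The combined kernel is then passed to a standard SVM or SVR; each representation's influence on the decision is governed by its weight η_m.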
Comparison of best of three integration methods for S1615 data set.
| (S | (S | (T | |
| 0.842 ± 0.047 | 0.847 ± 0.046 | 0.826 ± 0.044 | |
| 0.904 ± 0.004 | 0.903 ± 0.005 | 0.879 ± 0.006 |
early = late > intermediate according to paired t-test
Early integration results for S2783 data set.
| 0.795 | 0.794 | 0.748 | 0.762 | 0.829 | 0.832 | 0.825 | 0.828 | |
| 0.793 | 0.794 | 0.751 | 0.756 | 0.829 | 0.829 | 0.824 | 0.827 | |
| 0.804 | 0.803 | 0.762 | 0.769 | 0.821 | 0.824 | 0.813 | 0.818 | |
| 0.806 | 0.799 | 0.770 | 0.780 | 0.826 | 0.829 | 0.818 | 0.824 | |
| 0.797 | 0.797 | 0.758 | 0.766 | 0.829 | 0.831 | 0.825 | 0.828 | |
| 0.798 | 0.797 | 0.766 | 0.782 | 0.829 | 0.830 | 0.825 | 0.828 | |
The accuracy of each base-learner trained with original data and with extra features added in SO/SO*, TO/TO* or ST/ST*. The values reported for each classifier and regressor are respectively the validation and test accuracies of the original representation and the new representation.
The precision, recall, and FP rates of the most accurate classifiers and regressors on the test set in early integration for S2783 data set.
| 0.790 | 0.807 | 0.784 | 0.854 | 0.868 | 0.855 | |
| 0.612 | 0.579 | 0.614 | 0.527 | 0.501 | 0.529 | |
| 0.072 | 0.061 | 0.075 | 0.040 | 0.034 | 0.040 | |
Performance of late integration for S2783 data set.
| 0.830 | 0.832 | 0.819 | 0.827 | |
| 0.795 | 0.790 | 0.853 | 0.858 | |
| 0.604 | 0.615 | 0.495 | 0.520 | |
| 0.071 | 0.073 | 0.038 | 0.038 | |
Multikernel SVM and SVR test results as intermediate integration for S2783 data set.
| 0.800 | 0.716 | 0.589 | 0.107 | 0.789 | 0.688 | 0.570 | 0.115 | |
| 0.799 | 0.708 | 0.604 | 0.114 | 0.790 | 0.692 | 0.569 | 0.113 | |
| 0.805 | 0.710 | 0.621 | 0.113 | 0.797 | 0.705 | 0.580 | 0.107 | |
| 0.802 | 0.697 | 0.629 | 0.122 | 0.792 | 0.677 | 0.611 | 0.129 | |
| 0.806 | 0.705 | 0.636 | 0.119 | 0.793 | 0.681 | 0.607 | 0.126 | |
| 0.804 | 0.700 | 0.633 | 0.121 | 0.789 | 0.671 | 0.610 | 0.132 | |
The combination weights obtained with SVM for the original and modified features for S2783 data set.
| (0.19)1 | |
| (0.19)1 | |
| (0.19)M | |
| (0.21)M | |
| (0.04)1 | |
| (0.02)1 | |
The combination weights obtained with SVR for the original and modified features for S2783 data set.
| (0.15)1 | |
| (0.16)1 | |
| (0.25)M | |
| (0.28)M | |
| (0.02)1 | |
| (0.01)1 | |
Performance measures of SVM early integration (SO) for S2783 data set with reject option.
| 2 | 1 | 0.829 | 0.793 | 0.602 | 0.071 | 0.000 | 0.831 | 0.788 | 0.615 | 0.073 | 0.000 |
| 2 | 2 | 0.834 | 0.813 | 0.582 | 0.059 | 0.024 | 0.839 | 0.816 | 0.596 | 0.058 | 0.025 |
| 2 | 5 | 0.840 | 0.839 | 0.544 | 0.043 | 0.059 | 0.845 | 0.844 | 0.560 | 0.042 | 0.060 |
| 5 | 1 | 0.842 | 0.815 | 0.599 | 0.058 | 0.064 | 0.847 | 0.821 | 0.615 | 0.057 | 0.066 |
| 5 | 2 | 0.848 | 0.839 | 0.569 | 0.044 | 0.092 | 0.852 | 0.844 | 0.587 | 0.044 | 0.094 |
| 5 | 5 | 0.854 | 0.871 | 0.531 | 0.029 | 0.122 | 0.857 | 0.874 | 0.545 | 0.030 | 0.127 |
| 10 | 1 | 0.884 | 0.839 | 0.735 | 0.058 | 0.298 | 0.884 | 0.844 | 0.743 | 0.058 | 0.303 |
| 10 | 2 | 0.891 | 0.863 | 0.712 | 0.043 | 0.322 | 0.892 | 0.870 | 0.717 | 0.042 | 0.329 |
| 10 | 5 | 0.897 | 0.863 | 0.621 | 0.028 | 0.364 | 0.894 | 0.885 | 0.620 | 0.031 | 0.371 |
Performance measures of SVR early integration (SO) for S2783 data set with reject option.
| 2 | 1 | 0.835 | 0.862 | 0.538 | 0.038 | 0.020 | 0.836 | 0.859 | 0.544 | 0.039 | 0.019 |
| 2 | 2 | 0.838 | 0.883 | 0.519 | 0.030 | 0.038 | 0.839 | 0.878 | 0.526 | 0.031 | 0.036 |
| 2 | 5 | 0.839 | 0.947 | 0.280 | 0.003 | 0.147 | 0.839 | 0.963 | 0.278 | 0.004 | 0.149 |
| 5 | 1 | 0.931 | 0.894 | 0.887 | 0.051 | 0.513 | 0.926 | 0.886 | 0.880 | 0.054 | 0.516 |
| 5 | 2 | 0.952 | 0.947 | 0.750 | 0.007 | 0.608 | 0.947 | 0.963 | 0.743 | 0.010 | 0.612 |
| 5 | 5 | 0.954 | 0.857 | 0.571 | 0.001 | 0.640 | 0.949 | 0.997 | 0.530 | 0.000 | 0.646 |
| 10 | 1 | 0.966 | 0.947 | 0.827 | 0.008 | 0.656 | 0.961 | 0.963 | 0.826 | 0.010 | 0.658 |
| 10 | 2 | 0.968 | 0.913 | 0.749 | 0.003 | 0.677 | 0.963 | 0.976 | 0.749 | 0.006 | 0.678 |
| 10 | 5 | 0.969 | 0.775 | 0.586 | 0.000 | 0.695 | 0.965 | 0.999 | 0.566 | 0.000 | 0.699 |
Performance measures of SVM late integration for S2783 data set with reject option.
| 2 | 1 | 0.829 | 0.792 | 0.606 | 0.072 | 0.000 | 0.831 | 0.787 | 0.617 | 0.074 | 0.000 |
| 2 | 2 | 0.833 | 0.806 | 0.597 | 0.064 | 0.012 | 0.836 | 0.804 | 0.609 | 0.065 | 0.013 |
| 2 | 5 | 0.838 | 0.825 | 0.581 | 0.054 | 0.031 | 0.840 | 0.820 | 0.594 | 0.056 | 0.030 |
| 5 | 1 | 0.839 | 0.812 | 0.609 | 0.062 | 0.032 | 0.841 | 0.808 | 0.621 | 0.064 | 0.035 |
| 5 | 2 | 0.842 | 0.825 | 0.595 | 0.055 | 0.048 | 0.844 | 0.820 | 0.610 | 0.057 | 0.049 |
| 5 | 5 | 0.847 | 0.845 | 0.566 | 0.043 | 0.073 | 0.849 | 0.844 | 0.582 | 0.045 | 0.076 |
| 10 | 1 | 0.849 | 0.825 | 0.618 | 0.056 | 0.072 | 0.851 | 0.820 | 0.634 | 0.058 | 0.075 |
| 10 | 2 | 0.852 | 0.837 | 0.599 | 0.048 | 0.089 | 0.856 | 0.837 | 0.617 | 0.049 | 0.094 |
| 10 | 5 | 0.859 | 0.836 | 0.492 | 0.027 | 0.147 | 0.861 | 0.866 | 0.507 | 0.029 | 0.154 |
Performance measures of SVR late integration for S2783 data set with reject option.
| 2 | 1 | 0.828 | 0.862 | 0.512 | 0.036 | 0.021 | 0.834 | 0.862 | 0.532 | 0.037 | 0.019 |
| 2 | 2 | 0.834 | 0.907 | 0.472 | 0.021 | 0.053 | 0.837 | 0.897 | 0.484 | 0.023 | 0.056 |
| 2 | 5 | 0.833 | 0.961 | 0.327 | 0.005 | 0.123 | 0.838 | 0.963 | 0.328 | 0.005 | 0.130 |
| 5 | 1 | 0.938 | 0.916 | 0.856 | 0.032 | 0.474 | 0.940 | 0.909 | 0.865 | 0.033 | 0.477 |
| 5 | 2 | 0.949 | 0.961 | 0.770 | 0.009 | 0.535 | 0.952 | 0.963 | 0.779 | 0.009 | 0.541 |
| 5 | 5 | 0.952 | 0.937 | 0.620 | 0.001 | 0.570 | 0.953 | 0.979 | 0.630 | 0.003 | 0.575 |
| 10 | 1 | 0.966 | 0.961 | 0.872 | 0.010 | 0.604 | 0.965 | 0.963 | 0.860 | 0.010 | 0.606 |
| 10 | 2 | 0.970 | 0.936 | 0.757 | 0.001 | 0.638 | 0.966 | 0.977 | 0.748 | 0.004 | 0.638 |
| 10 | 5 | 0.970 | 0.860 | 0.660 | 0.000 | 0.652 | 0.967 | 0.984 | 0.675 | 0.002 | 0.652 |
Performance measures of SVM intermediate integration (TO*) for S2783 data set with reject option.
| 2 | 1 | 0.807 | 0.712 | 0.632 | 0.116 | 0.000 | 0.802 | 0.693 | 0.636 | 0.124 | 0.000 |
| 2 | 2 | 0.823 | 0.755 | 0.598 | 0.083 | 0.051 | 0.822 | 0.749 | 0.603 | 0.086 | 0.055 |
| 2 | 5 | 0.836 | 0.802 | 0.540 | 0.053 | 0.108 | 0.837 | 0.804 | 0.548 | 0.052 | 0.112 |
| 5 | 1 | 0.851 | 0.769 | 0.678 | 0.082 | 0.165 | 0.849 | 0.765 | 0.679 | 0.084 | 0.168 |
| 5 | 2 | 0.862 | 0.802 | 0.636 | 0.058 | 0.207 | 0.862 | 0.804 | 0.639 | 0.058 | 0.211 |
| 5 | 5 | 0.874 | 0.725 | 0.357 | 0.022 | 0.309 | 0.872 | 0.796 | 0.351 | 0.021 | 0.318 |
| 10 | 1 | 0.878 | 0.802 | 0.717 | 0.066 | 0.300 | 0.879 | 0.804 | 0.728 | 0.065 | 0.308 |
| 10 | 2 | 0.891 | 0.787 | 0.550 | 0.034 | 0.372 | 0.892 | 0.813 | 0.562 | 0.034 | 0.383 |
| 10 | 5 | 0.898 | 0.399 | 0.224 | 0.013 | 0.436 | 0.899 | 0.579 | 0.244 | 0.012 | 0.445 |
Performance measures of SVR intermediate integration (TO) for S2783 data set with reject option.
| 2 | 1 | 0.810 | 0.723 | 0.595 | 0.099 | 0.035 | 0.808 | 0.715 | 0.596 | 0.102 | 0.037 |
| 2 | 2 | 0.839 | 0.843 | 0.405 | 0.026 | 0.177 | 0.840 | 0.855 | 0.393 | 0.024 | 0.185 |
| 2 | 5 | 0.843 | 0.589 | 0.101 | 0.001 | 0.258 | 0.841 | 0.946 | 0.103 | 0.003 | 0.260 |
| 5 | 1 | 0.918 | 0.887 | 0.629 | 0.021 | 0.505 | 0.917 | 0.889 | 0.615 | 0.022 | 0.515 |
| 5 | 2 | 0.926 | 0.589 | 0.275 | 0.002 | 0.555 | 0.924 | 0.946 | 0.268 | 0.004 | 0.562 |
| 5 | 5 | 0.926 | 0.310 | 0.127 | 0.000 | 0.565 | 0.925 | 0.877 | 0.125 | 0.001 | 0.573 |
| 10 | 1 | 0.961 | 0.589 | 0.436 | 0.003 | 0.698 | 0.952 | 0.946 | 0.458 | 0.005 | 0.697 |
| 10 | 2 | 0.963 | 0.310 | 0.200 | 0.000 | 0.708 | 0.953 | 0.877 | 0.256 | 0.001 | 0.708 |
| 10 | 5 | 0.963 | 0.310 | 0.200 | 0.000 | 0.708 | 0.953 | 0.877 | 0.256 | 0.001 | 0.708 |
Figure 3.
Figure 4.
Figure 5. Distribution of the correctly classified (grey) and misclassified (black) instances of the S1615 data set after the S. Misclassified instances cluster mainly around zero; in the regions (-∞, -4) and (3, ∞) all instances are correctly classified.
Comparison of our results for S1615 data set with previously published studies.
| [ | SVM | 2048 | 0.77 (20-fold cv) | Seq |
| [ | SVM | 1383 * | 0.73 (20-fold cv) | Seq |
| [ | NN | 1615 | 0.79 (20-fold cv) | Seq+Str |
| [ | SVM | 1496‡ | S | Seq+Str |
| [ | iPTREE | 1615 | 0.87 (10-fold cv) | Seq+Str |
| Ours | Early | 1122 (training) | Seq+Str | |
*Filtered from the set of 2048 mutations [41].
† A subset of the training set that was previously used in training.
‡ Filtered from the set of 1615 mutations [9].
Machine learning method, data set, and performance assessment are the main points of comparison. (Seq: sequence-based information; Seq+Str: sequence- and structure-based information.)