Marcin J Mizianty, Lukasz Kurgan.
Abstract
BACKGROUND: Knowledge of structural class is used by numerous methods for identification of structural/functional characteristics of proteins and could be used for the detection of remote homologues, particularly for chains that share twilight-zone similarity. In contrast to existing sequence-based structural class predictors, which target four major classes and which are designed for high identity sequences, we predict seven classes from sequences that share twilight-zone identity with the training sequences.
Year: 2009 PMID: 20003388 PMCID: PMC2805645 DOI: 10.1186/1471-2105-10-414
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Cartoon structures of proteins that cover the seven structural classes defined in the SCOP database. The panels show the proteins with PDB identifiers (a) 1mty, (b) 1a8d, (c) 2f62, (d) 2bf5, (e) 1vqq, (f) 1u7g, and (g) 4hir. Helices are shown in light gray, coils in dark gray, and strands in black.
Figure 2. Distribution of sequences with respect to their maximal pairwise sequence identity in the D498 dataset.
Figure 3. Diagram of the proposed MODAS method.
The property groups used to aggregate similar amino acids.
| R groups | | Electronic groups | |
|---|---|---|---|
| Non-polar aliphatic | A, I, L, V | Donors | A, D, E, P |
| Glycine | G | Weak donors | I, L, V |
| Non-polar | F, M, P, W | Acceptors | K, N, R |
| Polar uncharged | C, N, Q, S, T, Y | Weak acceptors | F, M, Q, T, Y |
| Polar charged | D, E, H, K, R | Neutral | C, G, H, S, W |
| Hydrophobic | A, C, F, I, L, M, P, V, W, Y | Group 1 | H, R, K |
| | | Group 2 | D, E, N, Q |
| | | Group 3 | C |
| Hydrophilic | D, E, G, H, K, N, Q, R, S, T | Group 4 | S, T, P, A, G |
| | | Group 5 | M, I, L, V |
| | | Group 6 | F, Y, W |
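As an illustration of how such property groups can aggregate similar amino acids, the following minimal sketch computes the fraction of residues in a chain that fall into each R group. The function name `group_composition` and the choice of a simple composition feature are assumptions made for this example; the exact features that MODAS derives from these groups are not specified here.

```python
from collections import Counter

# R-group membership copied from the table above.
R_GROUPS = {
    "non-polar aliphatic": "AILV",
    "glycine": "G",
    "non-polar": "FMPW",
    "polar uncharged": "CNQSTY",
    "polar charged": "DEHKR",
}

def group_composition(sequence, groups):
    """Return the fraction of residues in `sequence` that belong to each group.

    Residues that are not listed in any group (for example "X") are ignored.
    """
    residue_to_group = {
        residue: name for name, members in groups.items() for residue in members
    }
    counts = Counter(
        residue_to_group[residue]
        for residue in sequence.upper()
        if residue in residue_to_group
    )
    total = sum(counts.values())
    return {name: (counts[name] / total if total else 0.0) for name in groups}
```

The same function works for the electronic, hydrophobicity, or numbered groups by passing a different dictionary, provided each residue appears in at most one group of that dictionary.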
Results of the feature and classifier selection for the considered seven structural classes.
| Class | Kernel | C | Feature selection method | # of selected features |
|---|---|---|---|---|
| all-α | RBF (γ = 0.05) | 10 | Wrapper with SVM | 117 |
| all-β | RBF (γ = 0.1) | 7 | Wrapper with NB | 53 |
| α/β | Polynomial (exp = 2) | 2 | ReliefF | 46 |
| α+β | RBF (γ = 0.15) | 4 | CFS | 163 |
| Multi-domain | Polynomial (exp = 1.5) | 0.5 | Wrapper with SVM | 105 |
| Membrane | Polynomial (exp = 1.5) | 10 | Symmetrical Uncertainty | 46 |
| Small | Polynomial (exp = 2.5) | 15 | Wrapper with NB | 18 |
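To make the kernel settings in the table above concrete, the sketch below writes out the two kernel families and stores the per-class parameters. It assumes the common forms exp(-γ‖x − y‖²) for the RBF kernel and (x · y)^exp for the polynomial kernel; the exact polynomial form used by the authors' SVM implementation is not stated here, and the names `CLASS_SETTINGS` and `kernel_for` are hypothetical.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def polynomial_kernel(x, y, exponent):
    """Homogeneous polynomial kernel (x . y) ** exponent (an assumed form)."""
    dot = sum(a * b for a, b in zip(x, y))
    # A fractional exponent is real-valued only for a non-negative dot product.
    if dot < 0:
        raise ValueError("this sketch assumes a non-negative dot product")
    return dot ** exponent

# (kernel function, kernel parameter, C) per class, copied from the table above.
CLASS_SETTINGS = {
    "all-α": (rbf_kernel, 0.05, 10),
    "all-β": (rbf_kernel, 0.1, 7),
    "α/β": (polynomial_kernel, 2, 2),
    "α+β": (rbf_kernel, 0.15, 4),
    "Multi-domain": (polynomial_kernel, 1.5, 0.5),
    "Membrane": (polynomial_kernel, 1.5, 10),
    "Small": (polynomial_kernel, 2.5, 15),
}

def kernel_for(class_name):
    """Return a two-argument kernel configured for the given structural class."""
    function, parameter, _complexity = CLASS_SETTINGS[class_name]
    return lambda x, y: function(x, y, parameter)
```

For instance, `kernel_for("α/β")([1, 2], [3, 4])` evaluates (1·3 + 2·4)² under the assumed polynomial form. The complexity constant C is kept in the settings but is only used when an SVM is actually trained.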
Number of features selected for each structural class for different categories of features.
| Class | AA sequence | PSSM | PSSM and predicted secondary structure | Predicted secondary structure | Collocations of H and E segments | Total |
|---|---|---|---|---|---|---|
| α | 8 | 26 | 52 | 21 | 10 | 117 |
| β | 2 | 28 | 9 | 8 | 6 | 53 |
| α/β | 0 | 0 | 17 | 17 | 12 | 46 |
| α+β | 2 | 11 | 101 | 27 | 22 | 163 |
| Multi-domain | 3 | 17 | 43 | 26 | 16 | 105 |
| Membrane | 6 | 16 | 18 | 6 | 0 | 46 |
| Small | 6 | 6 | 3 | 3 | 0 | 18 |
Number of the selected features for the features computed from the predicted secondary structure.
| Class | AA+PSSM | Secondary structure (including PSSM) | | | Collocations of helical and strand segments | | |
|---|---|---|---|---|---|---|---|
| α | 34 | 22 | 17 | 34 | 8 | 1 | 1 |
| β | 30 | 12 | 3 | 2 | 2 | 2 | 2 |
| α/β | 0 | 1 | 20 | 13 | 2 | 1 | 9 |
| α+β | 13 | 5 | 50 | 73 | 18 | 4 | 0 |
| Multi-domain | 20 | 17 | 24 | 28 | 8 | 5 | 3 |
| Membrane | 22 | 0 | 24 | 0 | 0 | 0 | 0 |
| Small | 12 | 2 | 2 | 2 | 0 | 0 | 0 |
Figure 4. Scatter plots for two representative features for each structural class (left column) and helix and strand contents (right column) for the a) all-α; b) all-β; c) α/β; d) α+β; e) multi-domain; f) membrane and cell surface proteins; and g) small proteins classes. The plots were computed on the ASTRALtraining dataset. They use markers whose colors and shapes indicate the class and the number of protein chains, respectively, for a given combination of the values of the two features. The larger the marker, the more chains are found for the corresponding values of the two features. The darker the shading of the marker, the larger the fraction of those chains that belong to the target class.
Experimental results for the test on the independent ASTRALtraining dataset for the proposed MODAS method that considers the 4 major structural classes, 6 classes excluding the small proteins class, and all 7 considered classes.
| # of classes | Accuracy | | | | | | | | MCC | | | | | | | GC2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | all-α | all-β | α/β | α+β | multi-domain | membrane | small | overall | all-α | all-β | α/β | α+β | multi-domain | membrane | small | |
| 4 | 94.06 | 83.38 | 85.01 | 71.47 | | | | 83.01 | 0.92 | 0.79 | 0.78 | 0.61 | | | | 0.63 |
| 6 | 93.28 | 82.63 | 82.20 | 71.07 | 26.42 | 57.97 | | 80.24 | 0.90 | 0.78 | 0.75 | 0.61 | 0.22 | 0.74 | | 0.49 |
| 7 | 91.72 | 82.18 | 82.20 | 70.29 | 26.42 | 57.97 | 84.26 | 79.89 | 0.89 | 0.78 | 0.76 | 0.60 | 0.22 | 0.75 | 0.84 | 0.52 |
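The per-class accuracy and MCC values reported in these tables can be computed from true and predicted labels in a one-versus-rest fashion. The sketch below assumes that per-class accuracy means the percentage of chains of a class that are predicted correctly, and it uses the standard definition of the Matthews correlation coefficient; the function name `per_class_scores` is hypothetical.

```python
import math

def per_class_scores(true_labels, predicted_labels, target):
    """Return (accuracy in percent, MCC) for one class treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == target and p == target for t, p in pairs)
    tn = sum(t != target and p != target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)

    # Assumed definition: share of the chains in the class that are recovered.
    accuracy = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0

    # Standard MCC; defined as 0 here when any marginal count is zero.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, mcc
```

MCC ranges from -1 to 1 and accounts for all four confusion counts, which makes it more informative than accuracy alone for small classes such as the multi-domain class.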
Results of the experimental comparison of the proposed MODAS method and the competing SCEC and SCPRED methods on the ASTRALtest dataset with the four major structural classes.
| Algorithm | Training dataset | Accuracy | MCC | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| MODAS with 4 classes | ASTRALtraining | 0.61 | 0.63 | ||||||||
| MODAS with 6 classes | ASTRALtraining | 93.27 | 82.73 | 82.31 | 71.07 | 81.84 | 0.77 | ||||
| MODAS with 7 classes | ASTRALtraining | 91.71 | 82.27 | 82.31 | 70.29 | 81.17 | 0.91 | 0.79 | 0.77 | 0.61 | |
| SCPRED | ASTRALtraining | 93.13 | 78.33 | 83.38 | 64.27 | 79.14 | 0.77 | 0.70 | 0.52 | 0.57 | |
| SCPRED | 25PDB | 92.81 | 79.09 | 80.05 | 63.74 | 78.36 | 0.78 | 0.67 | 0.51 | 0.56 | |
| SCEC | Web server | 75.74 | 72.73 | 78.42 | 28.14 | 62.80 | 0.65 | 0.55 | 0.59 | 0.22 | 0.29 |
The MODAS method was used to make predictions for all 7 classes, 6 classes (excluding the small proteins class), and the 4 major classes. The SCPRED method was trained on the ASTRALtraining dataset with the proteins from the 4 major classes (this method can handle only prediction of the 4 classes) and on the 25PDB dataset based on results in [53]. The SCEC predictions were generated using the web server at http://biomine.ece.ualberta.ca/Structural_Class/SCEC.html. Bold font indicates the best results.
Results of the experimental comparison between the proposed MODAS method and competing structural class prediction methods on the 25PDB dataset.
| Classifier used (name of the method, if any) | Feature vector (# features) | Reference | Accuracy | MCC | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVM with 1st order polyn. kernel | autocorrelation (30) | 73 | 50.1 | 49.4 | 28.8 | 29.5 | 34.2 | 0.16 | 0.16 | 0.05 | 0.05 | 0.02 |
| Multinomial logistic regression | custom dipeptides (16) | 58 | 56.2 | 44.5 | 41.3 | 18.8 | 40.2 | 0.23 | 0.20 | 0.31 | 0.06 | 0.05 |
| Bagging with random tree | CV (20) | 54 | 58.7 | 47.0 | 35.5 | 24.7 | 41.8 | 0.33 | 0.26 | 0.22 | 0.06 | 0.06 |
| Information discrepancy | tripeptides (8000) | 59, 60 | 45.8 | 48.5 | 51.7 | 32.5 | 44.7 | 0.39 | 0.39 | 0.25 | 0.06 | 0.11 |
| LogitBoost with decision tree | CV (20) | 46 | 56.9 | 51.5 | 45.4 | 30.2 | 46.0 | 0.41 | 0.32 | 0.32 | 0.06 | 0.10 |
| Information discrepancy | dipeptides (400) | 59, 60 | 59.6 | 54.2 | 47.1 | 23.5 | 47.0 | 0.46 | 0.40 | 0.24 | 0.04 | 0.12 |
| LogitBoost with decision stump | CV (20) | 54 | 62.8 | 52.6 | 50.0 | 32.4 | 49.4 | 0.49 | 0.35 | 0.34 | 0.11 | 0.13 |
| SVM with 3rd order polyn. kernel | CV (20) | 54 | 61.2 | 53.5 | 57.2 | 27.7 | 49.5 | 0.46 | 0.35 | 0.39 | 0.11 | 0.13 |
| SVM with Gaussian kernel | CV (20) | 47 | 68.6 | 59.6 | 59.8 | 28.6 | 53.9 | 0.52 | 0.42 | 0.43 | 0.15 | 0.17 |
| Multinomial logistic regression | custom (66) | 73 | 69.1 | 61.6 | 60.1 | 38.3 | 57.1 | 0.56 | 0.44 | 0.48 | 0.21 | 0.21 |
| Nearest neighbor | Composition of tripeptides (8000) | 52 | 60.6 | 60.7 | 67.9 | 44.3 | 58.6 | --- | --- | --- | --- | --- |
| SVM with RBF kernel | custom (34) | 72 | 69.7 | 62.1 | 67.1 | 39.3 | 59.5 | 0.60 | 0.50 | 0.53 | 0.21 | 0.25 |
| Multinomial logistic regression | custom (34) | 72 | 71.1 | 65.3 | 66.5 | 37.3 | 60.0 | 0.61 | 0.51 | 0.51 | 0.22 | 0.25 |
| StackingC ensemble | custom (34) | 72 | 74.6 | 67.9 | 70.2 | 32.4 | 61.3 | 0.62 | 0.53 | 0.55 | 0.22 | 0.26 |
| Linear logistic regression | custom (58) | 30 | 75.2 | 67.5 | 62.1 | 44.0 | 62.2 | 0.63 | 0.54 | 0.54 | 0.27 | 0.27 |
| SVM with 1st order polyn. kernel | custom (58) | 30 | 77.4 | 66.4 | 61.3 | 45.4 | 62.7 | 0.65 | 0.54 | 0.55 | 0.27 | 0.28 |
| SVM with RBF kernel | custom (56) | 61 | 76.5 | 67.3 | 66.8 | 45.8 | 64.0 | 0.62 | 0.51 | 0.50 | 0.28 | --- |
| Discriminant analysis | custom (16) | 78 | 64.3 | 65.0 | 61.7 | 65.0 | 64.0 | --- | --- | --- | --- | --- |
| SVM with Gaussian kernel | custom (8 PSIPRED-based) | 79 | 80.6 | 73.4 | 68.5 | 79.1 | 0.87 | 0.67 | 0.54 | 0.54 | ||
| SVM with Gaussian kernel | PSIPRED-based (13) | 79 | 79.8 | 74.9 | 69.0 | 79.3 | 0.87 | 0.68 | 0.55 | 0.55 | ||
| SVM with RBF kernel (SCPRED) | custom (9) | 79 | 80.1 | 74.0 | 79.7 | 0.87 | 0.69 | 0.57 | 0.55 | |||
| SVM with polynomial or RBF kernels (MODAS) | custom (117, 53, 46, 163) | this paper | 92.3 | 68.3 | ||||||||
The results were obtained using the jackknife test. The methods are ordered by their average accuracies, an ordering that coincides with that of the GC2 scores. Best results are shown in bold, and "---" indicates results that were not reported by the original authors and that cannot be duplicated.
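The jackknife test referred to here is a leave-one-out procedure: each chain is predicted by a model trained on all remaining chains. The sketch below shows the loop under simple assumptions. A 1-nearest-neighbour rule stands in for the classifier purely to keep the example self-contained (MODAS itself uses SVMs), and the names `jackknife_accuracy` and `nearest_neighbour` are hypothetical.

```python
def jackknife_accuracy(features, labels, train_and_predict):
    """Leave-one-out accuracy in percent.

    `train_and_predict(train_x, train_y, query)` must return a predicted label.
    """
    correct = 0
    for i in range(len(features)):
        # Hold out sample i and train on everything else.
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if train_and_predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return 100.0 * correct / len(features)

def nearest_neighbour(train_x, train_y, query):
    """Stand-in classifier: label of the closest training vector (squared Euclidean)."""
    distances = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x]
    return train_y[distances.index(min(distances))]
```

Any classifier can be plugged in through `train_and_predict`; the cost is one training run per sample, which is why the jackknife test is usually applied to datasets of modest size.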
Results of the experimental comparison between the proposed MODAS method and competing structural class prediction methods on the D1189 dataset.
| Classifier used (name of the method, if any) | Feature vector | Reference | Accuracy | ||||
|---|---|---|---|---|---|---|---|
| α | β | α/β | α+β | overall | |||
| SVM | AA composition, autocorrelations, and physicochemical properties | 73 | - | - | - | - | 52.1 |
| Bayesian classifier | AA composition | 81 | 54.8 | 57.1 | 75.2 | 22.2 | 53.8 |
| Logistic regression | AA composition, autocorrelations, and physicochemical properties | 73 | 60.2 | 60.5 | 55.2 | 33.2 | 53.9 |
| SVM | AA and polypeptide composition, physicochemical properties | 45 | - | - | - | - | 54.7 |
| Nearest neighbor | Pseudo-amino acid composition | 67 | 48.9 | 59.5 | 81.7 | 26.6 | 56.9 |
| Ensemble | AA composition, autocorrelations, and physicochemical properties | 72 | - | - | - | - | 58.9 |
| Nearest neighbor | Composition of tripeptides | 52 | - | - | - | - | 59.9 |
| IB1 | PSI-BLAST based collocated AA pairs | 75 | 65.3 | 67.7 | 79.9 | 40.7 | 64.7 |
| Discriminant analysis | custom | 78 | 62.3 | 67.7 | 63.1 | 66.5 | 65.2 |
| SVM with RBF kernel (SCEC) | PSI-BLAST based collocated AA pairs | 75 | 75.8 | 75.2 | 82.6 | 31.8 | 67.6 |
| SVM with RBF kernel (SCPRED) | custom | 79 | 89.1 | 86.7 | 53.8 | 80.6 | |
| SVM with polynomial or RBF kernels (MODAS) | custom | this paper | 87.9 | ||||
The results were obtained using the jackknife test. The methods are ordered by their average accuracies. Best results are shown in bold, and "---" indicates results that were not reported by the original authors and that cannot be duplicated.
Results of the experimental comparison between the proposed MODAS method and competing structural class prediction methods on the D675 dataset.
| Classifier used (name of the method, if any) | Feature vector | Reference | Accuracy | ||||
|---|---|---|---|---|---|---|---|
| α | β | α/β | α+β | overall | |||
| Bayesian classifier | AA composition | 81 | 53.5 | 42.3 | 68.3 | 28.3 | 48.0 |
| IB1 | PSI-BLAST based collocated AA pairs | 75 | 54.9 | 47.4 | 68.9 | 35.0 | 51.5 |
| SVM with RBF kernel (SCEC) | PSI-BLAST based collocated AA pairs | 75 | 74.3 | 59.6 | 79.7 | 34.5 | 61.5 |
| SVM with RBF kernel (SCPRED) | custom | 79 | 89.1 | 58.2 | 79.5 | ||
| SVM with polynomial or RBF kernels (MODAS) | custom | this paper | 84.2 | ||||
The results were obtained using the jackknife test. The methods are ordered by their average accuracies. Best results are shown in bold.
Results of the experimental comparison between the proposed MODAS method and competing structural class prediction methods on the D498 dataset.
| Classifier used (name of the method, if any) | Feature vector | Reference | Accuracy | ||||
|---|---|---|---|---|---|---|---|
| α | β | α/β | α+β | Avg | |||
| Component-coupling | AA composition | 70 | 93.5 | 88.9 | 90.4 | 84.5 | 89.2 |
| Neural network | AA composition | 80 | 86.0 | 96.0 | 88.2 | 86.0 | 89.2 |
| Rough sets | AA composition and physicochemical properties | 49 | 87.9 | 91.3 | 86.0 | 90.8 | |
| SVM with RBF kernel (SCPRED) | custom | 79 | 94.9 | 91.7 | 94.2 | 86.1 | 91.5 |
| SVM | AA composition | 82 | 88.8 | 95.2 | 96.3 | 91.5 | 93.2 |
| Fuzzy k-nearest neighbor algorithm | protein sequence | 68 | 95.3 | 93.7 | 97.8 | 88.3 | 93.8 |
| Nearest Neighbor (NN-CDM) | protein sequence | 69 | 96.3 | 93.7 | 95.6 | 89.9 | 93.8 |
| LogitBoost | AA composition | 71 | 92.5 | 96.0 | 97.1 | 93.0 | 94.8 |
| SVM with RBF kernel (SCEC) | PSI-BLAST based p-collocated AA pairs | 75 | 93.3 | 95.6 | 93.4 | 94.9 | |
| IB1 | PSI-BLAST based p-collocated AA pairs | 75 | 95.0 | 95.8 | 97.8 | 94.2 | 95.7 |
| SVM with polynomial or RBF kernels (MODAS) | custom | this paper | 96.7 | 95.6 | |||
The results were obtained using the jackknife test. The methods are ordered by their average accuracies. Best results are shown in bold.
Results of the experimental comparison between the proposed MODAS and PseAA methods on the D2230 dataset.
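| Method | Accuracy | | | | | | | | MCC | | | | | | | GC2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | all-α | all-β | α/β | α+β | multi-domain | membrane | small | overall | all-α | all-β | α/β | α+β | multi-domain | membrane | small | |
| MODAS | 90.6 | 78.9 | 85.2 | 70.6 | 33.3 | 45.5 | 85.2 | 80.0 | 0.88 | 0.77 | 0.75 | 0.61 | 0.34 | 0.55 | 0.87 | 0.49 |
| PseAA | --- | --- | --- | --- | --- | --- | --- | 57.4 | --- | --- | --- | --- | --- | --- | --- | --- |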
The results were obtained using the jackknife test. "---" indicates results that were not reported by the original authors and that cannot be duplicated.