Lukasz Kurgan, Krzysztof Cios, Ke Chen.
Abstract
BACKGROUND: Protein structure prediction methods provide accurate results when a homologous protein is available as a template, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share only twilight-zone pairwise identity can form similar folds, and thus determining structural similarity without sequence similarity would be desirable for structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for datasets in which the sequence identity of any pair of sequences falls in the twilight zone. We propose SCPRED, a method that improves prediction accuracy for sequences that share twilight-zone pairwise similarity with the sequences used for the prediction.
Year: 2008 PMID: 18452616 PMCID: PMC2391167 DOI: 10.1186/1471-2105-9-226
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Rules for assignment of structural classes based on the content of the corresponding secondary structures.
| Reference | Structural class | α-helix content | β-strand content | Additional constraints |
|---|---|---|---|---|
| [13] | all-α | ≥ 40% | ≤ 5% | |
| | all-β | ≤ 5% | ≥ 40% | |
| | α+β | ≥ 15% | ≥ 15% | more than 60% antiparallel β-sheets |
| | α/β | ≥ 15% | ≥ 15% | more than 60% parallel β-sheets |
| [14] | all-α | > 15% | < 10% | |
| | all-β | < 15% | > 10% | |
| | mixed | > 15% | > 10% | |
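The [14]-style rules above translate directly into a thresholding procedure. A minimal sketch in Python (thresholds taken from the table; the parallel/antiparallel β-sheet constraint that [13] uses to separate α/β from α+β is omitted, since it requires knowledge of sheet topology):

```python
def assign_class_14(helix_pct, strand_pct):
    """Assign a structural class from secondary-structure content,
    following the [14]-style thresholds in the table above.
    helix_pct and strand_pct are the percentages of residues in
    alpha-helix and beta-strand conformation, respectively."""
    if helix_pct > 15 and strand_pct < 10:
        return "all-alpha"
    if helix_pct < 15 and strand_pct > 10:
        return "all-beta"
    if helix_pct > 15 and strand_pct > 10:
        return "mixed"
    return "unassigned"  # content below both thresholds
```

Chains whose helix and strand content both fall below the thresholds are left unassigned, which is why such rules cover only part of a dataset.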
Experimental comparison between SCPRED and competing structural class prediction methods.
| Test type | Algorithm | Feature vector (# features) | Reference | Acc. all-α | Acc. all-β | Acc. α/β | Acc. α+β | MCC all-α | MCC all-β | MCC α/β | MCC α+β |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Jackknife | SVM (Gaussian kernel) | CV (20) | [36] | 68.6 | 59.6 | 59.8 | 28.6 | 0.52 | 0.42 | 0.43 | 0.15 |
| | LogitBoost with decision tree | CV (20) | [23] | 56.9 | 51.5 | 45.4 | 30.2 | 0.41 | 0.32 | 0.32 | 0.06 |
| | Bagging with random tree | CV (20) | [34] | 58.7 | 47.0 | 35.5 | 24.7 | 0.33 | 0.26 | 0.22 | 0.06 |
| | LogitBoost with decision stump | CV (20) | | 62.8 | 52.6 | 50.0 | 32.4 | 0.49 | 0.35 | 0.34 | 0.11 |
| | SVM (3rd order polyn. kernel) | CV (20) | | 61.2 | 53.5 | 57.2 | 27.7 | 0.46 | 0.35 | 0.39 | 0.11 |
| | Multinomial logistic regression | custom dipeptides (16) | [28] | 56.2 | 44.5 | 41.3 | 18.8 | 0.23 | 0.20 | 0.31 | 0.06 |
| | Information discrepancy¹ | dipeptides (400) | [22, 24] | 59.6 | 54.2 | 47.1 | 23.5 | 0.46 | 0.40 | 0.24 | 0.04 |
| | Information discrepancy¹ | tripeptides (8000) | | 45.8 | 48.5 | 51.7 | 32.5 | 0.39 | 0.39 | 0.25 | 0.06 |
| | Multinomial logistic regression | custom (34) | [27] | 71.1 | 65.3 | 66.5 | 37.3 | 0.61 | 0.51 | 0.51 | 0.22 |
| | SVM with RBF kernel | custom (34) | | 69.7 | 62.1 | 67.1 | 39.3 | 0.60 | 0.50 | 0.53 | 0.21 |
| | StackingC ensemble | custom (34) | | 74.6 | 67.9 | 70.2 | 32.4 | 0.62 | 0.53 | 0.55 | 0.22 |
| | Multinomial logistic regression | custom (66) | [26] | 69.1 | 61.6 | 60.1 | 38.3 | 0.56 | 0.44 | 0.48 | 0.21 |
| | SVM (1st order polyn. kernel) | autocorrelation (30) | | 50.1 | 49.4 | 28.8 | 29.5 | 0.16 | 0.16 | 0.05 | 0.05 |
| | SVM (1st order polyn. kernel) | custom (58) | [29] | 77.4 | 66.4 | 61.3 | 45.4 | 0.65 | 0.54 | 0.55 | 0.27 |
| | Linear logistic regression | custom (58) | | 75.2 | 67.5 | 62.1 | 44.0 | 0.63 | 0.54 | 0.54 | 0.27 |
| | SVM (Gaussian kernel) | PSI-PRED based (13) | this paper | 92.6 | 79.8 | 74.9 | 69.0 | 0.87 | 0.79 | 0.68 | 0.55 |
| | SVM (Gaussian kernel) | custom (8 PSI-PRED based) | this paper | 92.6 | 80.6 | 73.4 | 68.5 | 0.87 | 0.79 | 0.67 | 0.54 |
| | SVM (Gaussian kernel) | custom (9) | this paper | 92.6 | 80.1 | 74.0 | 71.0 | 0.87 | 0.79 | 0.69 | 0.57 |
| 10-fold cross validation | SVM (Gaussian kernel) | CV (20) | [36] | 67.9 | 59.1 | 58.1 | 27.7 | 0.51 | 0.42 | 0.41 | 0.14 |
| | LogitBoost with decision tree | CV (20) | [23] | 51.9 | 53.7 | 46.5 | 32.4 | 0.38 | 0.37 | 0.31 | 0.07 |
| | Bagging with random tree | CV (20) | [34] | 53.5 | 51.0 | 37.6 | 22.0 | 0.28 | 0.30 | 0.22 | 0.04 |
| | LogitBoost with decision stump | CV (20) | | 63.2 | 53.5 | 50.9 | 32.4 | 0.48 | 0.36 | 0.36 | 0.12 |
| | SVM (3rd order polyn. kernel) | CV (20) | | 61.4 | 54.0 | 55.2 | 27.4 | 0.46 | 0.35 | 0.37 | 0.10 |
| | Multinomial logistic regression | custom dipeptides (16) | [28] | 56.9 | 44.2 | 42.2 | 17.7 | 0.24 | 0.20 | 0.32 | 0.04 |
| | Multinomial logistic regression | custom (34) | [27] | 69.9 | 65.3 | 66.5 | 38.4 | 0.60 | 0.52 | 0.51 | 0.23 |
| | SVM with RBF kernel | custom (34) | | 70.2 | 61.6 | 67.6 | 39.6 | 0.60 | 0.49 | 0.53 | 0.22 |
| | StackingC ensemble | custom (34) | | 73.4 | 67.3 | 69.1 | 29.8 | 0.59 | 0.52 | 0.54 | 0.18 |
| | Multinomial logistic regression | custom (66) | [26] | 69.1 | 60.5 | 59.5 | 38.1 | 0.56 | 0.44 | 0.48 | 0.20 |
| | SVM (1st order polyn. kernel) | autocorrelation (30) | | 52.4 | 49.7 | 0.3 | 30.4 | 0.18 | 0.16 | 0.05 | 0.06 |
| | SVM (1st order polyn. kernel) | custom (58) | [29] | 77.7 | 66.8 | 60.7 | 45.4 | 0.64 | 0.54 | 0.54 | 0.28 |
| | Linear logistic regression | custom (58) | | 74.7 | 66.4 | 62.7 | 45.8 | 0.63 | 0.54 | 0.54 | 0.27 |
| | SVM (Gaussian kernel) | PSI-PRED based (13) | this paper | 93.2 | 79.5 | 75.7 | 69.4 | 0.87 | 0.79 | 0.70 | 0.55 |
| | SVM (Gaussian kernel) | custom (8 PSI-PRED based) | this paper | 92.5 | 80.4 | 73.7 | 68.0 | 0.87 | 0.79 | 0.67 | 0.54 |
| | SVM (Gaussian kernel) | custom (9) | this paper | 92.8 | 80.6 | 74.3 | 71.4 | 0.87 | 0.79 | 0.70 | 0.57 |
¹ This method was not originally tested using 10-fold cross-validation, and thus we likewise do not report those results.
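The MCC values in the table above are per-class Matthews correlation coefficients, computed by treating each structural class as a one-vs-rest binary problem. A minimal sketch (the standard MCC formula, not code from the paper):

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for one structural class
    treated as a one-vs-rest binary prediction problem, computed
    from the true/false positive/negative counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; 0.0 is a common convention
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC ranges from -1 to 1, with 1 for perfect prediction and 0 for a prediction no better than chance, which makes it more informative than accuracy on the imbalanced class distributions typical of structural class datasets.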
Experimental comparison between SCPRED and structural class assignment methods based on the secondary structure predicted with PSI-PRED.
| Prediction/assignment method | Acc. all-α | Acc. all-β | Acc. mixed | Overall accuracy |
|---|---|---|---|---|
| [13] | 78.8 | 30.2 | | 66.7 |
| [14] | 91.6 | 73.1 | | 86.8 |
| SCPRED (10-fold cross validation) | 92.8 | 80.6 | | 89.2 |
| SCPRED (jackknife) | 92.6 | 80.1 | | 88.9 |
Comparison of accuracy when predicting the structural classes using all features, each feature individually, and when excluding one feature at a time.
| Feature set | Feature | Acc. all-α | Acc. all-β | Acc. α/β | Acc. α+β | Overall |
|---|---|---|---|---|---|---|
| All features included | | 92.8 | 80.6 | 74.3 | 71.4 | 80.1 |
| Using only one feature | PSIPRED- | 58.7 | 32.9 | 46.7 | 58.4 | |
| | | 81.9 | 53.8 | | | |
| | PSIPRED- | 76.3 | 74.5 | 48.8 | 64.3 | |
| | PSIPRED- | 49.9 | 0.0 | 47.8 | 47.9 | |
| | PSIPRED- | 85.8 | 59.1 | 50.9 | 47.4 | 61.4 |
| | | 71.3 | 51.0 | | | |
| | PSIPRED- | 83.1 | 48.8 | 0.0 | 52.6 | |
| | PSIPRED- | 79.2 | 33.9 | 3.2 | 42.4 | 41.8 |
| | | 73.8 | 0.0 | 54.3 | 7.7 | 32.8 |
| Excluding the listed feature | PSIPRED- | 92.1 | 79.5 | 71.7 | 70.8 | 78.9 |
| | PSIPRED- | 93.0 | 79.5 | 73.1 | 70.3 | 79.3 |
| | PSIPRED- | 92.5 | 80.8 | 72.5 | 71.0 | 79.6 |
| | PSIPRED- | 92.5 | 81.0 | 71.4 | 68.7 | 78.8 |
| | PSIPRED- | 90.7 | 80.1 | 73.4 | 71.4 | 79.3 |
| | PSIPRED- | 91.9 | 80.6 | 72.5 | 71.4 | 79.5 |
| | PSIPRED- | 92.8 | 79.7 | 73.1 | 69.8 | 79.2 |
| | PSIPRED- | 92.3 | 80.6 | 71.1 | 69.2 | 78.7 |
| | | 92.8 | 80.6 | 73.4 | 68.9 | 79.3 |
Bold font marks the two highest accuracies obtained with individual features and the features selected for further analysis.
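The three evaluation modes in the table (all features, each feature alone, one feature excluded at a time) amount to a simple ablation loop. An illustrative sketch, where `evaluate` stands in for any accuracy estimator (e.g. cross-validated SVM accuracy) and is not from the paper:

```python
def ablation_study(features, evaluate):
    """Score the full feature set, each feature alone, and the set
    with one feature excluded at a time, mirroring the table above.
    `evaluate` maps a list of feature names to an accuracy estimate."""
    results = {"all": evaluate(features)}
    for f in features:
        results["only " + f] = evaluate([f])
        results["without " + f] = evaluate([x for x in features if x != f])
    return results
```

Comparing the "only" and "without" scores against the full-set score shows whether a feature carries unique information or is largely redundant with the rest.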
Figure 1. Scatter plots of PSIPRED-. The top-left plot corresponds to sequences belonging to the all-α class, the top-right to the all-β class, the bottom-left to the α/β class, and the bottom-right to the α+β class.
Comparison of accuracies obtained by PFRES, PFP, and the coupled PFRES+SCPRED and PFP+SCPRED methods on the FC699 dataset.
| Coupled method | Entire FC699 dataset: fold | Entire FC699 dataset: class | SCPRED class | Only kept sequences: fold | Only kept sequences: class | Only removed sequences: fold | Only removed sequences: class | Coverage (% kept) |
|---|---|---|---|---|---|---|---|---|
| PFRES + SCPRED | 65.6 | 92.1 | 87.5 | 68.6 | 96.7 | 45.7 | 62.8 | 86.6% |
| PFP + SCPRED | 30.9 | 65.8 | 87.5 | 47.3 | 97.0 | 3.8 | 14.1 | 62.4% |
The "Entire FC699 dataset" columns show accuracies of the PFRES, SCPRED, and PFP methods for class/fold prediction on the FC699 dataset. The "Only kept sequences" columns show accuracies obtained by the PFRES and PFP methods for sequences for which SCPRED predicted the same structural class as PFRES and PFP, respectively. The "Only removed sequences" columns show accuracies obtained by the PFRES and PFP methods for sequences for which SCPRED predicted a different structural class than PFRES and PFP, respectively. The "Coverage" column shows the percentage of sequences for which SCPRED and PFRES/PFP predicted the same structural class.
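The coupling scheme behind these numbers keeps a fold prediction only when SCPRED and the fold predictor (PFRES or PFP) agree on the structural class, and reports the fraction of sequences kept as coverage. A sketch with illustrative argument names, not the authors' code:

```python
def couple_predictions(scpred_classes, fold_pred_classes, fold_preds):
    """Keep fold predictions only where SCPRED and the fold predictor
    assign the same structural class; return the kept predictions and
    the coverage (fraction of sequences kept). All arguments are
    parallel per-sequence lists."""
    kept = [fold for sc, fc, fold in
            zip(scpred_classes, fold_pred_classes, fold_preds)
            if sc == fc]
    return kept, len(kept) / len(fold_preds)
```

The trade-off visible in the table follows directly: filtering to agreeing sequences raises accuracy on the kept subset at the cost of leaving the remaining sequences unpredicted.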
Values of the AA indices, which include the average isoelectric point (pI), the Fauchère-Pliska (FH) and Eisenberg (EH) hydrophobicity indices, and the hydropathy (Hp) and relative side-chain mass (M) indices.
| Name | Code | Index | pI | FH | EH | Hp | M |
|---|---|---|---|---|---|---|---|
| Alanine | A | 1 | 6.01 | 0.42 | 0.62 | 1.8 | 0.115 |
| Cysteine | C | 2 | 5.07 | 1.34 | 0.29 | 2.5 | 0.777 |
| Aspartate | D | 3 | 2.77 | -1.05 | -0.9 | -3.5 | 0.446 |
| Glutamate | E | 4 | 3.22 | -0.87 | -0.74 | -3.5 | 0.446 |
| Phenylalanine | F | 5 | 5.48 | 2.44 | 1.19 | 2.8 | 0.36 |
| Glycine | G | 6 | 5.97 | 0 | 0.48 | -0.4 | 0.55 |
| Histidine | H | 7 | 7.59 | 0.18 | -0.4 | -3.2 | 0.55 |
| Isoleucine | I | 8 | 6.02 | 2.46 | 1.38 | 4.5 | 0.00076 |
| Lysine | K | 9 | 9.74 | -1.35 | -1.5 | -3.9 | 0.63 |
| Leucine | L | 10 | 5.98 | 2.32 | 1.06 | 3.8 | 0.13 |
| Methionine | M | 11 | 5.47 | 1.68 | 0.64 | 1.9 | 0.13 |
| Asparagine | N | 12 | 5.41 | -0.82 | -0.78 | -3.5 | 0.48 |
| Proline | P | 13 | 6.48 | 0.98 | 0.12 | -1.6 | 0.577 |
| Glutamine | Q | 14 | 5.65 | -0.3 | -0.85 | -3.5 | 0.7 |
| Arginine | R | 15 | 10.76 | -1.37 | -2.53 | -4.5 | 0.323 |
| Serine | S | 16 | 5.68 | -0.05 | -0.18 | -0.8 | 0.238 |
| Threonine | T | 17 | 5.87 | 0.35 | -0.05 | -0.7 | 0.346 |
| Valine | V | 18 | 5.97 | 1.66 | 1.08 | 4.2 | 1 |
| Tryptophan | W | 19 | 5.89 | 3.07 | 0.81 | -0.9 | 0.82 |
| Tyrosine | Y | 20 | 5.67 | 1.31 | 0.26 | -1.3 | 0.33 |
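Per-residue indices like these are typically turned into sequence-level features by averaging the index value over the chain. An illustrative sketch using a few pI values from the table (averaging is one common use of such indices, not necessarily the paper's exact feature definition):

```python
# A few average isoelectric point (pI) values from the table above.
PI = {"A": 6.01, "C": 5.07, "D": 2.77, "E": 3.22, "G": 5.97}

def average_index(sequence, index):
    """Mean per-residue index value over a protein sequence;
    residues absent from the lookup table are skipped."""
    values = [index[aa] for aa in sequence if aa in index]
    return sum(values) / len(values) if values else 0.0
```

The same loop works for any of the five indices; swapping the lookup table yields a different scalar descriptor of the same sequence.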
Summary of the feature selection results.
| Feature set | all | after step 1 | after step 2 |
|---|---|---|---|
| Length | 1 | 0 | 0 |
| Index-based | 50 | 5 | 0 |
| CV and CMV | 60 | 2 | 0 |
| CV for collocated AAs | 2000 | 4 | 1 |
| Property group-based | 35 | 1 | 0 |
| Predicted secondary structure content | 4 | 2 | 0 |
| Predicted secondary structure-based, with PSI-PRED | 86 | 27 | 8 |
| Predicted secondary structure-based, with YASPIN | 86 | 12 | 0 |
| Total # of features | 2322 | 53 | 9 |
| 10-fold cross-validation accuracy on the 25PDB dataset | 73.2% | 80.2% | 80.1% |
The "feature set" column defines categories of the considered features; the "all" column shows the total number of features in a given category, while the "after step 1" and "after step 2" columns show the number of features from that category that were selected in steps 1 and 2 of the feature selection procedure, respectively.