Yi Shi, Jianjun Zhou, David Arndt, David S Wishart, Guohui Lin.
Abstract
BACKGROUND: Contact order is a topological descriptor that has been shown to be correlated with several interesting protein properties such as protein folding rates and protein transition state placements. Contact order has also been used to select for viable protein folds from ab initio protein structure prediction programs. For proteins of known three-dimensional structure, their contact order can be calculated directly. However, for proteins with unknown three-dimensional structure, there is no effective prediction method currently available.
Year: 2008 PMID: 18513429 PMCID: PMC2440764 DOI: 10.1186/1471-2105-9-255
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
SCOP classification of the 933 training monomeric proteins
| SCOP label | Quantity |
| All alpha proteins | 83 |
| All beta proteins | 89 |
| Alpha and beta proteins | 750 |
| Peptides | 10 |
| unstructured | 1 |
The SCOP classification of the 933 monomeric proteins used for regression.
Figure 1. The length distribution of the 933 monomeric proteins used for regression.
Figure 2. Scatter plot of the actual Abs_CO values for 742 of the 1,000 testing sequences versus the Abs_CO values predicted from the top homology hit. The 742 sequences were obtained by setting the length difference threshold at 40% and the sequence identity threshold at 20%.
Figure 3. Scatter plot for the sequence homology-based method, showing percent correct versus sequence identity for the 742 pairs of testing sequences and their corresponding homologs. A 5-fold cross-validation was performed with 5 samples of 200 sequences each, and the combined results for all of the 742 sequences with homologs are plotted here. Sequence identity was computed as the number of identical residues divided by the query sequence length.
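The sequence identity definition in the Figure 3 caption (identical residues divided by query sequence length) can be illustrated with a minimal sketch. The function name `sequence_identity`, the use of pre-aligned strings of equal length, and the `-` gap character are assumptions made for illustration; the caption does not say how the alignments were produced.

```python
def sequence_identity(aligned_query: str, aligned_hit: str, gap: str = "-") -> float:
    """Identical residues divided by the (ungapped) query length.

    Assumption: both inputs come from the same pairwise alignment, so they
    have equal length and use `gap` to mark insertions and deletions.
    """
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have the same length")

    # Query length counts residues only, not gap columns.
    query_length = sum(1 for q in aligned_query if q != gap)
    if query_length == 0:
        raise ValueError("query contains no residues")

    # A column counts as identical only when both sides hold the same residue.
    identical = sum(
        1 for q, h in zip(aligned_query, aligned_hit) if q == h and q != gap
    )
    return identical / query_length


if __name__ == "__main__":
    # Toy alignment, not data from the study.
    print(sequence_identity("ACDE-FG", "ACDQWFG"))
```

Dividing by the query length rather than the alignment length means that a short, well-matched region of a long query still yields a low identity, which is consistent with using identity together with a length difference threshold.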
Figure 4. Scatter plot for the sequence homology-based method, showing percent correct versus RMSD between the structure of the query protein and the structure of the homologous protein. The experiment setting is the same as for generating Figure 3. In the plot, the red crosses show the average percent correct for their nearby points within ± 0.5 Å.
Performances of all 9 regression-based Abs_CO prediction methods
| | Correlation Coefficient | | | Average Percent Correct | | | Standard Deviation | | |
| Method | LR | SVR | NN | LR | SVR | NN | LR | SVR | NN |
| Method 1 (F3) | 0.8571 | 0.8583 | 0.8513 | 0.8630 | 0.8659 | 0.8543 | 0.1099 | 0.1060 | 0.1445 |
| Method 1 (F7) | 0.8603 | 0.8543 | 0.8737 | 0.8636 | 0.8681 | 0.8656 | 0.1091 | 0.1094 | 0.1110 |
| Method 1 (F27) | 0.8702 | 0.8658 | 0.7713 | 0.8659 | 0.8717 | 0.8086 | 0.1060 | 0.1291 | 1.2222 |
| Method 2 (F3) | 0.8477 | 0.8440 | 0.8680 | 0.8577 | 0.8664 | 0.8612 | 0.1179 | 0.1129 | 0.1073 |
| Method 2 (F7) | 0.8499 | 0.8455 | 0.8667 | 0.8570 | 0.8669 | 0.8656 | 0.1161 | 0.1142 | 0.1110 |
| Method 2 (F27) | 0.8662 | 0.8550 | 0.8233 | 0.8620 | 0.8714 | 0.7016 | 0.1105 | 0.1167 | 3.3827 |
| Method 3 (F3) | 0.8375 | 0.8378 | 0.8439 | 0.8542 | 0.8647 | 0.8560 | 0.1200 | 0.1168 | 0.1103 |
| Method 3 (F7) | 0.8421 | 0.8381 | 0.8636 | 0.8537 | 0.8637 | 0.8648 | 0.1192 | 0.1165 | 0.1156 |
| Method 3 (F27) | 0.8625 | 0.8512 | 0.7706 | 0.8605 | 0.8695 | 0.8106 | 0.1126 | 0.1194 | 0.3662 |
The correlation coefficients, the average percent correct values, and the standard deviations of the percent correct values for all 9 regression-based Abs_CO prediction methods. The first set of results (rows 3–5) is for Method 1, in which the percentages of residues in alpha helices, p(α), and in beta strands, p(β), are used as two factors. For Method 1, the F3-LR regression formula is Abs_CO = -6.8968p(α) + 7.6216p(β) + 0.0612L + 8.0397. The second set of results (rows 6–8) is for Method 2, in which the numbers of residues in alpha helices, q(α), and in beta strands, q(β), are used as two factors. For Method 2, the F3-LR regression formula is Abs_CO = -0.0591q(α) + 0.0789L + 8.3774. The third set of results (rows 9–11) is for Method 3, in which the numbers of alpha helices, n(α), and beta strands, n(β), are used as two factors. For Method 3, the F3-LR regression formula is Abs_CO = -0.4184n(α) + 0.1992n(β) + 0.0611L + 8.647.
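As an illustration of how the F3-LR formulas in this caption would be evaluated, the following minimal sketch codes the Method 1 and Method 3 formulas with the coefficients exactly as printed. The function names are hypothetical. Two further assumptions are made: L is taken to be the sequence length in residues, and p(α) and p(β) are taken to be fractions in [0, 1] rather than values on a 0 to 100 scale.

```python
def abs_co_method1_f3_lr(p_alpha: float, p_beta: float, length: int) -> float:
    """Method 1, F3-LR: coefficients copied from the table caption.

    Assumption: p_alpha and p_beta are fractions of residues in [0, 1],
    and `length` is the sequence length L in residues.
    """
    return -6.8968 * p_alpha + 7.6216 * p_beta + 0.0612 * length + 8.0397


def abs_co_method3_f3_lr(n_alpha: int, n_beta: int, length: int) -> float:
    """Method 3, F3-LR: coefficients copied from the table caption.

    n_alpha and n_beta are the numbers of alpha helices and beta strands;
    `length` is assumed to be the sequence length L in residues.
    """
    return -0.4184 * n_alpha + 0.1992 * n_beta + 0.0611 * length + 8.647


if __name__ == "__main__":
    # Hypothetical 150-residue protein, not a case from the study.
    print(abs_co_method1_f3_lr(p_alpha=0.3, p_beta=0.2, length=150))
    print(abs_co_method3_f3_lr(n_alpha=4, n_beta=6, length=150))
```

In both formulas the alpha-helix term has a negative coefficient and the beta-strand term a positive one, so for a fixed length a more helical protein receives a lower predicted Abs_CO.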
Figure 5. Scatter plot of the actual versus the predicted Abs_CO values by F3-LR (under the 5-fold cross-validation scheme).
Figure 6. Scatter plot of the actual versus the predicted Abs_CO values by F27-LR (under the 5-fold cross-validation scheme).