Pin-Hao Chi, Chi-Ren Shyu, Dong Xu.
Abstract
BACKGROUND: Domain experts manually construct the Structural Classification of Proteins (SCOP) database to categorize and compare protein structures. Although the SCOP database is believed to be more reliable than classification results from other methods, constructing it is labor intensive. To mimic the human classification process, we develop an automatic SCOP fold classification system to assign possible known SCOP folds and recognize novel folds for newly discovered proteins.
Year: 2006 PMID: 16872501 PMCID: PMC1579235 DOI: 10.1186/1471-2105-7-362
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Table 1. A test set that contains 37 protein chains from [13].
| 1 | 63569 | 1 | 48370 | 1 | 48370 | 1 | 48370 | 1 | 48370 |
| 1 | 48370 | 1 | 48370 | 1 | 48725 | 1 | 48725 | 1 | 48725 |
| 1 | 48725 | 1 | 48725 | 1 | 48725 | 1 | 48725 | 1 | 48725 |
| 1 | 48725 | 1 | 50036 | 1 | 50875 | 1 | 50933 | 1 | 50933 |
| 1 | 50933 | 1 | 50933 | 1 | 50933 | 1 | 50933 | 1 | 50933 |
| 1 | 50933 | 1 | 50964 | 1 | 50964 | 1 | 50964 | 1 | 50964 |
| 1 | 50964 | 1 | 50964 | 1 | 50964 | 1 | 50964 | 1 | 51350 |
| 1 | 55199 | 1 | 56234 |
Figure 1. The Correct Classification Rate of assigning the known folds for test proteins in Table 1.
Table 2. The number of proteins in the test set of novel folds and in the general and non-redundant test sets, which are selected from the known SCOP folds of v2 with at least one protein chain in v1.
| test set | size | test set | size | test set | size |
| 4192 | 442 | - | - | ||
| 4047 | 431 | 94 | |||
| 4547 | 468 | 10 | |||
| 5226 | 491 | 190 | |||
| 5445 | 494 | 48 | |||
| 10521 | 736 | 215 | |||
| 5604 | 585 | 86 |
Table 3. The number of proteins in the general and non-redundant test sets, which are selected from the known SCOP folds of v2 with at least 10 protein chains in v1.
| test set | size | test set | size |
| 1832 | 158 | ||
| 1901 | 168 | ||
| 2136 | 166 | ||
| 1947 | 189 | ||
| 2062 | 198 | ||
| 4735 | 302 | ||
| 2298 | 263 |
Figure 2. The Correct Classification Rate of assigning the known folds for various SCOP releases using E-Predict on (a) the general and non-redundant test sets selected from the known SCOP folds of v2 with at least one protein chain in v1 (Table 2) and (b) the general and non-redundant test sets selected from the known SCOP folds of v2 with at least 10 protein chains in v1 (Table 3).
Figure 3. The number of proteins per fold against the number of SCOP folds in the SCOP v1.69 release.
The sequence redundancy in a set that contains 10 pairs of proteins, which are randomly sampled from
| | | | | | sequence identity | sequence similarity |
| 01 | 1 | 55008 | 1 | 110997 | 2.10% | 3.50% |
| 02 | 1 | 82708 | 1 | 82704 | 12.80% | 26.80% |
| 03 | 1 | 57889 | 1 | 75471 | 13.60% | 23.50% |
| 04 | 1 | 103196 | 1 | 103247 | 22.40% | 34.20% |
| 05 | 1 | 55724 | 1 | 55797 | 6.80% | 10.80% |
| 06 | 1 | 56925 | 1 | 55961 | 18.10% | 28.40% |
| 07 | 1 | 55826 | 1 | 55846 | 17.70% | 30.50% |
| 08 | 1 | 55781 | 1 | 55676 | 10.30% | 17.50% |
| 09 | 1 | 55895 | 1 | 55469 | 9.00% | 14.70% |
| 10 | 1 | 55931 | 1 | 55909 | 12.70% | 21.80% |
| | | | | | Avg. 12.55% | Avg. 21.17% |
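The averages in the last row can be checked directly from the ten pairs listed above. The short sketch below only recomputes the two arithmetic means from the percentages in the table; the variable and function names are illustrative.

```python
# Sequence identity and similarity (percent) for the 10 sampled pairs,
# copied row by row from the table above.
identity = [2.10, 12.80, 13.60, 22.40, 6.80, 18.10, 17.70, 10.30, 9.00, 12.70]
similarity = [3.50, 26.80, 23.50, 34.20, 10.80, 28.40, 30.50, 17.50, 14.70, 21.80]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Round to two decimals, the precision used in the table.
avg_identity = round(mean(identity), 2)
avg_similarity = round(mean(similarity), 2)

print(f"Avg. identity:   {avg_identity:.2f}%")
print(f"Avg. similarity: {avg_similarity:.2f}%")
```

Both values agree with the reported averages of 12.55% and 21.17%.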
Figure 4. The Correct Classification Rates of recognizing the novel SCOP folds for proteins in various SCOP releases.
Figure 5. The protein chain sizes against the average response time of classifying test proteins.
Figure 6. Correct Classification Rates of classifying test proteins against structural variation values.
Figure 7. E-Predict model for assigning newly-discovered proteins to the known folds.
Figure 8. The 3-D backbone structures and distance matrices of four protein chains, which are selected from the SCOP folds: (1) Heme-dependent peroxidases: 1kta_A (a-b), 1ekv_A (c-d), (2) Acid proteases: 1lee_A (e-f), 1lf2_A (g-h).
Local features of proteins from the SCOP folds: (1) Heme-dependent peroxidases: 1stq_A, 1sog_A, (2) Acid proteases: 1lee_A, 1lf2_A. Histogram [a,b] denotes the distance histogram for the a-th band region and the b-th grayscale bin.
| Image Features | 1stq_A | 1sog_A | 1lee_A | 1lf2_A |
| Histogram [1,1] | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Histogram [1,2] | 0.0002 | 0.0002 | 0.0000 | 0.0000 |
| Histogram [1,3] | 0.0018 | 0.0020 | 0.0001 | 0.0002 |
| Histogram [1,4] | 0.0050 | 0.0053 | 0.0009 | 0.0011 |
| Histogram [2,1] | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Histogram [2,2] | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Histogram [2,3] | 0.0023 | 0.0022 | 0.0000 | 0.0001 |
| Histogram [2,4] | 0.0044 | 0.0043 | 0.0012 | 0.0010 |
| Histogram [3,1] | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Histogram [3,2] | 0.0004 | 0.0004 | 0.0003 | 0.0004 |
| Histogram [3,3] | 0.0020 | 0.0019 | 0.0017 | 0.0019 |
| Histogram [3,4] | 0.0092 | 0.0080 | 0.0048 | 0.0055 |
| Histogram [4,1] | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Histogram [4,2] | 0.0006 | 0.0006 | 0.0015 | 0.0012 |
| Histogram [4,3] | 0.0040 | 0.0042 | 0.0056 | 0.0053 |
| Histogram [4,4] | 0.0132 | 0.0130 | 0.0172 | 0.0166 |
| Histogram [5,1] | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Histogram [5,2] | 0.0014 | 0.0015 | 0.0036 | 0.0035 |
| Histogram [5,3] | 0.0124 | 0.0128 | 0.0133 | 0.0134 |
| Histogram [5,4] | 0.0423 | 0.0425 | 0.0298 | 0.0304 |
| Histogram [6,1] | 0.0203 | 0.0201 | 0.0179 | 0.0180 |
| Histogram [6,2] | 0.0392 | 0.0386 | 0.0291 | 0.0289 |
| Histogram [6,3] | 0.0503 | 0.0496 | 0.0474 | 0.0485 |
| Histogram [6,4] | 0.0845 | 0.0833 | 0.0796 | 0.0795 |
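To make the Histogram [a,b] notation concrete, here is a minimal sketch of how a band-by-bin distance histogram could be computed from backbone coordinates. It is not the authors' implementation: the equal-width diagonal bands, the equal-width distance bins, the cutoff `max_dist`, the ordering of bands and bins, and the function names are all assumptions made for illustration. Only the 6 x 4 layout is taken from the table above.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between backbone atom coordinates."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def band_histograms(dist, n_bands=6, n_bins=4, max_dist=40.0):
    """Fraction of matrix cells falling in each (band, distance bin) pair.

    Assumed layout: the band index grows with sequence separation |i - j|
    and the bin index grows with distance; both partitions are equal-width.
    """
    n = len(dist)
    hist = [[0.0] * n_bins for _ in range(n_bands)]
    total = n * n
    for i in range(n):
        for j in range(n):
            band = min(abs(i - j) * n_bands // n, n_bands - 1)
            clipped = min(dist[i][j], max_dist)  # distances beyond the cutoff go to the last bin
            bin_index = min(int(clipped / max_dist * n_bins), n_bins - 1)
            hist[band][bin_index] += 1.0 / total
    return hist


# Toy input: 12 points on a straight line, 3.8 units apart (hypothetical chain).
coords = [(3.8 * k, 0.0, 0.0) for k in range(12)]
hist = band_histograms(distance_matrix(coords))
for a, row in enumerate(hist, start=1):
    print(a, [round(v, 4) for v in row])
```

Because each cell contributes 1/n² to exactly one entry, the 24 entries sum to 1, so the values are comparable across chains of different length.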
Global features of proteins from the SCOP folds: (1) Heme-dependent peroxidases: 1stq_A, 1sog_A, (2) Acid proteases: 1lee_A, 1lf2_A.
| Image Features | 1stq_A | 1sog_A | 1lee_A | 1lf2_A |
| Dimension | 291.00 | 294.00 | 331.00 | 329.00 |
| Binary_Threshold | 23.0000 | 23.0000 | 26.0000 | 26.0000 |
| Texture_Energy | 0.0155 | 0.0153 | 0.0107 | 0.0107 |
| Texture_Entropy | 51.7143 | 51.8067 | 54.4426 | 54.4139 |
| Texture_Homogenity1 | 2.5344 | 2.5261 | 2.2184 | 2.2192 |
| Texture_Homogenity2 | 1.7608 | 1.7529 | 1.4467 | 1.4485 |
| Texture_Contrast | 0.0027 | 0.0027 | 0.0041 | 0.0041 |
| Texture_Correlation | 6.8883 | 6.8914 | 6.7682 | 6.7659 |
| Texture_Cluster_Tendency | 0.0387 | 0.0392 | 0.0517 | 0.0515 |
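The texture features listed above have the names of standard gray-level co-occurrence statistics. The sketch below computes four of them (energy, entropy, contrast, homogeneity) with the textbook definitions on a tiny quantized image. This is an assumption for illustration: the record does not state the exact formulas, pixel offset, number of gray levels, or scaling the authors used, so the numbers will not match the table.

```python
import math


def cooccurrence(image, levels):
    """Symmetric, normalized co-occurrence matrix for horizontally adjacent pixels."""
    counts = [[0.0] * levels for _ in range(levels)]
    total = 0
    for row in image:
        for a, b in zip(row, row[1:]):
            counts[a][b] += 1
            counts[b][a] += 1  # count each pair in both directions
            total += 2
    return [[c / total for c in row] for row in counts]


def texture_features(p):
    """Textbook co-occurrence statistics of a normalized matrix p."""
    cells = [(i, j, p[i][j]) for i in range(len(p)) for j in range(len(p))]
    return {
        "energy": sum(v * v for _, _, v in cells),
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "homogeneity": sum(v / (1 + abs(i - j)) for i, j, v in cells),
    }


# Toy 4x4 image with 4 gray levels (hypothetical; a real input would be a
# quantized distance matrix).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
features = texture_features(cooccurrence(image, levels=4))
for name, value in features.items():
    print(f"{name}: {value:.4f}")
```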
Figure 9. A comparison of classification performance between E-Predict, NN, 3-NN, 5-NN, and C4.5 DT classifiers using (a) test proteins selected from the SCOP folds in v2 that have at least one protein in v1 and (b) test proteins selected from the SCOP folds in v2 that have at least 10 proteins in v1.
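For readers unfamiliar with the nearest-neighbor baselines named in this comparison, the following is a minimal sketch of generic k-NN fold assignment. It is not E-Predict and not the authors' baseline code: the choice of three unscaled features, the Euclidean distance, and the function name `knn_predict` are assumptions. The feature values are copied from the global-feature table above.

```python
import math
from collections import Counter

# (Dimension, Binary_Threshold, Texture_Entropy) from the global-feature table.
references = [
    ((291.00, 23.0, 51.7143), "Heme-dependent peroxidases"),  # 1stq_A
    ((331.00, 26.0, 54.4426), "Acid proteases"),              # 1lee_A
    ((329.00, 26.0, 54.4139), "Acid proteases"),              # 1lf2_A
]
query = (294.00, 23.0, 51.8067)  # 1sog_A, treated here as the unlabeled protein


def knn_predict(query, references, k=1):
    """Majority fold label among the k reference proteins closest to the query."""
    ranked = sorted(references, key=lambda ref: math.dist(query, ref[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


predicted = knn_predict(query, references, k=1)
print(predicted)
```

With k=1 the query is assigned the fold of its closest reference chain, 1stq_A. With only three references, a larger k would let the two acid proteases outvote it, which illustrates why the choice of k matters for sparsely populated folds.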
Figure 10. An example of E_Measure calculations for two SCOP folds in a list of nearest neighbor proteins.
Figure 11. E-Predict model for recognizing the novel folds for newly-discovered proteins.
Figure 12. An example of identifying a newly-discovered protein P in the novel folds by selecting the nearest neighbor protein in a fold F* derived from the E-Predict algorithm.
A comparison of the three features for proteins in the novel folds and the known folds.
| ( | ( | ( | |
| High | High | High | |
| Low | Low | Low |
Figure 13. The superimposition of a newly-discovered protein and a known protein chain from the top ranked SCOP fold.
Appendix 1 E-Predict Algorithm
| 1: ∏ = ∅ |
| 2: |
| 3: |
| 4: ∏ = ∏ ∪ { |
| 5: |
| 6: |
| 7: |
| 8: |
| 9: |
| 10: |
| 11: |
| 12: ∏ ← ∏ - { |
| 13: |
| 14: |
| 15: |
| 16: |
| 17: |
| 18: |
| 19: |
| 20: |
| 21: |
| 22: |
| 23: |
| 24: |
| 25: |
| 26: |
| 27: |
| 28: |
| 29: F* ← |
| 30: |
| 31: |