Jieun Jeong, Piotr Berman, Teresa Przytycka.
Abstract
BACKGROUND: It has been proposed that secondary structure information can be used to classify protein folds, at least to some extent. Since this method uses very limited information about the protein structure, it is not surprising that it has a higher error rate than approaches that use a full 3D fold description. On the other hand, comparing 3D protein structures is computationally intensive. This raises the question of to what extent the error rate can be decreased with each new source of information, especially if the new information can still be used with simple alignment algorithms. We consider whether information about closed loops can improve the accuracy of this approach. While the answer appears to be obvious, we had to overcome two challenges: first, how to encode and compare topological information in such a way that local alignment of strings properly identifies similar structures; second, how to properly measure the effect of new information in a large data sample. We investigate alternative ways of computing and presenting this information.
Year: 2006 PMID: 16524467 PMCID: PMC1434743 DOI: 10.1186/1472-6807-6-3
Source DB: PubMed Journal: BMC Struct Biol ISSN: 1472-6807
Raw scores and log-odds scores for individual folds.
| fold number | fold size | score, SSEA | score, NCL (ours) | score, CL (ours) | log-odds score, SSEA | log-odds score, NCL (ours) | log-odds score, CL (ours) | impact on the average, U | impact on the average, R | impact on the average, L |
| 1 | 242 | 0.462 | 0.461 | 0.493 | 0.748 | 0.751 | 0.813 | 2 | ||
| 2 | 27 | 0.054 | 0.089 | 0.077 | 0.293 | 0.832 | 0.659 | |||
| 3 | 11 | 0.096 | 0.087 | 0.096 | 1.276 | 1.238 | 1.274 | |||
| 4 | 2 | 1.000 | 1.000 | 1.000 | 3.952 | 4.030 | 3.945 | |||
| 6 | 40 | 0.199 | 0.303 | 0.257 | 1.366 | 1.815 | 1.619 | -1 | -1 | |
| 7 | 15 | 0.299 | 0.378 | 0.315 | 2.300 | 2.583 | 2.345 | |||
| 8 | 3 | 0.333 | 0.334 | 0.334 | 2.811 | 2.888 | 2.804 | |||
| 11 | 9 | 0.112 | 0.221 | 0.377 | 1.494 | 2.238 | 2.706 | 1 | 1 | |
| 12 | 4 | 0.505 | 0.334 | 0.151 | 3.186 | 2.843 | 1.975 | -2 | -1 | |
| 15 | 3 | 0.062 | 0.378 | 1.000 | 1.137 | 3.010 | 3.903 | 2 | 1 | |
| 17 | 3 | 0.521 | 0.833 | 0.833 | 3.257 | 3.802 | 3.720 | |||
| 18 | 29 | 0.140 | 0.126 | 0.169 | 1.215 | 1.142 | 1.399 | 1 | 1 | |
| 19 | 6 | 0.152 | 0.167 | 0.177 | 1.907 | 2.069 | 2.056 | |||
| 21 | 3 | 0.339 | 0.333 | 0.333 | 2.827 | 2.886 | 2.804 | |||
| 22 | 7 | 1.000 | 0.717 | 0.719 | 3.756 | 3.487 | 3.420 | |||
| 23 | 6 | 0.304 | 0.276 | 0.677 | 2.601 | 2.571 | 3.396 | 2 | 1 | |
| 24 | 5 | 0.183 | 0.473 | 0.850 | 2.131 | 3.149 | 3.661 | 1 | ||
| 26 | 5 | 0.104 | 0.104 | 0.150 | 1.569 | 1.633 | 1.929 | |||
| 29 | 33 | 0.361 | 0.258 | 0.581 | 2.084 | 1.779 | 2.557 | 1 | 3 | 4 |
| 30 | 17 | 0.286 | 0.299 | 0.456 | 2.200 | 2.292 | 2.663 | 1 | 1 | |
| 31 | 2 | 0.035 | 0.016 | 0.009 | 0.604 | 0.000 | 0.000 | |||
| 33 | 7 | 0.404 | 0.303 | 0.839 | 2.850 | 2.627 | 3.574 | 2 | 2 | 1 |
| 34 | 63 | 0.340 | 0.160 | 0.271 | 1.582 | 0.846 | 1.354 | 1 | 3 | 5 |
| 35 | 3 | 0.104 | 0.043 | 0.021 | 1.648 | 0.837 | 0.063 | -1 | -1 | |
| 36 | 20 | 0.638 | 0.492 | 0.680 | 2.927 | 2.709 | 2.986 | 1 | ||
| 37 | 2 | 0.008 | 0.004 | 0.000 | 0.000 | 0.000 | 0.000 | |||
| 38 | 11 | 0.610 | 0.228 | 0.418 | 3.129 | 2.201 | 2.745 | 1 | 1 | 1 |
| 39 | 2 | 1.000 | 1.000 | 1.000 | 3.952 | 4.030 | 3.945 | |||
| 40 | 90 | 0.244 | 0.277 | 0.301 | 0.973 | 1.112 | 1.182 | 1 | ||
| 41 | 2 | 1.000 | 1.000 | 1.000 | 3.952 | 4.030 | 3.945 | |||
| 42 | 24 | 0.389 | 0.309 | 0.447 | 2.340 | 2.149 | 2.475 | 1 | 1 | |
| 43 | 27 | 0.300 | 0.244 | 0.246 | 2.016 | 1.844 | 1.813 | |||
| 44 | 3 | 0.833 | 0.418 | 0.333 | 3.727 | 3.112 | 2.804 | |||
| 45 | 4 | 0.376 | 1.000 | 1.000 | 2.890 | 3.941 | 3.862 | |||
| 46 | 2 | 0.001 | 0.039 | 0.000 | 0.000 | 0.787 | 0.000 | -1 | ||
| 47 | 32 | 0.753 | 0.424 | 0.395 | 2.838 | 2.296 | 2.189 | |||
| 49 | 2 | 1.000 | 1.000 | 1.000 | 3.952 | 4.030 | 3.945 | |||
| 50 | 11 | 0.382 | 0.352 | 0.331 | 2.662 | 2.635 | 2.513 | |||
| 51 | 4 | 0.199 | 0.502 | 0.531 | 2.254 | 3.251 | 3.230 | |||
| 52 | 10 | 0.353 | 0.256 | 0.308 | 2.615 | 2.351 | 2.471 | |||
| 53 | 3 | 0.003 | 0.064 | 0.500 | 0.000 | 1.235 | 3.210 | 4 | 2 | 1 |
| 55 | 30 | 0.531 | 0.492 | 0.607 | 2.526 | 2.483 | 2.656 | |||
| 56 | 2 | 1.000 | 1.000 | 1.000 | 3.952 | 4.030 | 3.945 | |||
| 57 | 3 | 1.000 | 1.000 | 1.000 | 3.910 | 3.984 | 3.903 | |||
| 58 | 3 | 0.335 | 1.000 | 1.000 | 2.815 | 3.984 | 3.903 | |||
| 60 | 22 | 0.489 | 0.488 | 0.478 | 2.613 | 2.652 | 2.587 | |||
| 61 | 9 | 0.127 | 0.084 | 0.352 | 1.623 | 1.264 | 2.637 | 3 | 3 | 2 |
| 62 | 2 | 1.000 | 1.000 | 1.000 | 3.952 | 4.030 | 3.945 | |||
| 63 | 2 | 1.000 | 1.000 | 1.000 | 3.952 | 4.030 | 3.945 | |||
| 64 | 2 | 0.016 | 1.000 | 1.000 | 0.000 | 4.030 | 3.945 | |||
| 65 | 2 | 0.126 | 0.001 | 0.063 | 1.877 | 0.000 | 1.174 | 2 | 1 | |
| 66 | 5 | 1.000 | 0.900 | 1.000 | 3.830 | 3.793 | 3.824 | |||
| 67 | 2 | 0.002 | 0.016 | 0.000 | 0.000 | 0.000 | 0.000 | |||
| 68 | 13 | 0.550 | 0.382 | 0.552 | 2.966 | 2.653 | 2.964 | |||
| 69 | 23 | 0.747 | 0.711 | 0.750 | 3.014 | 3.005 | 3.015 | |||
| 70 | 5 | 0.475 | 0.368 | 0.591 | 3.086 | 2.900 | 3.297 | |||
| 71 | 24 | 0.342 | 0.148 | 0.281 | 2.212 | 1.412 | 2.010 | 1 | 2 | 2 |
| 72 | 10 | 0.602 | 0.454 | 0.533 | 3.148 | 2.922 | 3.021 | |||
| 74 | 4 | 1.000 | 1.000 | 1.000 | 3.869 | 3.941 | 3.862 | |||
| 76 | 2 | 0.000 | 0.531 | 0.000 | 0.000 | 3.397 | 0.000 | -8 | -3 | -1 |
| 77 | 5 | 0.331 | 0.259 | 0.303 | 2.725 | 2.549 | 2.629 | |||
| 80 | 18 | 0.302 | 0.305 | 0.431 | 2.228 | 2.282 | 2.582 | 1 | ||
| 81 | 9 | 0.559 | 0.643 | 0.591 | 3.106 | 3.304 | 3.156 | |||
| 82 | 35 | 0.173 | 0.205 | 0.399 | 1.311 | 1.512 | 2.143 | 1 | 3 | 3 |
| 83 | 2 | 0.258 | 0.031 | 0.012 | 2.597 | 0.565 | 0.000 | -1 | ||
| 84 | 14 | 0.165 | 0.143 | 0.225 | 1.735 | 1.638 | 2.038 | 1 | ||
| 85 | 14 | 0.076 | 0.070 | 0.141 | 0.959 | 0.924 | 1.571 | 1 | 1 | 1 |
| 86 | 3 | 0.355 | 1.000 | 1.000 | 2.874 | 3.984 | 3.903 | |||
| 87 | 4 | 0.167 | 0.208 | 0.501 | 2.080 | 2.373 | 3.171 | 1 | 1 | |
| 88 | 3 | 0.089 | 0.011 | 0.048 | 1.489 | 0.000 | 0.870 | 2 | 1 | |
| 91 | 2 | 0.750 | 1.000 | 1.000 | 3.664 | 4.030 | 3.945 | |||
| 92 | 2 | 0.009 | 0.008 | 0.062 | 0.000 | 0.000 | 1.172 | 2 | 1 | |
| 93 | 2 | 1.000 | 1.000 | 1.000 | 3.952 | 4.030 | 3.945 | |||
| 104 | 2 | 0.094 | 0.125 | 0.500 | 1.585 | 1.951 | 3.252 | 3 | 1 | |
| 106 | 3 | 0.006 | 0.009 | 0.003 | 0.000 | 0.000 | 0.000 | |||
| 108 | 2 | 0.250 | 0.002 | 0.000 | 2.566 | 0.000 | 0.000 | |||
| 113 | 4 | 1.000 | 0.875 | 1.000 | 3.869 | 3.807 | 3.862 | |||
| 118 | 2 | 1.000 | 1.000 | 1.000 | 3.952 | 4.030 | 3.945 | |||
| 121 | 49 | 0.211 | 0.246 | 0.305 | 1.287 | 1.465 | 1.654 | 1 | 1 | |
| 122 | 5 | 0.157 | 0.120 | 0.188 | 1.978 | 1.779 | 2.153 | |||
| 125 | 2 | 1.000 | 1.000 | 1.000 | 3.952 | 4.030 | 3.945 | |||
| Unweighted avg. (U) | | 0.417 | 0.434 | 0.501 | 2.300 | 2.368 | 2.500 | | | |
| Root-weighted avg. (R) | | 0.389 | 0.378 | 0.453 | 2.080 | 2.088 | 2.263 | | | |
| Weighted avg. (L) | | 0.376 | 0.351 | 0.421 | 1.705 | 1.675 | 1.853 | | | |
Raw scores and log-odds scores for the 81 folds that had more than one representative in our data. The SSEA score was obtained by taking the structure determinations of DSSP and computing the scores with the publicly available binary of the SSEA program. Our scores were computed using our alignment program and our own structure determinations, which were similar but not identical to those of DSSP. Averages are: unweighted (U); root weighted (R), where a fold with k proteins gets weight √k; and weighted (L), where the weight is k. "Impact on the average" shows how the respective average would change if all other folds had identical scores; we multiply this change by 200 and round toward 0; zeroes are not shown.
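The three averaging schemes described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `fold_averages` and the input layout (a list of fold size and score pairs) are assumptions made for the example, and the root weight is taken to be the square root of the fold size.

```python
import math

def fold_averages(folds):
    """Return the (U, R, L) averages of per-fold scores.

    folds: list of (fold_size, score) pairs (hypothetical input layout).
    U weights every fold equally, R weights a fold of size k by sqrt(k),
    and L weights it by k.
    """
    def weighted(weight):
        total_weight = sum(weight(k) for k, _ in folds)
        return sum(weight(k) * score for k, score in folds) / total_weight

    return (
        weighted(lambda k: 1.0),           # U: unweighted
        weighted(lambda k: math.sqrt(k)),  # R: root weighted
        weighted(lambda k: float(k)),      # L: weighted by fold size
    )

# The first four rows of the table (fold size, CL log-odds score),
# used for illustration only; the table averages cover all 81 folds.
rows = [(242, 0.813), (27, 0.659), (11, 1.274), (2, 3.945)]
u_avg, r_avg, l_avg = fold_averages(rows)
print(f"U={u_avg:.3f} R={r_avg:.3f} L={l_avg:.3f}")
```

Because large folds such as fold 1 tend to have low scores while many two-member folds score near the maximum, the L average is pulled toward the large folds and the U average toward the small ones, which is consistent with U > R > L in the table.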
Clustering scores of various methods.
| Sample | Size | averaging method | SSEA | DSSP, NCL | DSSP, CL | Ours, NCL | Ours, CL |
| ALL | 1183 | U | 2.30 | 2.27 | 2.49 | 2.36 | 2.50 |
| | | R | 2.08 | 2.07 | 2.27 | 2.09 | 2.26 |
| | | L | 1.71 | 1.70 | 1.84 | 1.68 | 1.85 |
| MEDIUM | 631 | U | 1.82 | 1.87 | 1.98 | 1.81 | 2.04 |
| | | R | 1.62 | 1.66 | 1.77 | 1.59 | 1.78 |
| | | L | 1.18 | 1.18 | 1.27 | 1.11 | 1.26 |
| LONG | 475 | U | 1.96 | 2.03 | 2.05 | 1.92 | 2.00 |
| | | R | 1.81 | 1.85 | 1.90 | 1.76 | 1.86 |
| | | L | 1.64 | 1.68 | 1.73 | 1.61 | 1.71 |
| RANDOM | 591 | U | 1.76 | 1.77 | 1.87 | 1.88 | 1.98 |
| | | R | 1.64 | 1.63 | 1.73 | 1.71 | 1.81 |
| | | L | 1.42 | 1.37 | 1.47 | 1.43 | 1.53 |
Average log-odds scores of various clustering functions. Sample MEDIUM consists of those protein domains in ALL that have between 70 and 140 residues, and LONG consists of those that are longer. RANDOM is the average over 40 samples, each obtained by randomly splitting ALL into two parts of equal expected size. Averaging methods: U is unweighted, R is weighted with the root of the fold size and L is weighted with the fold size (within a sample); in each case folds that have fewer than 2 representatives in a sample are excluded. SSEA is the score computed by the SSEA program from DSSP output, DSSP is the score obtained from DSSP output and our alignment program, and "Ours" uses our structure determination and our alignment program. Our annotations of closed loops were transferred to DSSP output to obtain the CL version of that score.
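The sample construction described in this caption might look roughly like the sketch below. It is only an illustration under stated assumptions: the function names and the (fold id, residue count) input layout are hypothetical, the 70 to 140 range is treated as inclusive, and a random part is drawn by keeping each domain independently with probability 1/2, which is one reading of "equal parts on the average".

```python
import random
from collections import Counter

def split_by_length(domains):
    """Split (fold_id, n_residues) pairs into MEDIUM and LONG samples.

    Assumption: MEDIUM includes both endpoints, 70 and 140.
    """
    medium = [d for d in domains if 70 <= d[1] <= 140]
    long_ = [d for d in domains if d[1] > 140]
    return medium, long_

def random_half(domains, rng):
    """Keep each domain with probability 1/2 (assumed splitting rule)."""
    return [d for d in domains if rng.random() < 0.5]

def folds_with_enough_members(domains, minimum=2):
    """Map fold_id to its size within the sample, dropping folds that
    have fewer than `minimum` representatives in that sample."""
    counts = Counter(fold for fold, _ in domains)
    return {fold: n for fold, n in counts.items() if n >= minimum}

# Toy data, invented for the example.
toy = [(1, 80), (1, 100), (2, 150), (3, 60), (2, 140)]
medium, long_ = split_by_length(toy)
print(folds_with_enough_members(medium))
```

Note that fold sizes are recounted inside each sample, so a fold that qualifies in ALL can drop out of MEDIUM, LONG, or a random part.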
Figure 1. Sums of average unweighted log-odds scores with weighted log-odds scores for different values of L. The value for L = 0 corresponds to NCL.
Figure 2. Ideal cases of parallel and anti-parallel beta sheets. Residue numbers are surrounded by the backbone atoms of the respective residue, differences of hydrogen bonds are positioned next to the respective bonds, and second differences are placed in boxes.