Bernard Chen, Matthew Johnson.
Abstract
BACKGROUND: Understanding the relationship between the protein sequence and the 3D structure is a major research area in bioinformatics. The prediction of complete protein tertiary structure based only on sequence information is still impractical. This paper aims at revealing the hidden knowledge of sequence motifs and the local tertiary structure.
Year: 2009 PMID: 19811680 PMCID: PMC3226186 DOI: 10.1186/1471-2105-10-S11-S15
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The sketch of the Super Granule Support Vector Machine (Super GSVM).
The improvement of the number of high-quality sequence clusters in each information granule
|  | G0 | G1 | G2 | G3 | G4 | G5 | G6 | G7 | G8 | G9 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Total number of clusters | 151 | 76 | 95 | 72 | 70 | 133 | 143 | 5 | 48 | 6 | 799 |
| Before extraction: 60%~70% | 36 | 24 | 24 | 28 | 32 | 31 | 35 | 2 | 15 | 4 | 231 |
| Before extraction: 70%~80% | 21 | 3 | 12 | 4 | 4 | 24 | 20 | 0 | 0 | 0 | 88 |
| Before extraction: >80% | 7 | 0 | 7 | 0 | 0 | 4 | 6 | 0 | 0 | 0 | 24 |
| After extraction: 60%~70% | 44 | 30 | 31 | 30 | 42 | 39 | 40 | 3 | 24 | 4 | 287 |
| After extraction: 70%~80% | 27 | 17 | 17 | 16 | 16 | 27 | 30 | 1 | 4 | 1 | 156 |
| After extraction: >80% | 26 | 2 | 19 | 2 | 1 | 26 | 23 | 0 | 0 | 1 | 100 |
In this work, we use the Super GSVM to generate and extract sequence clusters. We divided the whole training dataset into 10 information granules; the "Total number of clusters" row gives the number of clusters in each information granule (details can be found in [11]). The table also shows the number of clusters belonging to the excellent (>80% secondary structural similarity), good (70%~80%), and fair (60%~70%) categories before and after the Super GSVM extraction.
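The quality bands above can be expressed as a simple binning rule. The following is a minimal sketch (not the authors' code); the `quality_bin` function and the example similarity values are hypothetical, with thresholds taken from the bands used in the table:

```python
def quality_bin(similarity):
    """Assign a cluster to a quality band by its secondary-structure
    similarity, expressed as a fraction in [0, 1]."""
    if similarity > 0.80:
        return "excellent"
    if similarity > 0.70:
        return "good"
    if similarity > 0.60:
        return "fair"
    return "low"

# Hypothetical per-cluster similarities for one information granule.
similarities = [0.85, 0.72, 0.64, 0.55, 0.91]
counts = {}
for s in similarities:
    counts[quality_bin(s)] = counts.get(quality_bin(s), 0) + 1
print(counts)  # {'excellent': 2, 'good': 1, 'fair': 1, 'low': 1}
```

Applied per granule before and after extraction, such a binning yields exactly the kind of counts reported in the table.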
Prediction accuracy on three clustering groups. Since different distance thresholds and different clustering groups yield distinct prediction accuracies, Table 2 gives a detailed report.
| Distance Threshold | Excellent | Good | Fair |
|---|---|---|---|
| 550 | 71.98% | 58.42% | 52.89% |
| 600 | 69.07% | 57.16% | 52.33% |
| 650 | 69.08% | 57.77% | 51.49% |
| 700 | 69.47% | 57.08% | 50.31% |
| 750 | 69.40% | 56.82% | 49.98% |
| 800 | 69.85% | 56.30% | 49.65% |
| 850 | 70.02% | 55.34% | 49.37% |
| 900 | 69.75% | 54.78% | 48.95% |
| 950 | 69.53% | 53.83% | 48.51% |
| 1000 | 68.88% | 53.06% | 48.15% |
| 1050 | 68.26% | 52.34% | 47.78% |
| 1100 | 67.63% | 51.56% | 47.46% |
| 1150 | 67.09% | 51.65% | 47.14% |
| 1200 | 66.54% | 50.75% | 46.92% |
| 1250 | 66.11% | 50.45% | 46.69% |
| 1300 | 65.73% | 50.20% | 46.48% |
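The trend in Table 2 can be checked programmatically. This minimal Python sketch (values excerpted from the table; the dictionary layout is an assumption, not the authors' data format) finds the threshold with the highest accuracy for each group:

```python
# Excerpt of Table 2: distance threshold -> accuracy (%) for the
# Excellent, Good, and Fair clustering groups.
accuracy = {
    550: (71.98, 58.42, 52.89),
    850: (70.02, 55.34, 49.37),
    1300: (65.73, 50.20, 46.48),
}
groups = ("Excellent", "Good", "Fair")
for i, group in enumerate(groups):
    best = max(accuracy, key=lambda t: accuracy[t][i])
    print(f"{group}: best threshold {best} ({accuracy[best][i]}%)")
```

For all three groups the smallest threshold, 550, gives the highest accuracy, matching the monotone decline visible in Table 2.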
Prediction coverage on three clustering groups. Table 3 shows the prediction coverage of the three clustering groups under different distance thresholds.
| Distance Threshold | Excellent | Good | Fair |
|---|---|---|---|
| 550 | 0.14% | 0.09% | 0.20% |
| 600 | 0.38% | 0.27% | 0.53% |
| 650 | 0.78% | 0.61% | 1.09% |
| 700 | 1.37% | 1.16% | 1.95% |
| 750 | 2.21% | 1.99% | 3.26% |
| 800 | 3.26% | 3.15% | 5.05% |
| 850 | 4.45% | 4.64% | 7.46% |
| 900 | 5.79% | 6.51% | 10.53% |
| 950 | 7.19% | 8.73% | 14.35% |
| 1000 | 8.61% | 11.19% | 18.83% |
| 1050 | 10.05% | 13.82% | 23.80% |
| 1100 | 11.46% | 16.54% | 29.08% |
| 1150 | 12.77% | 19.25% | 34.45% |
| 1200 | 13.95% | 21.80% | 39.58% |
| 1250 | 14.99% | 24.03% | 44.12% |
| 1300 | 15.83% | 25.88% | 47.73% |
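Tables 2 and 3 together expose the trade-off: raising the distance threshold increases coverage but lowers accuracy. As an illustration only (this scoring rule is not from the paper), a harmonic mean of accuracy and coverage can be used to pick a balanced threshold for the Excellent group:

```python
# Accuracy/coverage pairs for the Excellent group, excerpted from
# Tables 2 and 3: threshold -> (accuracy %, coverage %).
excellent = {
    550: (71.98, 0.14),
    900: (69.75, 5.79),
    1300: (65.73, 15.83),
}

def harmonic_mean(a, b):
    """Harmonic mean; small when either quantity is small."""
    return 2 * a * b / (a + b)

# Hypothetical selection rule, NOT the authors' method: at these
# coverage levels the harmonic mean is dominated by coverage, so the
# largest excerpted threshold scores best.
best = max(excellent, key=lambda t: harmonic_mean(*excellent[t]))
print(best)  # 1300
```

Because coverage stays far below accuracy across the whole table, any balanced criterion of this kind will favor large thresholds; a practical choice depends on how much accuracy loss an application tolerates.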
Figure 2. The sketch of the Fuzzy Greedy K-means (FGK) Model.