Hsin-Nan Lin, Ting-Yi Sung, Shinn-Ying Ho, Wen-Lian Hsu.
Abstract
BACKGROUND: When characterizing the structural topology of proteins, protein secondary structure (PSS) plays an important role in analyzing and modeling protein structures because it represents the local conformation of amino acids into regular structures. Although PSS prediction has been studied for decades, the prediction accuracy reaches a bottleneck at around 80%, and further improvement is very difficult.
Year: 2010 PMID: 21143813 PMCID: PMC3005913 DOI: 10.1186/1471-2164-11-S4-S4
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. A local sequence alignment derived by PSI-BLAST. The identical residues are labelled with letters and conserved substitutions are labelled with + symbols. The alignment in this example shows that the sequence fragment from position 7 to position 46 of the query sequence is very similar to that from position 3 to position 42 in the subject sequence. It is assumed that the two sequences have a similar semantic relation because they form a significant sequence alignment.
Figure 2. The procedure used to extract protein words and their synonymous words for a given query protein p (assuming the window size n is 4). We use a sliding window to scan the query sequence and all the similar protein sequences found by PSI-BLAST, and extract all words. Each word is associated with the structural information of the region from which it is extracted. The protein source of all the extracted words is the query protein p, since all the structural information is derived from p.
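The sliding-window step in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_words`, the example sequence, and its secondary-structure string are hypothetical, and the step that maps words from PSI-BLAST hits back onto the query's structure through the alignment is omitted.

```python
def extract_words(sequence, structure, n=4):
    """Slide a window of length n over a sequence and pair each word
    with the secondary-structure string of the region it came from."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    return [
        (sequence[i:i + n], structure[i:i + n])
        for i in range(len(sequence) - n + 1)
    ]


# Hypothetical query fragment and its secondary structure (H = helix, C = coil).
words = extract_words("WGPVKA", "HHHHCC", n=4)
for word, ss in words:
    print(word, ss)
```

A sequence of length L yields L - n + 1 overlapping words, so every residue except those near the ends is covered by n words.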
An example of a synonymous word entry in SynonymDict (synonymous word: WGPV).
| Protein Source | Secondary Structure | Similarity Level | Frequency |
|---|---|---|---|
| A | HHHH | 3 | 7 |
| B | HHCH | 4 | 11 |
| C | CHHH | 2 | 3 |
An example of a synonymous word entry in SynonymDict (assuming the word length n = 4). WGPV is a synonymous word of proteins A, B and C, since it is extracted from the similar proteins of A, B and C. We record the structural information of protein sources to the corresponding synonymous words, and calculate the corresponding similarity levels and frequencies. For example, the similarity level of WGPV in terms of protein source A is 3 and the frequency is 7.
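One way to picture such an entry is as a nested mapping from a word to its protein sources. The sketch below is an assumed representation, not the authors' data structure: the `Entry` type and `lookup` helper are hypothetical names, and the assignment of the second and third rows to proteins B and C is inferred from the order in which the caption lists them. Only the values for protein A are confirmed by the text.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    secondary_structure: str
    similarity_level: int
    frequency: int


# word -> {protein source -> Entry}
synonym_dict = {
    "WGPV": {
        "A": Entry("HHHH", 3, 7),   # stated in the caption
        "B": Entry("HHCH", 4, 11),  # row order assumed
        "C": Entry("CHHH", 2, 3),   # row order assumed
    }
}


def lookup(word, dictionary):
    """Return the entries recorded for a word, or an empty dict if absent."""
    return dictionary.get(word, {})


entry = lookup("WGPV", synonym_dict)["A"]
print(entry.similarity_level, entry.frequency)
```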
Performance comparison of SymPred, SymPsiPred, and PROSP on the DsspNr-25 dataset.
| Method | Q3 | Q3Ho | Q3Eo | Q3Co | sov | sovH | sovE | sovC |
|---|---|---|---|---|---|---|---|---|
| SymPred* | 81.0 | 84.3 | 71.6 | 77.7 | 76.0 | 82.5 | 76.9 | 70.7 |
| SymPred+ | 80.5 | 84.1 | 70.9 | 77.5 | 75.6 | 82.3 | 76.4 | 70.3 |
| SymPsiPred | 83.9 | 81.5 | 75.8 | 83.9 | 80.2 | 82.3 | 80.3 | 76.5 |
| PROSP | 75.1 | 79.7 | 67.6 | 71.3 | 68.7 | 77.0 | 73.0 | 63.4 |
Q3Ho (Q3Eo, Q3Co) is the percentage of observed helix (strand, coil) residues that are predicted correctly. The sovH, sovE and sovC values are the SOV accuracies for predicted helix, strand and coil, respectively. SymPred* denotes the results obtained with leave-one-out cross-validation and SymPred+ the results obtained with 10-fold cross-validation.
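The per-state measures defined in this note can be sketched in a few lines. The function name `q3_scores` and the example strings are hypothetical, and Q3 is assumed here to be the usual overall percentage of residues whose three-state label is predicted correctly; the SOV measures are not covered.

```python
def q3_scores(observed, predicted):
    """Compute Q3 and the per-state Q3Ho, Q3Eo, Q3Co percentages
    from observed and predicted three-state strings (H, E, C)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(observed, predicted))
    correct = sum(o == p for o, p in pairs)
    scores = {"Q3": 100.0 * correct / len(observed)}
    for state in "HEC":
        n_obs = observed.count(state)
        n_hit = sum(o == p == state for o, p in pairs)
        # None when the state never occurs in the observed structure.
        scores[f"Q3{state}o"] = 100.0 * n_hit / n_obs if n_obs else None
    return scores


# Hypothetical observed and predicted structures for an 8-residue fragment.
scores = q3_scores("HHHHEECC", "HHHCEECC")
print(scores)
```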
The Q3 accuracies of SymPred using exact and inexact matchings on different word lengths.
| Matching | | | | |
|---|---|---|---|---|
| Exact | 78.2 | 80.1 | 78.1 | 76.2 |
| Inexact | 74.9 | 79.2 | 80.5 | 79.0 |
The Q3 accuracy comparison of SymPred using dictionaries compiled from different percentages of the template proteins.
| Percentage of template pool | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|
| Number of template proteins | 830 | 1660 | 2490 | 3320 | 4150 | 4980 | 5809 | 6638 | 7467 | 8297 |
| Q3 accuracy (%) | 70.8 | 73.6 | 75.0 | 76.3 | 77.3 | 78.1 | 78.7 | 79.3 | 79.8 | 80.5 |
| Improvement | - | +2.8 | +1.4 | +1.3 | +1.0 | +0.8 | +0.6 | +0.6 | +0.5 | +0.7 |
The performance improves as the number of template proteins increases. SymPred’s Q3 accuracy improves by 0.5 to 2.8 percentage points each time a further 10% of the template pool is added.
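The Improvement row can be checked directly from the Q3 values in the table. The snippet below is only a small arithmetic check; the variable names are illustrative.

```python
# Q3 accuracies for 10%, 20%, ..., 100% of the template pool (from the table).
q3 = [70.8, 73.6, 75.0, 76.3, 77.3, 78.1, 78.7, 79.3, 79.8, 80.5]

# Gain in percentage points for each additional 10% of templates.
improvements = [round(b - a, 1) for a, b in zip(q3, q3[1:])]

print(improvements)
print(min(improvements), max(improvements))
```

The gains shrink as the pool grows, from 2.8 points for the second 10% to under one point for each of the last five steps.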
Comparison of SymPred’s prediction performance on different-sized template pools.
| Number of template proteins | 8297 | 12975 | 16391 |
|---|---|---|---|
| Synonymous dictionary | | | |
Figure 3. Relationship between confidence levels and Q3 accuracies. The correlation coefficient between the confidence levels and Q3 accuracies for SymPred is 0.992.
The prediction performance of different methods on the EVA benchmark datasets.
| Method | Q3 | sov | sovH | sovE | sovC |
|---|---|---|---|---|---|
| SymPred | 78.8 ± 1.4 | 76.4 ± 1.9 | 85.0 | 76.5 | 70.4 |
| SAM-T99sec | 77.2 ± 1.2 | 74.6 ± 1.5 | 80.9 | 72.5 | 71.2 |
| PSIPRED | 76.8 ± 1.4 | 75.4 ± 2.0 | 82.1 | 72.3 | 65.2 |
| PROFsec | 75.5 ± 1.4 | 74.9 ± 1.9 | 78.3 | 75.9 | 71.3 |
| PHDpsi | 73.4 ± 1.4 | 69.5 ± 1.9 | 73.7 | 73.9 | 65.2 |

| Method | Q3 | sov | sovH | sovE | sovC |
|---|---|---|---|---|---|
| SymPred | 79.2 ± 0.9 | 76.0 ± 1.2 | 85.1 | 77.7 | 71.3 |
| PSIPRED | 77.8 ± 0.8 | 75.4 ± 1.1 | 80.6 | 72.6 | 70.4 |
| PROFsec | 76.7 ± 0.8 | 74.8 ± 1.1 | 79.2 | 76.2 | 71.8 |
| PHDpsi | 75.0 ± 0.8 | 70.9 ± 1.2 | 77.0 | 72.4 | 67.0 |
The sovH, sovE and sovC values are the SOV accuracies for predicted helix, strand and coil, respectively. The prediction results of the other methods on EVA_Set1 and EVA_Set2 are reported at http://cubic.bioc.columbia.edu/eva/sec/common3.html.
The relationship between the number of distinct synonymous words and the prediction performance.
| Selection criterion | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Number of selected proteins | 8297 | 7983 | 7252 | 6660 | 6178 | 5637 | 5035 | 4378 |
| SymPred | 81.0 | 81.6 | 82.3 | 82.8 | 83.1 | 83.3 | 83.4 | 83.5 |
| SymPsiPred | 83.9 | 84.3 | 84.8 | 85.1 | 85.2 | 85.3 | 85.4 | 85.5 |
For each test protein t of length L in DsspNr-25, let v denote the number of distinct synonymous words of t, and define e = v/L, the number of distinct synonymous words per residue. The protein t is selected if e is greater than or equal to a given threshold. The results show a positive correlation between the number of distinct synonymous words and the prediction performance of SymPred and SymPsiPred.
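The selection rule in this note amounts to a simple filter on e = v/L. The sketch below assumes v and L are already known for each protein; the function names, protein identifiers, counts, and threshold are all hypothetical, since the thresholds actually used are not given here.

```python
def multiplicity(num_distinct_words, length):
    """Return e = v / L for a protein with v distinct synonymous words
    and sequence length L."""
    if length <= 0:
        raise ValueError("length must be positive")
    return num_distinct_words / length


def select_proteins(proteins, threshold):
    """Keep the proteins whose e = v / L is at least the threshold.
    `proteins` maps a protein id to a (v, L) pair."""
    return [
        pid for pid, (v, length) in proteins.items()
        if multiplicity(v, length) >= threshold
    ]


# Hypothetical (v, L) values for three test proteins.
proteins = {"t1": (500, 100), "t2": (90, 120), "t3": (300, 150)}
selected = select_proteins(proteins, threshold=2.0)
print(selected)
```

Raising the threshold keeps fewer proteins, which matches the shrinking "Number of selected proteins" row in the table.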
Figure 4. The distribution of synonymous words shared by 1aab and 1j46_A. The x- and y-axes represent the sequences of 1j46_A and 1aab, respectively. A grayscale pixel represents the number of shared synonymous words corresponding to a residue pair (x_i, y_j), where x_i is the i-th residue of 1j46_A and y_j is the j-th residue of 1aab. Box B is a zoom-in of Box A. The red lines indicate the alignment based on the number of shared synonymous words, which is very close to the alignment reported in Balibase for the two proteins. Notably, the path of the darker pixels matches the suggested alignment almost perfectly.