Xin-Qiu Yao, Huaiqiu Zhu, Zhen-Su She.
Abstract
BACKGROUND: Protein secondary structure prediction methods based on probabilistic models such as the hidden Markov model (HMM) appeal to many because they provide meaningful information about the sequence-structure relationship. However, at present, the prediction accuracy of pure HMM-type methods is much lower than that of machine learning-based methods such as neural networks (NN) or support vector machines (SVM).
Year: 2008 PMID: 18218144 PMCID: PMC2266706 DOI: 10.1186/1471-2105-9-49
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The influence of window sizes on prediction accuracy. L_AA and L_SS are the window sizes for profile and secondary structure, respectively. The results are obtained by testing DBNsigmoid on the SD576 dataset.
Figure 2. Illustration of the DBN model. (a) An example of a PSSM, where rows represent residue sites and columns represent amino acids. The "SS" column contains the secondary structure of each site, classified as H (helix), E (sheet), or C (coil). (b) A graphical representation of the DBN. Shaded nodes represent observable random variables, while clear nodes represent variables that are hidden during prediction. Arcs with arrows represent dependencies between nodes. The contents of the nodes R, AA, d, and SS are derived as illustrated by the dashed connecting lines, where the subscript indicates the residue site. A more detailed description of R, AA, d, SS, D, and F can be found in the text. L_AA and L_SS are the window sizes for profile and secondary structure, respectively (in this example, L_AA = 4 and L_SS = 2). (c) A reduced version of (b) with L_AA = 0 and L_SS = 0.
Performance of basic DBN and NN models and their combinations tested on SD576.
| Model | Q3 (%) | SOV (%) | C_H | C_E | C_C |
| DBNlinear+NC | 75.1 | 74.0 | 0.69 | 0.60 | 0.55 |
| DBNlinear+CN | 74.6 | 73.3 | 0.68 | 0.61 | 0.53 |
| DBNlinear | 77.0 | 75.8 | 0.72 | 0.64 | 0.58 |
| DBNsigmoid+NC | 75.8 | 74.5 | 0.72 | 0.60 | 0.56 |
| DBNsigmoid+CN | 74.6 | 73.3 | 0.69 | 0.61 | 0.54 |
| DBNsigmoid | 77.4 | 75.9 | 0.74 | 0.64 | 0.59 |
| DBNfinal | 78.2 | 76.8 | 0.74 | 0.65 | 0.60 |
| NNlinear | 77.6 | 73.2 | 0.72 | 0.64 | 0.60 |
| NNsigmoid | 77.1 | 71.0 | 0.72 | 0.63 | 0.59 |
| NNfinal | 77.8 | 73.3 | 0.73 | 0.64 | 0.60 |
| DBNN | 80.0 | 78.1 | 0.77 | 0.68 | 0.63 |
All eleven models listed in the table are described in Methods. The values shown are averages over seven-fold cross-validation.
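The first numeric column of the table is a three-state per-residue accuracy. As a minimal sketch of how such a Q3 score is commonly computed (the function name `q3_score` and the example strings are illustrative assumptions, not taken from the paper), one can count matching H/E/C labels:

```python
def q3_score(observed: str, predicted: str) -> float:
    """Percentage of residues whose three-state label (H, E, or C) matches.

    Hypothetical helper for illustration; the paper's own evaluation code
    is not shown in this record.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Made-up example: 15 residues, 2 of which are mispredicted.
observed = "CCHHHHHCCEEEECC"
predicted = "CCHHHHCCCEEECCC"
print(round(q3_score(observed, predicted), 1))
```

Whether the reported figures are averaged per residue or per protein is not stated here, so this sketch only illustrates the per-sequence calculation.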
Figure 3. Segment length distributions of helices, sheets, and coils. (a) The observed distributions calculated directly from the SD576 dataset. The inset shows lin-log plots of the distributions, where the lines are exponential fits to the tails for the three types of secondary structure segments. (b) Comparison of the helix distribution observed in the dataset with those predicted by DBNfinal and DBNgeo. (c) Comparison of the observed and predicted distributions for sheets. (d) Comparison of the observed and predicted distributions for coils.
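A segment length distribution of the kind plotted in Figure 3 can be obtained by collecting the lengths of maximal runs of one state in secondary structure strings. The following is a rough sketch under that assumption; `segment_length_distribution` and the toy strings are hypothetical, and the paper's exact procedure may differ.

```python
from collections import Counter
from itertools import groupby


def segment_length_distribution(ss_strings, state):
    """Return {segment length: relative frequency} for one state (H, E, or C).

    A segment is taken to be a maximal run of identical labels; this is an
    assumption made for illustration.
    """
    counts = Counter()
    for ss in ss_strings:
        for label, run in groupby(ss):
            if label == state:
                counts[sum(1 for _ in run)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: n / total for length, n in sorted(counts.items())}


# Toy input: helix segments of length 4, 4, and 2.
toy = ["CCHHHHCCEEECHHHH", "HHCCC"]
print(segment_length_distribution(toy, "H"))
```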
Performance of DBNgeo, DBNfinal, and DBNmod tested on SD576.
| Model | Q3 (%) | SOV (%) | Relative entropy, helix (bit) | Relative entropy, sheet (bit) | Relative entropy, coil (bit) | Relative entropy, average (bit) |
| DBNgeo | 76.7 | 74.3 | 0.247 | 0.170 | 0.290 | 0.236 |
| DBNfinal | 78.2 | 76.8 | 0.236 | 0.096 | 0.210 | 0.181 |
| DBNmod | 78.2 | 76.3 | 0.214 | 0.038 | 0.110 | 0.121 |
Seven-fold cross-validation results for three models with different segment length distributions, which are explained in the text. Performance is measured by Q3, SOV, and the relative entropy between the segment length distributions observed in SD576 and those predicted by the model [Eq. (2)]. DBNfinal and DBNmod improve on DBNgeo in every measure: both have higher Q3 and SOV and lower relative entropies for all three segment types.
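Eq. (2) is not reproduced in this record. As a hedged illustration, the sketch below uses the standard relative entropy (Kullback-Leibler divergence) in bits, D(P || Q) = sum over l of P(l) log2(P(l) / Q(l)), with the observed distribution as P and the predicted one as Q. The direction of the divergence, the pseudocount, and the name `relative_entropy_bits` are assumptions rather than the authors' definition.

```python
import math


def relative_entropy_bits(observed, predicted, pseudocount=1e-6):
    """Kullback-Leibler divergence D(observed || predicted) in bits.

    Both arguments map segment length -> probability. A small pseudocount
    (an assumption, not from the paper) avoids division by zero for lengths
    that occur in only one of the two distributions.
    """
    lengths = set(observed) | set(predicted)
    p = {l: observed.get(l, 0.0) + pseudocount for l in lengths}
    q = {l: predicted.get(l, 0.0) + pseudocount for l in lengths}
    p_total = sum(p.values())
    q_total = sum(q.values())
    return sum(
        (p[l] / p_total) * math.log2((p[l] / p_total) / (q[l] / q_total))
        for l in lengths
    )


# Toy distributions over segment lengths 1 and 2.
obs = {1: 0.5, 2: 0.5}
pred = {1: 0.25, 2: 0.75}
print(round(relative_entropy_bits(obs, pred), 3))
```

A value of zero means the predicted distribution matches the observed one exactly, so smaller entries in the table indicate a closer match.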
Comparative performance of DBNfinal and DBNdiag against leading HMM-type methods tested on CB513.
| Method | Q3 (%) | SOV (%) | C_H | C_E | C_C |
| HMMCrooks | 72.8 | -- | -- | -- | -- |
| HMMChu | 72.2 | 68.3 | 0.61 | 0.52 | 0.51 |
| DBNdiag/ErrSig | 72.5/0.42 | 65.9/0.63 | 0.66/0.01 | 0.55/0.01 | 0.51/0.01 |
| DBNfinal/ErrSig | 76.3/0.41 | 72.7/0.63 | 0.71/0.01 | 0.61/0.01 | 0.57/0.01 |
DBNfinal and DBNdiag are methods developed in this work and their descriptions can be found in the text. Entries marked with "--" mean that the data could not be obtained from the literature. HMMChu has been trained and tested on the CB480 dataset (a reduced version of CB513), while all other methods have been trained and tested on the CB513 dataset. The average results of seven-fold cross-validation are shown.
Comparative performance of DBNN against other popular methods tested on CB513.
| Method | Q3 (%) | SOV (%) | C_H | C_E | C_C |
| SVM | 73.5 | -- | 0.65 | 0.53 | 0.54 |
| PMSVM | 75.2 | -- | 0.71 | 0.61 | 0.61 |
| SVMpsi | 76.6 | 73.5 | 0.68 | 0.60 | 0.56 |
| JNET | 76.9 | -- | -- | -- | -- |
| YASSPP | 77.8 | 75.1 | 0.58 | 0.64 | 0.71 |
| †SPINE | 76.8 | -- | -- | -- | -- |
| DBNN/ErrSig | 78.1/0.41 | 74.0/0.62 | 0.74/0.01 | 0.64/0.01 | 0.60/0.01 |
| †DBNN/ErrSig | 78.0/0.40 | 74.0/0.62 | 0.74/0.01 | 0.64/0.01 | 0.60/0.01 |
The description of DBNN can be found in Methods. Entries marked with "--" mean that the data could not be obtained from the literature. JNET has been trained and tested on the CB480 dataset (a reduced version of CB513), while all other methods have been trained and tested on the CB513 dataset. Methods marked with "†" have been evaluated using ten-fold cross-validation, while the others have been evaluated using seven-fold cross-validation.
Comparative performance of DBNN and consensus methods against other leading methods tested on EVAc6.
| Method | Q3 (%) | SOV (%) | C_H | C_E | C_C |
| Prospect | 71.1 | 68.7 | 0.59 | 0.69 | 0.49 |
| DBNN/ErrSig | 78.8/1.34 | 74.8/1.74 | 0.72/0.03 | 0.64/0.04 | 0.62/0.02 |
| PROF_king | 71.7 | 66.9 | 0.62 | 0.68 | 0.49 |
| DBNN/ErrSig | 77.3/0.86 | 71.9/1.27 | 0.71/0.02 | 0.64/0.03 | 0.57/0.02 |
| SAM-T99 | 77.1 | 74.4 | 0.66 | 0.68 | 0.53 |
| DBNN/ErrSig | 77.3/0.86 | 71.9/1.28 | 0.71/0.02 | 0.64/0.02 | 0.57/0.02 |
| PSIPRED | 77.8 | 75.4 | 0.69 | 0.74 | 0.56 |
| PROFsec | 76.7 | 74.8 | 0.68 | 0.72 | 0.56 |
| PHDpsi | 75.0 | 70.9 | 0.66 | 0.69 | 0.53 |
| DBNN/ErrSig | 77.8/0.79 | 72.4/1.16 | 0.71/0.02 | 0.65/0.02 | 0.58/0.01 |
| SAM-T99 | 76.3 | 72.9 | 0.71 | 0.64 | 0.56 |
| PSIPRED | 75.8 | 72.1 | 0.70 | 0.64 | 0.57 |
| PROFsec | 75.3 | 73.0 | 0.68 | 0.61 | 0.54 |
| PHDpsi | 73.3 | 69.2 | 0.66 | 0.56 | 0.52 |
| PROF_king | 70.7 | 64.9 | 0.63 | 0.57 | 0.50 |
| DBNN/ErrSig | 76.4/1.48 | 72.4/2.06 | 0.73/0.04 | 0.67/0.04 | 0.59/0.03 |
| CM1/ErrSig | 77.2/1.14 | 73.2/1.87 | 0.73/0.04 | 0.66/0.04 | 0.58/0.02 |
| CM2/ErrSig | 77.7/1.17 | 73.4/1.78 | 0.74/0.04 | 0.67/0.04 | 0.60/0.02 |
| CM3/ErrSig | 78.1/1.17 | 74.4/1.76 | 0.75/0.04 | 0.67/0.04 | 0.60/0.02 |
DBNN and the three consensus methods (CM1, CM2, and CM3) developed in this work are compared with other leading methods on five subsets of EVAc6; each comparison is carried out on the maximum number of common sequences. The results of the six existing methods (Prospect, PROF_king, SAM-T99, PROFsec, PHDpsi, and PSIPRED) are taken directly from the EVA website.
Calculated t-values for differences in accuracy scores.
| Method X \ Method Y | PROF_king | SAM-T99 | PSIPRED | PROFsec | PHDpsi | DBNN | CM1 | CM2 | CM3 |
| Q3 |
| PROF_king | -- | -4.70 | -3.99 | -3.56 | -1.88 | -4.52 | -6.19 | -6.93 | -6.88 |
| SAM-T99 | 4.70* | -- | 0.50 | 0.93 | 2.45* | -0.16 | -1.41 | -2.09 | -3.02 |
| PSIPRED | 3.99* | -0.50 | -- | 0.53 | 2.18* | -0.63 | -2.01 | -2.62 | -3.38 |
| PROFsec | 3.56* | -0.93 | -0.53 | -- | 2.31* | -0.94 | -2.87 | -3.22 | -3.72 |
| PHDpsi | 1.88* | -2.45 | -2.18 | -2.31 | -- | -2.48 | -4.55 | -5.11 | -5.10 |
| DBNN | 4.52* | 0.16 | 0.63 | 0.94 | 2.48* | -- | -0.91 | -1.61 | -2.50 |
| CM1 | 6.19* | 1.41 | 2.01* | 2.87* | 4.55* | 0.91 | -- | -1.65 | -2.82 |
| CM2 | 6.93* | 2.09* | 2.62* | 3.22* | 5.11* | 1.61 | 1.65 | -- | -1.48 |
| CM3 | 6.88* | 3.02* | 3.38* | 3.72* | 5.10* | 2.50* | 2.82* | 1.48 | -- |
| SOV |
| PROF_king | -- | -4.05 | -3.89 | -3.80 | -1.99 | -3.69 | -5.30 | -5.66 | -5.86 |
| SAM-T99 | 4.05* | -- | 0.54 | -0.06 | 2.43* | 0.36 | -0.20 | -0.35 | -1.21 |
| PSIPRED | 3.89* | -0.54 | -- | -0.62 | 1.77* | -0.19 | -0.97 | -1.22 | -2.57 |
| PROFsec | 3.80* | 0.06 | 0.62 | -- | 2.93* | 0.37 | -0.15 | -0.28 | -1.12 |
| PHDpsi | 1.99* | -2.43 | -1.77 | -2.93 | -- | -1.67 | -3.30 | -3.30 | -3.82 |
| DBNN | 3.69* | -0.36 | 0.19 | -0.37 | 1.67* | -- | -0.58 | -0.83 | -1.83 |
| CM1 | 5.30* | 0.20 | 0.97 | 0.15 | 3.30* | 0.58 | -- | -0.27 | -2.03 |
| CM2 | 5.66* | 0.35 | 1.22 | 0.28 | 3.30* | 0.83 | 0.27 | -- | -2.55 |
| CM3 | 5.86* | 1.21 | 2.57* | 1.12 | 3.82* | 1.83* | 2.03* | 2.55* | -- |
The t-values are calculated for the differences in accuracy scores between "method X" and "method Y" (x - y) tested on EVAc6 subset 5; because the difference is taken as x - y, t(X, Y) = -t(Y, X). The descriptions of DBNN, CM1, CM2, and CM3 can be found in the text. Values marked with an asterisk are significant (calculated t > tabulated t). The tabulated t = 1.67 for α = 0.05 and 72 degrees of freedom.
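The record does not spell out how the t-values were computed. A paired t-test on per-sequence score differences is one plausible reading, since 72 degrees of freedom would correspond to 73 common sequences. The sketch below follows that assumption: `paired_t_value` and the four example scores are hypothetical and are not data from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_value(scores_x, scores_y):
    """t statistic for the mean of per-sequence differences x - y.

    Assumes both lists hold scores for the same sequences in the same order.
    The degrees of freedom are len(scores_x) - 1.
    """
    if len(scores_x) != len(scores_y):
        raise ValueError("both methods must be scored on the same sequences")
    diffs = [x - y for x, y in zip(scores_x, scores_y)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two paired scores are required")
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n))


# Made-up per-sequence accuracies for two methods on four sequences.
x = [80.0, 75.0, 70.0, 85.0]
y = [78.0, 76.0, 66.0, 80.0]
t_xy = paired_t_value(x, y)
t_yx = paired_t_value(y, x)
print(round(t_xy, 3), round(t_yx, 3))
```

Swapping the two methods flips the sign of every difference, which is why the entry for (X, Y) in the table is the negative of the entry for (Y, X).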