Mikael Bodén, Zheng Yuan, Timothy L Bailey.
Abstract
BACKGROUND: The structure of proteins may change as a result of the inherent flexibility of some protein regions. We develop and explore probabilistic machine learning methods for predicting a continuum secondary structure, i.e. assigning probabilities to the conformational states of a residue. We train our methods using data derived from high-quality NMR models.
Year: 2006 PMID: 16478545 PMCID: PMC1386714 DOI: 10.1186/1471-2105-7-68
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Cross-validated density prediction accuracy of PNN and CPNN models. Average KL divergence for probabilistic and cascaded probabilistic neural network models predicting continuum 3- and 8-class secondary structure from PSI-BLAST encoded sequence data. Window size and number of hidden nodes are varied. All predictions are for 10-fold cross-validation on the training set (set-174). Where standard errors are given in parentheses, the value is the mean of five randomized repeats of cross-validation.
| Classes | Window | Hidden nodes: 0 | 5 | 10 | 15 | 20 | 25 | 30 | 30 (mean of 5 repeats) |
| 3 | 11 | 0.59 | 0.57 | 0.53 | 0.52 | 0.52 | 0.52 | 0.51 | |
| | 13 | 0.58 | 0.57 | 0.53 | 0.52 | 0.51 | 0.51 | 0.51 | |
| | 15 | 0.58 | 0.56 | 0.53 | 0.52 | 0.51 | 0.50 | 0.49 | 0.47 (0.002) |
| | 17 | 0.58 | 0.56 | 0.54 | 0.52 | 0.51 | 0.51 | 0.51 | |
| | 19 | 0.59 | 0.56 | 0.54 | 0.52 | 0.51 | 0.52 | 0.51 | |
| 8 | 11 | 0.97 | 0.97 | 0.94 | 0.91 | 0.92 | 0.89 | 0.90 | |
| | 13 | 0.97 | 0.96 | 0.92 | 0.90 | 0.90 | 0.89 | 0.89 | |
| | 15 | 0.97 | 0.95 | 0.92 | 0.90 | 0.90 | 0.89 | 0.88 | 0.84 (0.002) |
| | 17 | 0.98 | 0.96 | 0.93 | 0.91 | 0.90 | 0.89 | 0.89 | |
| | 19 | 0.98 | 0.97 | 0.94 | 0.91 | 0.92 | 0.89 | 0.90 | |
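The KL divergence scores in the table above average, over all residues, the divergence between the target (NMR-derived) conformational-state distribution and the predicted one. A minimal sketch of that metric (the smoothing constant `eps` is an assumption; the paper does not specify its handling of zero probabilities):

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """D(p || q) for two discrete distributions over conformational states."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mean_kl(targets, predictions):
    """Average per-residue KL divergence, the score reported in the table."""
    return sum(kl_divergence(p, q) for p, q in zip(targets, predictions)) / len(targets)

# Two residues, 3-class (helix, strand, coil) continuum assignments
targets = [[0.8, 0.1, 0.1], [0.2, 0.1, 0.7]]
predictions = [[0.6, 0.2, 0.2], [0.3, 0.2, 0.5]]
score = mean_kl(targets, predictions)
```

Lower is better: a perfect continuum prediction has zero divergence from the target distribution.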
Cross-validated density prediction accuracy of NBDP models. Average KL divergence for the Naive Bayes' Density Predictor is shown for the 3- and 8-class prediction tasks with varying sequence window sizes. The residues in the window were described using the amino acid identity method. All predictions are for 10-fold cross-validation on the training set (set-174). Best results are shown in bold.
| Window | 3-class | 8-class |
| 5 | 0.77 | 1.20 |
| 7 | 0.75 | |
| 9 | 0.75 | |
| 11 | | 1.20 |
| 13 | 0.75 | 1.22 |
| 15 | 0.76 | 1.25 |
| 17 | 0.78 | 1.28 |
| 19 | 0.79 | 1.31 |
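The NBDP combines per-position amino acid likelihoods within the window, conditioned on the class of the central residue, into a normalized class distribution. A toy sketch of that idea (the training data, smoothing, and function names are illustrative assumptions, not the authors' implementation):

```python
import math
from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids

def train_nbdp(windows, labels, classes, alpha=1.0):
    """Naive Bayes density predictor: per-position amino acid counts
    conditioned on the class of the window's central residue."""
    n = len(windows[0])
    prior = Counter(labels)
    counts = {c: [Counter() for _ in range(n)] for c in classes}
    for w, y in zip(windows, labels):
        for i, aa in enumerate(w):
            counts[y][i][aa] += 1

    def predict(window):
        # log P(c) + sum_i log P(aa_i | position i, c), softmax-normalised
        logp = {}
        for c in classes:
            lp = math.log(prior[c] / len(labels))
            for i, aa in enumerate(window):
                total = sum(counts[c][i].values())
                lp += math.log((counts[c][i][aa] + alpha) / (total + alpha * len(AA)))
            logp[c] = lp
        z = max(logp.values())
        norm = sum(math.exp(v - z) for v in logp.values())
        return {c: math.exp(logp[c] - z) / norm for c in classes}

    return predict

# Toy usage: 3-residue windows labelled by the central residue's class
predict = train_nbdp(["AAH", "AAG", "LLE", "VLE"], ["H", "H", "E", "E"], ["H", "E"])
dist = predict("AAH")  # a probability distribution over {"H", "E"}
```

Because the output is a distribution rather than a single label, the model predicts a continuum secondary structure directly.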
Figure 1. Example 3-class continuum secondary structure predictions. The 3-class predictions of the best NBDP and CPNN models for positions 1–50 of protein PDB:1RFA are plotted. The target (known) probabilities are plotted as a dotted black line. The dashed red line is the NBDP predictions and the solid blue line is the CPNN predictions.
Cross-validated classification accuracy of all models. Average accuracy of categorical prediction in the 3- and 8-class problems is given as measured by the accuracy metric Q, the per-class Matthews correlation coefficients r, and SOV. All predictions are for 10-fold cross-validation on the training set (set-174). Where standard errors are given in parentheses, the value is the mean of five randomized repeats of cross-validation. The best results are shown in bold.
| Model | Q3 | r(H) | r(E) | r(C) | SOV | Q8 |
| NBDP | 61.2 | 0.40 | 0.34 | 0.41 | 52.9 | 46.1 |
| PNN | 76.4 (0.09) | 0.68 | 0.62 | 0.57 | 67.4 | 61.4 (0.09) |
| CPNN | | | | 0.58 | | |
| CCNN | 77.2 (0.08) | | | | 72.8 | 62.5 (0.15) |
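The Q and Matthews correlation figures above can be reproduced from categorical predictions as follows. This is a generic sketch of the standard definitions (not the authors' evaluation code); the Matthews coefficient is computed one-vs-rest for each class:

```python
import math

def q_accuracy(true, pred):
    """Percentage of residues whose predicted class matches the target (Q3/Q8)."""
    return 100.0 * sum(t == p for t, p in zip(true, pred)) / len(true)

def matthews(true, pred, cls):
    """One-vs-rest Matthews correlation coefficient for a single class."""
    tp = sum(t == cls and p == cls for t, p in zip(true, pred))
    tn = sum(t != cls and p != cls for t, p in zip(true, pred))
    fp = sum(t != cls and p == cls for t, p in zip(true, pred))
    fn = sum(t == cls and p != cls for t, p in zip(true, pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy 3-class example: H = helix, E = strand, C = coil
true = list("HHHEECCC")
pred = list("HHEEECCH")
```

SOV (segment overlap) additionally scores whole secondary structure segments rather than single residues, which is why it is reported separately.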
Density prediction accuracy (KL divergence) for structurally ambivalent residues. Average KL divergence of prediction of continuum secondary structure for residues that have a structural ambivalence equal to or exceeding an entropy of 0.0 (all residues), 0.3 and 0.5. "CV": average (standard error) of five randomized repeats of 10-fold cross-validation on the training set (set-174). "test": average error on the test dataset (set-286).
| Classes | Model | Entropy ≥ 0.0 (CV) | ≥ 0.0 (test) | ≥ 0.3 (CV) | ≥ 0.3 (test) | ≥ 0.5 (CV) | ≥ 0.5 (test) |
| 3-class | PNN | 0.49 (0.002) | 0.52 | 0.53 (0.002) | 0.59 | 0.52 (0.003) | 0.54 |
| | CPNN | 0.47 (0.002) | 0.50 | 0.53 (0.003) | 0.57 | 0.53 (0.003) | 0.53 |
| | CCNN | 0.48 (0.002) | 0.51 | 0.58 (0.002) | 0.62 | 0.59 (0.004) | 0.58 |
| 8-class | PNN | 0.88 (0.001) | 1.01 | 1.07 (0.003) | 1.26 | 1.07 (0.004) | 1.13 |
| | CPNN | 0.84 (0.002) | 0.98 | 1.03 (0.004) | 1.22 | 0.98 (0.008) | 1.15 |
| | CCNN | 0.87 (0.003) | 0.99 | 1.12 (0.004) | 1.31 | 1.10 (0.010) | 1.24 |
Classification accuracy (Q3) for structurally ambivalent residues. Average accuracy as measured by Q3 of 3-class categorical prediction for residues whose structural ambivalence equals or exceeds an entropy of 0.0 (all residues), 0.3 or 0.5. "CV": average (standard error) of five randomized repeats of 10-fold cross-validation on the training set (set-174). "test": average accuracy on the test dataset (set-286).
| Classes | Model | Entropy ≥ 0.0 (CV) | ≥ 0.0 (test) | ≥ 0.3 (CV) | ≥ 0.3 (test) | ≥ 0.5 (CV) | ≥ 0.5 (test) |
| 3-class | PNN | 76.4 (0.09) | 77.0 | 55.5 (0.18) | 50.3 | 50.2 (0.18) | 49.0 |
| | CPNN | 77.3 (0.07) | 77.8 | 55.7 (0.22) | 50.6 | 50.4 (0.24) | 49.0 |
| | CCNN | 77.2 (0.08) | 77.8 | 55.4 (0.17) | 51.5 | 50.2 (0.31) | 50.0 |
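Structural ambivalence in the two tables above is measured by the entropy of a residue's target class distribution; a residue is included in a column when its entropy meets the threshold. A sketch of that filter (the logarithm base is an assumption, since the paper's thresholds of 0.3 and 0.5 are unit-dependent):

```python
import math

def entropy(dist):
    """Shannon entropy (natural log assumed) of a residue's class distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def ambivalent(residues, threshold):
    """Keep residues whose secondary structure entropy meets the threshold."""
    return [d for d in residues if entropy(d) >= threshold]

residues = [[1.0, 0.0, 0.0],    # fully determined, entropy 0
            [0.5, 0.5, 0.0],    # two equiprobable states, entropy ln 2
            [0.9, 0.05, 0.05]]  # mildly ambivalent
```

A threshold of 0.0 keeps every residue, which is why the first column pair of each table matches the overall cross-validation and test results.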
Figure 2. KL divergence as a function of test dataset residue entropy. The dashed red and solid blue lines show the KL divergence of predictions on the test dataset (set-286) made by the CCNN and CPNN models, respectively. Residues are binned by secondary structure entropy, and the mean KL divergence of residues in a bin is plotted at the midpoint of the bin. Error bars show plus and minus one standard error around each mean. The numbers of residues in the bins for the 3-class problem are (in order of increasing entropy) 26357, 948, 839, 525, 14, 1036, 489, 4 and 2. For the 8-class problem bin occupancies are 25657, 1878, 777, 1728, 127, 43, 4 and 0.
Figure 3. The architecture of the Cascaded Probabilistic Neural Network.