Bent Petersen, Thomas Nordahl Petersen, Pernille Andersen, Morten Nielsen, Claus Lundegaard.
Abstract
BACKGROUND: Estimating the reliability of individual real-valued predictions is nontrivial, and the efficacy of such estimates is often questionable. It is important to know whether a given prediction can be trusted, and the best methods therefore associate each prediction with a reliability score or index. For discrete qualitative predictions, the reliability is conventionally estimated as the difference between the output scores of selected classes. Such an approach is not feasible for methods that predict a biological feature as a single real value rather than a classification. As a solution to this challenge, we have implemented a method that predicts the relative surface accessibility of an amino acid and simultaneously predicts the reliability of each prediction, in the form of a Z-score.
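The conventional reliability estimate for discrete predictions mentioned above can be illustrated with a minimal sketch. It assumes that the "selected classes" are the two highest-scoring ones; the function name `margin_reliability` and the example scores are hypothetical and do not come from the paper.

```python
def margin_reliability(scores):
    """Return (predicted_class_index, reliability) for one set of class scores.

    Assumption: reliability is the gap between the highest and the
    second-highest output score. Other choices of classes are possible.
    """
    if len(scores) < 2:
        raise ValueError("need at least two class scores")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best, scores[best] - scores[runner_up]


# A clear winner gives a large margin, a near tie gives a small one.
confident = margin_reliability([0.90, 0.05, 0.05])
uncertain = margin_reliability([0.40, 0.35, 0.25])
```

A single real-valued output has no competing class score to subtract, which is why this margin cannot be reused directly for surface accessibility predictions.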
Year: 2009 PMID: 19646261 PMCID: PMC2725087 DOI: 10.1186/1472-6807-9-51
Source DB: PubMed Journal: BMC Struct Biol ISSN: 1472-6807
Figure 1. Graphical overview of the method. Graphical overview of the method used in training the primary and secondary neural networks. 'PSSM' is a Position-Specific Scoring Matrix. 'Sec. Structure' is the raw output from secondary structure predictions. 'Primary Networks' are an ensemble of artificial neural networks (ANNs), and 'B/E Classification' is the raw buried/exposed output from these ANNs. 'Secondary Networks' are also an ensemble of ANNs, trained to predict the relative surface exposure of an amino acid. The last box shows output from the web server.
Evaluated performance for the primary networks.
| Method | % Correct | MCC |
| NetSurfP Classification CB513 | 79.0 | 0.577 |
| Dor and Zhou [22] | 78.8 | - |
Evaluation of the best performing ANN ensemble using the evaluation set CB513. The columns are the overall percentage of correctly predicted buried and exposed amino acids (% Correct) and the Matthews correlation coefficient (MCC). The Dor and Zhou row gives the performance value published in [22].
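For readers unfamiliar with the two measures in this table, the following sketch computes them from binary buried/exposed labels using the standard definitions. The encoding (1 = exposed, 0 = buried), the function name, and the toy labels are assumptions for illustration, not the paper's evaluation code.

```python
import math


def percent_correct_and_mcc(predicted, observed):
    """Return (% correct, MCC) for two equal-length lists of 0/1 labels.

    Assumed encoding: 1 = exposed, 0 = buried.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    tp = tn = fp = fn = 0
    for p, o in zip(predicted, observed):
        if p and o:
            tp += 1
        elif p and not o:
            fp += 1
        elif not p and o:
            fn += 1
        else:
            tn += 1
    pct = 100.0 * (tp + tn) / len(predicted)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return pct, mcc


# Toy example: 3 TP, 3 TN, 1 FP, 1 FN.
pct, mcc = percent_correct_and_mcc(
    [1, 1, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 1, 1, 0],
)
```

Unlike % Correct, the MCC stays near zero for a predictor that always outputs the majority class, which is why the two are reported together.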
Evaluation of NetSurfP and other surface accessibility predictors.
| Method | Exposure | Train | CB513/CB511 | Algorithm |
| Ahmad | ASA | - | 0.48 | ANN |
| Yuan | ASA | - | 0.52 | SVR |
| Nguyen | ASA | - | 0.66 | Two-Stage SVR |
| Real-SPINE | ASA | 0.74 | 0.73 | ANN |
| Real-SPINE | RSA | - | 0.70 | ANN |
| NetSurfP | ASA | 0.75 | 0.72 | ANN |
| NetSurfP | RSA | 0.72 | 0.70 | ANN |
Performance is shown for five approaches to predicting absolute (ASA) and relative (RSA) surface accessibility. Methods included in the benchmark are Ahmad: [5], Yuan: [20], Nguyen: [24], Real-SPINE: [22], and NetSurfP: this work. Train gives the training performance, and CB513/CB511 gives the evaluation performance on the CB513 data set. The training performance of the Real-SPINE method and the evaluation performances of the Ahmad, Yuan, and Nguyen methods are taken from the corresponding publications. The last column gives the learning algorithm: ANN = artificial neural network, SVR = support vector regression. Pearson's correlation coefficients (PCC) are shown for all methods based on the absolute surface exposure of an amino acid. PCC values for relative surface exposure are also given for the two methods NetSurfP and Real-SPINE.
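The Pearson's correlation coefficient used throughout this table can be computed as in the sketch below, which follows the textbook definition. The function name and the toy values are illustrative only.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical predicted and measured exposure values for five residues.
predicted = [0.10, 0.35, 0.50, 0.70, 0.90]
measured = [0.05, 0.40, 0.45, 0.80, 0.85]
pcc = pearson(predicted, measured)
```

Because the PCC is invariant to linear rescaling, it measures how well the ranking and spread of exposure values are captured rather than their absolute calibration.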
Figure 2. The average error as a function of the predicted reliability. The left panel shows NetSurfP Z-score versus mean error, and the right panel shows the consistency reliability score versus mean error.
Figure 3. Histogram of mean error as a function of predicted exposure values. The bars show the histogram for four groups of predictions with high and low reliabilities: "high R" and "low R" for the consistency method and "high Z" and "low Z" for the NetSurfP method, where "high" is the 50% most reliable predictions according to the chosen reliability score, and "low" is the 50% least reliable predictions.
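The grouping behind this figure can be sketched as follows: predictions are split into the 50% most and 50% least reliable, and the mean error is then computed per bin of predicted exposure. The use of absolute error, the equal-width bins over [0, 1], and all names are assumptions; the paper's exact binning is not given here.

```python
def binned_mean_error(predicted, measured, reliability, n_bins=5):
    """Mean absolute error per predicted-exposure bin for the high and low
    reliability halves. Bins without data are reported as None."""
    order = sorted(range(len(predicted)),
                   key=lambda i: reliability[i], reverse=True)
    half = len(order) // 2
    groups = {"high": order[:half], "low": order[half:]}
    result = {}
    for name, indices in groups.items():
        sums = [0.0] * n_bins
        counts = [0] * n_bins
        for i in indices:
            # Clamp so that values outside [0, 1) fall into the edge bins.
            b = min(max(int(predicted[i] * n_bins), 0), n_bins - 1)
            sums[b] += abs(predicted[i] - measured[i])
            counts[b] += 1
        result[name] = [s / c if c else None for s, c in zip(sums, counts)]
    return result


# Toy data: the two most reliable predictions also have the smallest errors.
errors = binned_mean_error(
    predicted=[0.1, 0.5, 0.9, 0.3],
    measured=[0.1, 0.4, 0.5, 0.6],
    reliability=[2.0, 1.5, -1.0, -0.5],
    n_bins=2,
)
```

A useful reliability score should give a lower mean error in the "high" group than in the "low" group within every exposure bin, not only on average.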
Figure 4. Histogram of the number of predicted residues (A: Real-SPINE and B: NetSurfP) as a function of the predicted relative exposure value for all residues in the CB511 data set at different cut-offs. The full line shows the calculated (measured) exposure distribution of the full set. The distributions of the 25%, 50%, 75% and 80% most reliably predicted residues, ranked by consistency score for Real-SPINE (A) and by Z-score for NetSurfP (B), are also shown. The inset shows the number of predicted residues within a given threshold relative to all predictions, as a function of the predicted RSA.
Evaluation of the Real-SPINE and NetSurfP method on subsets of residues from the CB511 dataset predicted with high reliability.
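| %Top | N | Real-SPINE RSA | Real-SPINE ASA | Real-SPINE P-RSA | Real-SPINE M-RSA | NetSurfP RSA | NetSurfP ASA | NetSurfP P-RSA | NetSurfP M-RSA |
| 10 | 8372 | 0.73 | 0.74 | 0.16 | 0.18 | 0.77 | 0.79 | 0.35 | 0.35 |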
| 20 | 16745 | 0.73 | 0.74 | 0.16 | 0.18 | 0.79 | 0.79 | 0.31 | 0.31 |
| 25 | 20931 | 0.73 | 0.74 | 0.17 | 0.19 | 0.79 | 0.79 | 0.30 | 0.30 |
| 50 | 41863 | 0.72 | 0.74 | 0.18 | 0.20 | 0.77 | 0.77 | 0.28 | 0.28 |
| 75 | 62795 | 0.71 | 0.73 | 0.22 | 0.24 | 0.74 | 0.75 | 0.28 | 0.28 |
| 80 | 66981 | 0.71 | 0.73 | 0.23 | 0.25 | 0.73 | 0.74 | 0.28 | 0.28 |
| 90 | 75354 | 0.70 | 0.73 | 0.25 | 0.27 | 0.72 | 0.73 | 0.28 | 0.28 |
| 100 | 83727 | 0.70 | 0.73 | 0.27 | 0.29 | 0.70 | 0.72 | 0.29 | 0.29 |
%Top and N give the percentage and number of residues selected. RSA and ASA give the Pearson's correlation between predicted and target values for relative and absolute surface areas, respectively. P-RSA and M-RSA give the mean predicted and mean measured RSA values, respectively, on the selected subset of residues.
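One way to produce rows of this kind is sketched below: rank residues by reliability, keep the top percentage, and report the subset size, the PCC, and the mean predicted and measured RSA. The function names and toy data are hypothetical. Rounding the subset size down is an assumption, although it reproduces the N column (for example, 10% of 83727 gives 8372).

```python
import math


def subset_size(total, percent):
    """Number of residues in the top `percent` (integer), rounded down."""
    return total * percent // 100


def evaluate_top_percent(predicted, measured, reliability, percent):
    """Evaluate the `percent` most reliable predictions."""
    n = subset_size(len(predicted), percent)
    if n < 2:
        raise ValueError("subset too small to compute a correlation")
    order = sorted(range(len(predicted)),
                   key=lambda i: reliability[i], reverse=True)[:n]
    p = [predicted[i] for i in order]
    m = [measured[i] for i in order]
    mean_p = sum(p) / n
    mean_m = sum(m) / n
    sxy = sum((a - mean_p) * (b - mean_m) for a, b in zip(p, m))
    sxx = sum((a - mean_p) ** 2 for a in p)
    syy = sum((b - mean_m) ** 2 for b in m)
    # PCC is undefined if either subset is constant; report None in that case.
    pcc = sxy / math.sqrt(sxx * syy) if sxx and syy else None
    return {"n": n, "pcc": pcc, "p_rsa": mean_p, "m_rsa": mean_m}


# Toy data: the three most reliable predictions are also the most accurate.
row = evaluate_top_percent(
    predicted=[0.1, 0.2, 0.6, 0.8, 0.5, 0.3],
    measured=[0.1, 0.3, 0.5, 0.9, 0.1, 0.9],
    reliability=[3.0, 2.5, 2.0, 1.5, -1.0, -2.0],
    percent=50,
)
```

Comparing `p_rsa` with `m_rsa` across cut-offs shows whether a reliability score preferentially selects buried or exposed residues, which complements the correlation columns.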
Figure 5. Reliability baseline and standard deviation fitting. The reliability is shown as a function of the predicted exposure for the Cull-1764 data set. The fitted reliability baseline and standard deviation are shown in grey. The inset shows the baseline-corrected Z-scores as a function of the predicted surface exposure.
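The idea of a baseline-corrected Z-score can be sketched as subtracting an exposure-dependent baseline from the raw reliability and dividing by an exposure-dependent standard deviation. The functional form fitted in the paper is not given in this record, so the sketch substitutes a per-bin mean and standard deviation as a stand-in. All names, the bin count, and the toy data are assumptions.

```python
import statistics


def bin_index(exposure, n_bins):
    """Equal-width bin over [0, 1], clamped at the edges."""
    return min(max(int(exposure * n_bins), 0), n_bins - 1)


def fit_binned_baseline(predicted, raw_reliability, n_bins=10):
    """Per-bin (mean, standard deviation) of the raw reliability.

    Stand-in for the fitted baseline and standard deviation curves.
    """
    bins = [[] for _ in range(n_bins)]
    for p, r in zip(predicted, raw_reliability):
        bins[bin_index(p, n_bins)].append(r)
    return [
        (statistics.mean(v), statistics.pstdev(v)) if len(v) >= 2 else None
        for v in bins
    ]


def corrected_z(exposure, raw, table):
    """Z-score of one raw reliability value relative to its exposure bin."""
    entry = table[bin_index(exposure, len(table))]
    if entry is None or entry[1] == 0:
        return None  # not enough data to normalise this bin
    baseline, sd = entry
    return (raw - baseline) / sd


# Toy data in which raw reliability is ten times larger for exposed residues.
predicted = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
raw = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
table = fit_binned_baseline(predicted, raw, n_bins=2)
z_buried = corrected_z(0.2, 3.0, table)
z_exposed = corrected_z(0.8, 30.0, table)
```

In the toy data both residues receive the same Z-score even though their raw reliabilities differ tenfold, which is the purpose of the correction: scores become comparable across the exposure range.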