Jesse Eickholt, Jianlin Cheng.
Abstract
BACKGROUND: In recent years, the use and importance of predicted protein residue-residue contacts have grown considerably, with demonstrated applications such as drug design, protein tertiary structure prediction and model quality assessment. Nevertheless, reported accuracies in the range of 25-35% stubbornly remain the norm for sequence-based, long-range contact predictions on hard targets. This is in spite of a prolonged effort by the community to improve the performance of residue-residue contact prediction. A thorough study of the quality of current residue-residue contact predictions and the evaluation metrics used, as well as an analysis of current methods, is needed to stimulate further advancement in contact prediction and its application. Such a study will better explain the quality and nature of residue-residue contact predictions generated by current methods and, as a result, lead to better use of this contact information.
Year: 2013 PMID: 24267585 PMCID: PMC3850995 DOI: 10.1186/1471-2105-14-S14-S12
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Performance on HARD CASP10 targets
| Method (GroupID) | Top L/10 Long, Acc. (SE) | Top L/10 Medium, Acc. (SE) | Top L/5 Long, Acc. (SE) | Top L/5 Medium, Acc. (SE) | Top L Long, Acc. (SE) | Top L Medium, Acc. (SE) |
|---|---|---|---|---|---|---|
| IBGteam [DL] (305) | 0.263 (0.066) | 0.356(0.085) | 0.208 (0.050) | 0.292(0.063) | 0.117 (0.024) | 0.180(0.027) |
| DNcon (222) | 0.244 (0.039) | 0.442(0.072) | 0.207 (0.029) | 0.346(0.056) | 0.128 (0.016) | 0.206(0.034) |
| RandomForest (396) | 0.228 (0.063) | 0.336(0.066) | 0.193 (0.057) | 0.283(0.047) | 0.122 (0.020) | 0.159(0.019) |
| RandomForest (257) | 0.208 (0.067) | 0.336(0.066) | 0.195 (0.091) | 0.283(0.047) | 0.119 (0.091) | 0.159(0.019) |
| RaptorX-Roll (358) | 0.146 (0.034) | 0.412(0.076) | 0.164 (0.028) | 0.344(0.058) | 0.105 (0.026) | 0.271(0.041) |
| PLCT (332) | 0.116 (0.027) | 0.329(0.070) | 0.095(0.016) | 0.275(0.050) | 0.073(0.008) | 0.173(0.024) |
| SVM (81) | 0.087 (0.03) | 0.253(0.061) | 0.090 (0.03) | 0.243(0.051) | 0.069 (0.020) | 0.162(0.023) |
| 1d-rec. NN (125) | 0.075 (0.022) | 0.338(0.076) | 0.067 (0.017) | 0.290(0.055) | 0.046 (0.009) | 0.183(0.031) |
Accuracies for the top L/10, L/5 and L medium- and long-range contact predictions for 13 hard targets. L is the length of the protein. Standard error estimates are given in parentheses.
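The metric in the table above can be illustrated with a minimal sketch. It assumes the usual CASP convention for sequence separation (medium range: 12 to 23 residues apart, long range: 24 or more), which this record does not itself state, and it assumes that the standard error is taken over per-target accuracies. The function names and the `(i, j, score)` input layout are hypothetical and are not taken from the paper.

```python
import math


def top_fraction_accuracy(predictions, true_contacts, length, divisor,
                          min_sep, max_sep=None):
    """Accuracy of the top L/divisor predictions in a separation range.

    predictions   : iterable of (i, j, score) tuples (hypothetical layout)
    true_contacts : set of (i, j) tuples with i < j
    length        : protein length L
    divisor       : 10 for top L/10, 5 for top L/5, 1 for top L
    min_sep/max_sep : assumed sequence-separation bounds, e.g. 12..23 for
                      medium range and 24..None for long range
    """
    in_range = [
        (i, j, s) for i, j, s in predictions
        if abs(i - j) >= min_sep and (max_sep is None or abs(i - j) <= max_sep)
    ]
    # Rank by predicted score, most confident first.
    in_range.sort(key=lambda p: p[2], reverse=True)
    selected = in_range[:max(1, length // divisor)]
    if not selected:
        return 0.0
    hits = sum(1 for i, j, _ in selected
               if (min(i, j), max(i, j)) in true_contacts)
    # Assumption: divide by the number of predictions actually selected.
    return hits / len(selected)


def mean_and_standard_error(values):
    """Mean and standard error of the mean over per-target accuracies."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, 0.0
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)
```

Under these assumptions, a table cell such as `0.263 (0.066)` would correspond to calling `mean_and_standard_error` on the per-target values returned by `top_fraction_accuracy`.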
Performance on CASP10 targets
| Method (GroupID) | Top L/10 Long, Acc. (SE) | Top L/10 Medium, Acc. (SE) | Top L/5 Long, Acc. (SE) | Top L/5 Medium, Acc. (SE) | Top L Long, Acc. (SE) | Top L Medium, Acc. (SE) |
|---|---|---|---|---|---|---|
| RandomForest (396) | 0.356 (0.030) | 0.455(0.027) | 0.314 (0.026) | 0.380(0.024) | 0.175 (0.013) | 0.201(0.013) |
| DNcon (222) | 0.354 (0.027) | 0.457(0.027) | 0.304 (0.022) | 0.381(0.022) | 0.176 (0.012) | 0.218(0.012) |
| IBGteam [DL] (305) | 0.352 (0.029) | 0.422(0.027) | 0.298 (0.025) | 0.355(0.023) | 0.161 (0.013) | 0.197(0.011) |
| RandomForest (257) | 0.347 (0.030) | 0.455(0.027) | 0.298 (0.025) | 0.380(0.024) | 0.172 (0.013) | 0.201(0.013) |
| RaptorX-Roll (358) | 0.331 (0.026) | 0.469(0.022) | 0.287 (0.044) | 0.401(0.020) | 0.183 (0.013) | 0.313(0.013) |
| 1d-rec. NN (125) | 0.252 (0.028) | 0.391(0.029) | 0.209 (0.032) | 0.329(0.025) | 0.110 (0.011) | 0.189(0.013) |
| SVM (81) | 0.216 (0.023) | 0.347(0.025) | 0.192 (0.019) | 0.297(0.022) | 0.120 (0.011) | 0.178(0.011) |
| PLCT (332) | 0.142(0.018) | 0.369(0.027) | 0.123 (0.014) | 0.304(0.021) | 0.086 (0.008) | 0.174(0.011) |
Accuracies for the top L/10, L/5 and L medium- and long-range contact predictions for 96 CASP10 targets. L is the length of the protein. Standard error estimates are given in parentheses.
Performance on HARD CASP10 targets using neighbourhoods (δ)
| Method | δ | Top L/10 Long, Acc. (SE) | Top L/10 Medium, Acc. (SE) | Top L/5 Long, Acc. (SE) | Top L/5 Medium, Acc. (SE) |
|---|---|---|---|---|---|
| DNcon (222) | 1 | 0.484 (0.067) | 0.612 (0.069) | 0.438 (0.055) | 0.559 (0.057) |
| IBGteam [DL] (305) | 1 | 0.450 (0.103) | 0.546 (0.104) | 0.395 (0.086) | 0.507 (0.081) |
| RandomForest (396) | 1 | 0.412 (0.081) | 0.505 (0.082) | 0.385 (0.067) | 0.481 (0.058) |
| RandomForest (257) | 1 | 0.377 (0.078) | 0.505 (0.082) | 0.365 (0.066) | 0.481 (0.068) |
| RaptorX-Roll (358) | 1 | 0.349 (0.067) | 0.626 (0.084) | 0.378 (0.062) | 0.591 (0.075) |
| SVM (81) | 1 | 0.251 (0.050) | 0.464 (0.087) | 0.243 (0.051) | 0.440 (0.073) |
| DNcon (222) | 2 | 0.619 (0.081) | 0.726 (0.058) | 0.563 (0.062) | 0.674 (0.052) |
| IBGteam [DL] (305) | 2 | 0.500 (0.112) | 0.635 (0.093) | 0.427 (0.097) | 0.592 (0.074) |
| RandomForest (396) | 2 | 0.527 (0.079) | 0.591 (0.080) | 0.486 (0.065) | 0.569 (0.058) |
| RaptorX-Roll (358) | 2 | 0.464 (0.075) | 0.692 (0.081) | 0.471 (0.068) | 0.672 (0.072) |
| RandomForest (257) | 2 | 0.470 (0.078) | 0.591 (0.080) | 0.456 (0.066) | 0.569 (0.058) |
| SVM (81) | 2 | 0.409 (0.082) | 0.566 (0.086) | 0.371 (0.071) | 0.537 (0.074) |
Accuracies for the top L/10 and L/5 medium- and long-range contact predictions for 13 hard targets. L is the length of the protein. Standard error estimates are given in parentheses. δ is the size of the neighbourhood.
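This record gives δ only as "the size of the neighbourhood". One common reading, assumed here and not confirmed by the source, is that a predicted pair (i, j) counts as correct if a true contact exists at some (i + a, j + b) with |a| ≤ δ and |b| ≤ δ. The sketch below implements that reading; the function names are hypothetical.

```python
def is_correct_within_neighbourhood(i, j, true_contacts, delta):
    """True if any true contact lies within +/- delta of both residues.

    Assumption: the neighbourhood is a square window around (i, j).
    true_contacts is a set of (i, j) tuples with i < j.
    """
    for di in range(-delta, delta + 1):
        for dj in range(-delta, delta + 1):
            a, b = i + di, j + dj
            if (min(a, b), max(a, b)) in true_contacts:
                return True
    return False


def neighbourhood_accuracy(selected_pairs, true_contacts, delta):
    """Fraction of selected (i, j) pairs that are correct within delta."""
    if not selected_pairs:
        return 0.0
    hits = sum(
        is_correct_within_neighbourhood(i, j, true_contacts, delta)
        for i, j in selected_pairs
    )
    return hits / len(selected_pairs)
```

With δ = 0 this reduces to exact-match accuracy, which is consistent with the scores in the table rising as δ goes from 1 to 2.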
Performance on CASP10 targets using neighbourhoods (δ)
| Method | δ | Top L/10 Long, Acc. (SE) | Top L/10 Medium, Acc. (SE) | Top L/5 Long, Acc. (SE) | Top L/5 Medium, Acc. (SE) |
|---|---|---|---|---|---|
| DNcon (222) | 1 | 0.580 (0.032) | 0.674 (0.029) | 0.526 (0.029) | 0.623 (0.026) |
| IBGteam [DL] (305) | 1 | 0.555 (0.036) | 0.648 (0.030) | 0.491 (0.033) | 0.609 (0.027) |
| RandomForest (396) | 1 | 0.534 (0.036) | 0.671 (0.030) | 0.504 (0.032) | 0.628 (0.028) |
| RaptorX-Roll (358) | 1 | 0.529 (0.031) | 0.731 (0.025) | 0.490 (0.030) | 0.680 (0.024) |
| RandomForest (257) | 1 | 0.526 (0.036) | 0.671 (0.030) | 0.484 (0.032) | 0.680 (0.024) |
| SVM (81) | 1 | 0.394 (0.032) | 0.598 (0.033) | 0.365 (0.028) | 0.542 (0.029) |
| DNcon (222) | 2 | 0.663 (0.032) | 0.749 (0.027) | 0.615 (0.029) | 0.720 (0.024) |
| RandomForest (396) | 2 | 0.609 (0.035) | 0.734 (0.027) | 0.577 (0.032) | 0.705 (0.026) |
| IBGteam [DL] (305) | 2 | 0.607 (0.037) | 0.729 (0.027) | 0.555 (0.034) | 0.695 (0.025) |
| RaptorX-Roll (358) | 2 | 0.606 (0.031) | 0.801 (0.023) | 0.540 (0.029) | 0.764 (0.022) |
| RandomForest (257) | 2 | 0.597 (0.035) | 0.734 (0.027) | 0.561 (0.031) | 0.723 (0.026) |
| SVM (81) | 2 | 0.484 (0.034) | 0.681 (0.032) | 0.451 (0.034) | 0.644 (0.029) |
Accuracies for the top L/10 and L/5 medium- and long-range contact predictions for 96 CASP10 targets. L is the length of the protein. Standard error estimates are given in parentheses. δ is the size of the neighbourhood.
Performance on CASP10 targets with clustering and neighbourhoods (δ = 2)
| Method | Top L/10 long range, Acc. (SE) | Top L/10 long range, Cluster count | Top L/5 long range, Acc. (SE) | Top L/5 long range, Cluster count |
|---|---|---|---|---|
| DNcon(222) | 0.583(0.030) | 666 | 0.520(0.025) | 1018 |
| RaptorX-Roll(358) | 0.524(0.030) | 596 | 0.476(0.027) | 917 |
| IBGteam [DL] (305) | 0.503(0.037) | 408 | 0.391(0.034) | 662 |
| RandomForest (396) | 0.477(0.035) | 577 | 0.441(0.031) | 895 |
| RandomForest(257) | 0.455(0.034) | 627 | 0.415(0.030) | 907 |
| SVM(81) | 0.416(0.034) | 596 | 0.345(0.027) | 936 |
Accuracies for the top L/10 and L/5 long-range contact predictions for 96 CASP10 targets. L is the length of the protein. Standard error estimates are given in parentheses. δ is the size of the neighbourhood. Cluster count is the number of clusters identified by the method.
Figure 1. ROC curve for Top L predictions on CASP10 protein targets.
Figure 2. ROC curve for Top L/5 predictions on CASP10 protein targets.
AUC for the top L predictions on CASP10 targets
| Method | AUC |
|---|---|
| IBGteam [DL] (305) | 0.753 |
| RandomForest (396) | 0.718 |
| RaptorX-Roll(358) | 0.710 |
| RandomForest(257) | 0.708 |
| DNcon(222) | 0.701 |
| SVM(81) | 0.620 |
Area under the ROC curve (AUC), calculated using the top L predictions per protein over 96 CASP10 targets. L is the length of the protein.
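As an illustration of the AUC values reported above, the sketch below computes the area under the ROC curve from a pooled list of prediction scores and binary contact labels. It uses the rank-based (Mann-Whitney) formulation, which equals the area under the empirical ROC curve. How the paper pooled predictions across targets is not stated in this record, so the pooling and the function name are assumptions.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores : predicted contact scores (higher = more confident)
    labels : 1/True for a true contact, 0/False otherwise
    Ties between a positive and a negative count as half a win.
    Runs in O(P * N) time, which is fine for a small illustration.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of true contacts above non-contacts.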
AUC for the top L/5 predictions on CASP10 targets
| Method | AUC |
|---|---|
| IBGteam [DL] (305) | 0.759 |
| RandomForest(257) | 0.754 |
| RandomForest (396) | 0.748 |
| RaptorX-Roll(358) | 0.721 |
| DNcon(222) | 0.719 |
| SVM(81) | 0.658 |
Area under the ROC curve (AUC), calculated using the top L/5 predictions per protein over 96 CASP10 targets. L is the length of the protein.
Performance of DN ensembles composed of varying architectures on CASP9 targets
| Architecture | Top L/5 Long, Acc. (SE) | Top L/5 Medium, Acc. (SE) | Top L Long, Acc. (SE) | Top L Medium, Acc. (SE) |
|---|---|---|---|---|
| 500-500-500-350-1 | 0.174(.012) | 0.245(.014) | 0.113(.006) | 0.144(.008) |
| 750-500-350-1 | 0.162(.012) | 0.231(.013) | 0.101(.006) | 0.137(.007) |
| 500-500-350-1 | 0.182(.012) | 0.24(.013) | 0.122(.006) | 0.150(.007) |
| 500-250-1 | 0.159(.012) | 0.243(.131) | 0.107(.006) | 0.142(.007) |
| 250-250-1 | 0.169(.010) | 0.236(.012) | 0.108(.006) | 0.142(.007) |
Feature groups used in feature assessment and description
| Name | Features included |
|---|---|
| seq | Residue type (one-hot encoded) |
| atch | Atchley factors |
| bins | The separation in sequence between the residue-residue pair (one-hot encoded) |
| globs | Contact potentials, relative position and percentage of helix, loop, beta sheet and exposed residues |
| pssm-ssa | Information from the PSSM and predicted secondary structure and solvent accessibility (one-hot encoded) |
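To make the encoded feature groups above concrete, here is a minimal sketch of one-hot encoding for a residue type and for a binned sequence separation. The 20-letter alphabet order and the bin edges are illustrative assumptions; the actual alphabet ordering and bin boundaries used for the `seq` and `bins` groups are not given in this record.

```python
# Assumed ordering of the 20 standard amino acids (illustrative only).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_residue(residue):
    """One-hot vector of length 20; all zeros for unknown residues."""
    vec = [0] * len(AMINO_ACIDS)
    idx = AMINO_ACIDS.find(residue.upper()) if len(residue) == 1 else -1
    if idx >= 0:
        vec[idx] = 1
    return vec


def one_hot_separation(i, j, bin_edges=(6, 12, 24, 48)):
    """One-hot vector over sequence-separation bins.

    bin_edges are hypothetical lower bounds; separations below the first
    edge fall in bin 0, those at or above the last edge in the final bin.
    """
    sep = abs(i - j)
    vec = [0] * (len(bin_edges) + 1)
    bin_idx = sum(1 for edge in bin_edges if sep >= edge)
    vec[bin_idx] = 1
    return vec
```

For example, `one_hot_residue("C")` sets only the second position, and each residue pair contributes one such vector per residue plus one separation vector to the input.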
Performance of a DN ensemble trained on different groups of features on CASP9 targets
| Feature set(s) | Top L/5 Long, Acc. (SE) | Top L/5 Medium, Acc. (SE) | Top L Long, Acc. (SE) | Top L Medium, Acc. (SE) |
|---|---|---|---|---|
| seq | 0.040(.005) | 0.042(.004) | 0.035(.003) | 0.040(.003) |
| seq-atch | 0.088(.006) | 0.075(.006) | 0.077(.005) | 0.074(.003) |
| seq-atch-bins | 0.078(.005) | 0.092(.006) | 0.077(.005) | 0.085(.005) |
| seq-atch-bins-globs | 0.142(.01) | 0.202(.007) | 0.100(.005) | 0.130(.007) |
| seq-atch-bins-globs-pssm-ssa | 0.157(.011) | 0.221(.013) | 0.106(.006) | 0.132(.007) |
| pssm-atch | 0.168(.012) | 0.236(.014) | 0.110(.006) | 0.130(.007) |
| ALL | 0.182(.012) | 0.240(.013) | 0.122(.006) | 0.150(.007) |