Viola Volpato, Badr Alshomrani, Gianluca Pollastri.
Abstract
Intrinsically-disordered regions lack a well-defined 3D structure, but play key roles in determining the function of many proteins. Although predictors of disorder have been shown to achieve relatively high rates of correct classification of these segments, improvements over the years have been slow, and accurate methods are needed that are capable of accommodating the ever-increasing amount of structurally-determined protein sequences to try to boost predictive performance. In this paper, we propose a predictor for short disordered regions based on bidirectional recurrent neural networks and tested by rigorous five-fold cross-validation on a large, non-redundant dataset collected from MobiDB, a new comprehensive source of protein disorder annotations. The system exploits sequence and structural information in the form of frequency profiles, predicted secondary structure and solvent accessibility, and direct disorder annotations from homologous protein structures (templates) deposited in the Protein Data Bank. The contributions of sequence, structure and homology information result in large improvements in predictive accuracy. Additionally, the large scale of the training set leads to low false positive rates, making our systems a robust and efficient way to address high-throughput disorder prediction.
Keywords: BRNN; MobiDB; PDB; protein disorder
Year: 2015 PMID: 26307973 PMCID: PMC4581330 DOI: 10.3390/ijms160819868
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Results of Sensitivity (Sens), Specificity (Spec), Precision (Prec), Accuracy (Acc) and Matthews’ correlation coefficient (MCC) on the training sets for all of our systems.
| Predictor | Sens | Spec | Prec | Acc | MCC |
|---|---|---|---|---|---|
| MSA | 0.867 | 0.847 | 0.268 | 0.857 | 0.430 |
| MSA-SS-SA | 0.534 | 0.983 | 0.671 | 0.732 | 0.568 |
| MSA-Templ | 0.584 | 0.983 | 0.692 | 0.783 | 0.615 |
| MSA-SS-SA-Templ | 0.606 | 0.982 | 0.684 | 0.794 | 0.622 |
| NN | 0.698 | 0.877 | 0.220 | 0.632 | 0.377 |
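The column headings in these tables can be related to confusion-matrix counts with a minimal sketch. It assumes the standard textbook definitions, with disordered residues as the positive class. The function name `binary_metrics` and the example counts are hypothetical, and because the formula used for Acc is not stated here, both plain and balanced accuracy are computed.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Per-residue metrics from confusion counts (disordered = positive class)."""

    def ratio(num, den):
        # Guard against empty classes: return 0.0 rather than divide by zero.
        return num / den if den else 0.0

    sens = ratio(tp, tp + fn)                   # sensitivity (recall)
    spec = ratio(tn, tn + fp)                   # specificity
    prec = ratio(tp, tp + fp)                   # precision
    acc = ratio(tp + tn, tp + fp + tn + fn)     # plain accuracy
    bal_acc = (sens + spec) / 2                 # balanced accuracy (assumed alternative)
    mcc = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {"sens": sens, "spec": spec, "prec": prec,
            "acc": acc, "bal_acc": bal_acc, "mcc": mcc}


# Hypothetical counts for illustration only; not taken from the paper.
metrics = binary_metrics(tp=60, fp=20, tn=900, fn=40)
print({name: round(value, 3) for name, value in metrics.items()})
```

When positives are rare, precision can stay low even if sensitivity and specificity are both high, which is consistent with the MSA row above.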
Results of Sensitivity (Sens), Specificity (Spec), Precision (Prec), Accuracy (Acc) and MCC on the test set for all of our systems.
| Predictor | Sens | Spec | Prec | Acc | MCC |
|---|---|---|---|---|---|
| MSA | 0.818 | 0.871 | 0.284 | 0.845 | 0.432 |
| MSA-SS-SA | 0.535 | 0.982 | 0.657 | 0.738 | 0.570 |
| MSA-Templ | 0.582 | 0.982 | 0.676 | 0.782 | 0.605 |
| MSA-SS-SA-Templ | 0.603 | 0.981 | 0.672 | 0.792 | 0.615 |
| NN | 0.693 | 0.847 | 0.218 | 0.628 | 0.372 |
Results of AUC_ROC and AUC_PR on the test set for all systems.
| Predictor | AUC_ROC | AUC_PR |
|---|---|---|
| MSA | 0.919 | 0.634 |
| MSA-SS-SA | 0.913 | 0.614 |
| MSA-Templ | 0.920 | 0.658 |
| MSA-SS-SA-Templ | 0.925 | 0.661 |
| NN | 0.914 | 0.617 |
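As a rough illustration of the two summary scores in this table, the sketch below computes an area under the ROC curve from its rank interpretation and uses average precision as a step-wise estimate of the area under the precision-recall curve. Both function names are hypothetical, the toy labels and scores are invented, and the paper may have computed its areas differently (for example by trapezoidal integration).

```python
def roc_auc(labels, scores):
    """Probability that a random positive scores above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            total += true_pos / rank   # precision at this recall step
    return total / n_pos


# Toy data for illustration only (1 = disordered residue).
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(roc_auc(labels, scores), average_precision(labels, scores))
```

The pairwise ROC computation is quadratic in the number of residues, so it is only suitable for small examples; tied scores are not given special treatment in the average-precision estimate.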
Figure 1. Receiver-operating characteristic curves for all of our systems on X-ray test set data.
Figure 2. Precision-recall curves for all of our systems on X-ray test set data.
Figure 3. Receiver-operating characteristic curve for X-ray test set data showing the FPR in the region from 0% to 14% false positives for all of our methods. The vertical line represents the decision thresholds corresponding to a predicted 5% FPR.
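One way to obtain a decision threshold for a target false positive rate such as the 5% mentioned in the Figure 3 caption is to rank the scores of ordered (negative) residues and cut just above the allowed number of false positives. The sketch below is an assumption about how such a threshold could be chosen, not the authors' procedure; the function names and the synthetic scores are hypothetical.

```python
def threshold_for_fpr(negative_scores, target_fpr=0.05):
    """Return a threshold t such that predicting 'disordered' when
    score > t yields an FPR no greater than target_fpr on these negatives."""
    if not negative_scores:
        raise ValueError("need at least one negative example")
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))   # max false positives permitted
    if allowed >= len(ranked):
        return float("-inf")                  # every negative may be flagged
    # At most `allowed` scores lie strictly above this value.
    return ranked[allowed]


def false_positive_rate(negative_scores, threshold):
    """Fraction of negatives whose score exceeds the threshold."""
    return sum(s > threshold for s in negative_scores) / len(negative_scores)


# Synthetic scores for ordered residues; illustration only.
negatives = [i / 100 for i in range(100)]
t = threshold_for_fpr(negatives, target_fpr=0.05)
print(t, false_positive_rate(negatives, t))
```

A threshold chosen this way on one data set only predicts the FPR on another, so the rate realised on the test set can differ from the target.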
The number of inputs per component for a given residue in each of the six systems. SS, secondary structure; SA, solvent accessibility; Templ, template.
| System | MSA | SS | SA | Templ | Total |
|---|---|---|---|---|---|
| MSA | 21 | 0 | 0 | 0 | 21 |
| MSA-SS-SA | 21 | 3 | 4 | 0 | 28 |
| MSA-Templ | 21 | 0 | 0 | 3 | 24 |
| MSA-SS-SA-Templ | 21 | 3 | 4 | 3 | 31 |
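The totals in this table are consistent with a per-residue input built by concatenating a 21-value frequency profile with optional 3-value secondary structure, 4-value solvent accessibility and 3-value template components. The sketch below illustrates that reading; the function name, the argument names and the placeholder values are hypothetical, and the actual encoding of each component is not described here.

```python
# Component sizes as read from the table (an interpretation, not a specification).
N_PROFILE, N_SS, N_SA, N_TEMPL = 21, 3, 4, 3


def residue_input(profile, ss=None, sa=None, templ=None):
    """Concatenate the components available for one residue into one vector."""
    parts = [("profile", profile, N_PROFILE)]
    for name, vec, size in (("ss", ss, N_SS), ("sa", sa, N_SA), ("templ", templ, N_TEMPL)):
        if vec is not None:                   # optional components may be omitted
            parts.append((name, vec, size))

    features = []
    for name, vec, size in parts:
        if len(vec) != size:
            raise ValueError(f"{name} must have {size} values, got {len(vec)}")
        features.extend(vec)
    return features


# Placeholder values for illustration only.
profile = [1 / 21] * 21
ss = [1.0, 0.0, 0.0]
sa = [0.0, 1.0, 0.0, 0.0]
templ = [0.0, 0.0, 1.0]

print(len(residue_input(profile)))                 # profile only
print(len(residue_input(profile, ss, sa)))         # profile + SS + SA
print(len(residue_input(profile, templ=templ)))    # profile + template
print(len(residue_input(profile, ss, sa, templ)))  # all components
```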
Figure 4. Distribution of the area under the ROC curve as a function of sequence identity of the examples to the best template that could be found. Black bars represent template-based results (MSA-Templ) and white bars sequence-based results (MSA) (top). The number of proteins with templates within given ranges of sequence identity. Black bins represent average sequence identity; grey bins identity to the best template (bottom).
Results for Sens, Spec and AUC on X-ray test sets for all of our systems and the top-ranking methods as reported in Walsh et al. (2012) [34].
| Predictor | Sens | Spec | AUC_ROC |
|---|---|---|---|
| MSA-SS-SA-Templ | 0.603 | 0.981 | 0.925 |
| MSA-Templ | 0.582 | 0.982 | 0.920 |
| MSA-SS-SA | 0.535 | 0.982 | 0.913 |
| MSA | 0.818 | 0.871 | 0.919 |
| CSpritz | 0.796 | 0.850 | 0.899 |
| ESpritz | 0.773 | 0.856 | 0.891 |
| SSpritzP | 0.765 | 0.870 | 0.889 |
| ESpritzP | 0.775 | 0.853 | 0.888 |
| MULTICOM | 0.820 | 0.804 | 0.888 |
| PONDR-FIT | 0.692 | 0.867 | 0.861 |
| IUPred(short) | 0.540 | 0.949 | 0.847 |
| Disopred | 0.565 | 0.939 | 0.839 |
Performance on the termini-trimmed targets, by MCC and AUC score, for all of our versions and for the top five CASP10 methods, derived from Monastyrskyy et al. (2014) [55].
| Predictor | MCC | AUC_ROC |
|---|---|---|
| MSA-SS-SA + templ95_94_internal | 0.480 | 0.873 |
| MSA-SS-SA + templ95_85_internal | 0.476 | 0.869 |
| MSA-templ95_94_internal | 0.434 | 0.865 |
| MSA-templ95_85_internal | 0.426 | 0.860 |
| MSA-SS-SA + templ50_85_internal | 0.405 | 0.851 |
| DISOPRED3 | 0.405 | 0.850 |
| MSA-SS-SA + templ50_94_internal | 0.400 | 0.857 |
| MSA-SS-SA_94 | 0.393 | 0.845 |
| MSA-SS-SA_85 | 0.389 | 0.840 |
| MSA_94 | 0.377 | 0.831 |
| Prods-CNF | 0.375 | 0.865 |
| Biomine_dr_mixed | 0.370 | 0.850 |
| MSA_85 | 0.368 | 0.821 |
| MSA-templ50_94_internal | 0.345 | 0.833 |
| MSA-templ50_85_internal | 0.334 | 0.831 |
| DisMeta | 0.325 | 0.625 |
| Biomine_dr_pdb_c | 0.315 | 0.850 |