Yumeng Liu, Xiaolong Wang, Bin Liu.
Abstract
Accurate prediction of intrinsically disordered proteins/regions (IDPs/IDRs) is one of the most important tasks in bioinformatics, and a number of computational predictors have been proposed for it. Because the distribution of disordered regions shows global sequence patterns, efficiently incorporating the sequence-order effect is critical for constructing an accurate predictor. To capture these patterns, several sequence labelling models, such as conditional random fields (CRFs), have been applied to this field. However, these methods suffer from certain disadvantages. In this study, we propose a new computational predictor called IDP–CRF, which is trained on an updated benchmark dataset based on the MobiDB and DisProt databases and incorporates more comprehensive sequence-based features, including position-specific scoring matrices (PSSMs), kmer, predicted secondary structures, and relative solvent accessibilities. Experimental results on the benchmark dataset and two independent datasets show that IDP–CRF outperforms 25 existing state-of-the-art methods in this field, indicating that it is a useful tool for identifying IDPs/IDRs. We anticipate that IDP–CRF will facilitate the development of protein sequence analysis.
Keywords: PSSMs; conditional random fields (CRFs); intrinsically disordered proteins/regions; kmer; relative solvent accessibility; secondary structure
Year: 2018 PMID: 30135358 PMCID: PMC6164615 DOI: 10.3390/ijms19092483
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Performance of IDP–CRF (intrinsically disordered protein–conditional random field) and three classification-based predictors trained with different ratios of disordered to ordered residues. The three classification-based predictors are a random forest (RF) predictor, an artificial neural network (ANN) predictor, and a support vector machine (SVM) predictor. MCC denotes the Matthews correlation coefficient.
Performance comparison of IDP–CRF (intrinsically disordered protein–conditional random field) and three classification-based predictors by using 5-fold cross-validation.
| Methods | Ratio a | Sn b | Sp c | ACC d | MCC e |
|---|---|---|---|---|---|
| IDP–CRF | 1:2 | 0.637 | 0.910 | 0.774 | 0.462 |
| RF | 1:2 | 0.524 | 0.928 | 0.726 | 0.416 |
| SVM | 1:2 | 0.543 | 0.896 | 0.720 | 0.366 |
| ANN | 1:2 | 0.537 | 0.897 | 0.717 | 0.363 |
a Ratio of disordered to ordered residues in the training dataset. b Sensitivity. c Specificity. d Balanced accuracy. e Matthews correlation coefficient.
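The four metrics in the table above can be derived from per-residue confusion counts. The following is a minimal sketch that assumes the standard definitions, with disordered residues as the positive class; the record itself does not spell out the formulas, and the function name `disorder_metrics` and the example counts are hypothetical.

```python
import math


def disorder_metrics(tp, fp, tn, fn):
    """Compute Sn, Sp, balanced ACC and MCC from per-residue counts.

    Assumption: disordered residues are the positive class.
    tp/fn count disordered residues, tn/fp count ordered residues.
    """
    sn = tp / (tp + fn)          # sensitivity: disordered residues recovered
    sp = tn / (tn + fp)          # specificity: ordered residues recovered
    acc = (sn + sp) / 2          # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; return 0.0 in that case.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = disorder_metrics(tp=60, fp=20, tn=180, fn=40)
```

Under this definition the ACC column is the mean of the Sn and Sp columns, which matches the tabulated rows (for example, (0.637 + 0.910) / 2 ≈ 0.774 for IDP–CRF).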
Figure 2. A schematic view of protein 3H2YA with IDRs predicted by IDP–CRF and three classification-based predictors, where residues in red are disordered and residues in yellow are ordered. (a) IDRs predicted by IDP–CRF are: (1, 32). (b) Actual IDRs are: (1, 55) and (199, 202). (c) IDRs predicted by the RF predictor are: (1, 3) and (365, 368). (d) IDRs predicted by the SVM predictor are: (1, 3), (16, 21), (28, 29), (31, 31), (170, 172) and (185, 185). (e) IDRs predicted by the ANN predictor are: (1, 4), (16, 21), (28, 29), (31, 31), (34, 34), (170, 171) and (314, 314). Each pair in parentheses gives the start and end positions of an IDR in the protein.
Figure 3. A schematic view of protein 2ODKA with IDRs predicted by IDP–CRF and three classification-based predictors, where residues in red are disordered and residues in yellow are ordered. (a) IDRs predicted by IDP–CRF are: (1, 4) and (49, 85). (b) Actual IDR is: (52, 85). (c) IDRs predicted by the RF predictor are: (1, 5), (62, 65), (75, 75), (77, 77) and (81, 85). (d) IDRs predicted by the SVM predictor are: (1, 2), (52, 52), (62, 62), (77, 78), (81, 81) and (83, 84). (e) IDRs predicted by the ANN predictor are: (1, 4), (48, 49), (52, 52), (62, 62), (77, 78) and (83, 84). Each pair in parentheses gives the start and end positions of an IDR in the protein.
Figure 4. A schematic view of protein 4AD4A with IDRs predicted by IDP–CRF and three classification-based predictors, where residues in red are disordered and residues in yellow are ordered. (a) IDRs predicted by IDP–CRF are: (1, 26) and (376, 380). (b) Actual IDRs are: (1, 30) and (380, 380). (c) IDRs predicted by the RF predictor are: (1, 3), (13, 31), (69, 81), (132, 133), (236, 236), (346, 346) and (377, 380). (d) IDRs predicted by the SVM predictor are: (11, 31), (54, 55), (65, 69), (72, 75), (78, 82), (88, 88), (97, 99), (104, 104), (337, 337), (346, 346) and (376, 380). (e) IDRs predicted by the ANN predictor are: (10, 31), (33, 33), (53, 54), (65, 66), (68, 68), (72, 76), (78, 82), (86, 86), (88, 88), (97, 99), (129, 129), (201, 201), (207, 207), (260, 260), (337, 338), (342, 342), (344, 344), (346, 347) and (376, 380). Each pair in parentheses gives the start and end positions of an IDR in the protein.
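The interval notation in these captions can be compared residue by residue. The sketch below assumes the pairs are 1-based and inclusive, which is consistent with single-residue entries such as (31, 31); the helper names `residues` and `overlap_counts` are hypothetical. True negatives, and therefore specificity, are left out because they would need the protein length, which the captions do not give.

```python
def residues(intervals):
    """Expand 1-based inclusive (start, end) pairs into a set of positions."""
    return {pos for start, end in intervals for pos in range(start, end + 1)}


def overlap_counts(predicted, actual):
    """Count disordered residues that are hit, over-predicted, or missed."""
    pred, act = residues(predicted), residues(actual)
    return {
        "tp": len(pred & act),   # predicted disordered and actually disordered
        "fp": len(pred - act),   # predicted disordered but actually ordered
        "fn": len(act - pred),   # actually disordered but missed
    }


# IDP-CRF prediction versus the actual IDR for protein 2ODKA (Figure 3).
counts_2odka = overlap_counts(predicted=[(1, 4), (49, 85)], actual=[(52, 85)])
sensitivity_2odka = counts_2odka["tp"] / (counts_2odka["tp"] + counts_2odka["fn"])
```

For 2ODKA the IDP–CRF intervals cover the whole annotated region (52, 85), so every disordered residue is recovered; the seven over-predicted positions are 1 to 4 and 49 to 51.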
Performance comparison of different predictors on the independent dataset MxD494.
| Predictor a | Sn | Sp | ACC | MCC | Rank (ACC) | Rank (MCC) |
|---|---|---|---|---|---|---|
| IDP–CRF | 0.680 | 0.821 | 0.750 | 0.460 | 2 | 1 |
| MFDp | 0.746 | 0.768 | 0.757 | 0.451 | 1 | 2 |
| MD | 0.673 | 0.813 | 0.743 | 0.444 | 3 | 3 |
| PONDR-FIT | 0.631 | 0.821 | 0.726 | 0.419 | 6 | 4 |
| DISOPRED2 | 0.647 | 0.800 | 0.724 | 0.406 | 7 | 5 |
| IUPred-long | 0.581 | 0.841 | 0.711 | 0.405 | 8 | 6 |
| PONDR VSL2B | 0.774 | 0.698 | 0.736 | 0.401 | 4 | 7 |
| OnD–CRF b | 0.752 | 0.711 | 0.732 | 0.396 | 5 | 8 |
| IUPred-short | 0.522 | 0.866 | 0.694 | 0.389 | 10 | 9 |
| RONN | 0.664 | 0.754 | 0.709 | 0.368 | 9 | 10 |
| NORSnet | 0.532 | 0.829 | 0.681 | 0.347 | 11 | 11 |
| DisEMBL-R | 0.316 | 0.936 | 0.626 | 0.323 | 15 | 12 |
| DISpro | 0.303 | 0.940 | 0.622 | 0.318 | 16 | 13 |
| Ucon | 0.554 | 0.787 | 0.671 | 0.313 | 12 | 14 |
| Spritz | 0.494 | 0.812 | 0.653 | 0.293 | 14 | 15 |
| FoldIndex | 0.602 | 0.717 | 0.660 | 0.278 | 13 | 16 |
| DisEMBL-H | 0.435 | 0.792 | 0.614 | 0.216 | 17 | 17 |
| PROFbval | 0.835 | 0.387 | 0.611 | 0.196 | 18 | 18 |
| GlobPlot | 0.353 | 0.826 | 0.590 | 0.182 | 19 | 19 |
| DisEMBL-C | 0.760 | 0.414 | 0.587 | 0.150 | 20 | 20 |
a The results of the 18 compared predictors (MFDp, MD, PONDR-FIT, DISOPRED2, IUPred-long, PONDR VSL2B, IUPred-short, RONN, NORSnet, DisEMBL-R, DISpro, Ucon, Spritz, FoldIndex, DisEMBL-H, PROFbval, GlobPlot, DisEMBL-C) were obtained from [40]. b The results of OnD–CRF were obtained from its web server.
Performance comparison of different predictors on the independent dataset SL329.
| Predictor a | Sn | Sp | ACC | MCC | Rank (ACC) | Rank (MCC) |
|---|---|---|---|---|---|---|
| IDP–CRF | 0.75 | 0.88 | 0.817 | 0.64 | 1 | 2 |
| SPOT-disorder | 0.67 | 0.96 | 0.815 | 0.67 | 2 | 1 |
| SPINE-D | 0.78 | 0.85 | 0.815 | 0.63 | 2 | 3 |
| DISOPRED3 | - | - | 0.795 | 0.61 | 4 | 4 |
| DISOPRED2 | 0.69 | 0.90 | 0.795 | 0.59 | 4 | 5 |
| OnD–CRF b | 0.79 | 0.80 | 0.793 | 0.58 | 6 | 6 |
| MD | 0.66 | 0.89 | 0.775 | 0.58 | 7 | 6 |
| PONDR-FIT | 0.61 | 0.91 | 0.760 | 0.55 | 8 | 8 |
| IUPred-long | 0.60 | 0.92 | 0.760 | 0.55 | 8 | 8 |
| MFDp | 0.88 | 0.62 | 0.750 | 0.51 | 11 | 10 |
| DISOClust | 0.81 | 0.70 | 0.755 | 0.51 | 10 | 10 |
| NORSnet | 0.54 | 0.92 | 0.730 | 0.51 | 12 | 10 |
| IUPred-short | 0.50 | 0.94 | 0.720 | 0.50 | 13 | 13 |
| Ucon | 0.59 | 0.81 | 0.700 | 0.42 | 14 | 14 |
| DisEMBL | - | - | 0.660 | 0.40 | 16 | 15 |
| Dispro | 0.28 | 0.99 | 0.635 | 0.40 | 18 | 15 |
| PONDR VL-XT | 0.59 | 0.78 | 0.685 | 0.38 | 15 | 17 |
| Espritz | - | - | 0.605 | 0.35 | 19 | 18 |
| PROFbval | - | - | 0.648 | 0.30 | 17 | 19 |
a The results of the 17 compared predictors (SPOT-disorder, SPINE-D, DISOPRED3, DISOPRED2, MD, PONDR-FIT, IUPred-long, MFDp, DISOClust, NORSnet, IUPred-short, Ucon, DisEMBL, Dispro, PONDR VL-XT, Espritz, PROFbval) were obtained from [12,13]. b The results of OnD–CRF were obtained from its web server.
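The rank columns in the two comparison tables appear to follow standard competition ranking, in which tied scores share a rank and the next rank is skipped. This is inferred from the tabulated numbers rather than stated in the record. A minimal sketch, with the hypothetical helper name `competition_rank`:

```python
def competition_rank(scores):
    """Rank scores from highest to lowest; ties share the best available rank.

    A score's rank is one plus the number of strictly higher scores, so two
    predictors tied for second place both get rank 2 and the next gets rank 4.
    """
    ordered = sorted(scores, reverse=True)
    # index() returns the first position of the score in descending order,
    # which equals the count of strictly higher scores.
    return [ordered.index(score) + 1 for score in scores]


# ACC values of the six top-ranked predictors on SL329, in table order.
acc_sl329_top6 = [0.817, 0.815, 0.815, 0.795, 0.795, 0.793]
ranks_top6 = competition_rank(acc_sl329_top6)
```

Applied to these six values, the function reproduces the 1, 2, 2, 4, 4, 6 pattern in the ACC rank column of the SL329 table.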