Tong-Hui Zhao, Min Jiang, Tao Huang, Bi-Qing Li, Ning Zhang, Hai-Peng Li, Yu-Dong Cai.
Abstract
With many disordered proteins and their important functions now discovered, effective computational methods for predicting disordered regions in proteins are highly desirable. In this study, we developed a new method to predict disordered regions in proteins based on Random Forest (RF), Maximum Relevancy Minimum Redundancy (mRMR), and Incremental Feature Selection (IFS). The mRMR criterion was used to rank the importance of all candidate features. The top 128 features from the ranked list were then selected to build the optimal model, comprising 92 Position Specific Scoring Matrix (PSSM) conservation score features and 36 secondary structure features. This model achieved a Matthews correlation coefficient (MCC) of 0.3895 on the training set by 10-fold cross-validation. Starting from the prediction results for each query sequence, we applied a scanning and modification strategy to improve performance. The accuracy (ACC) and MCC were increased by 4% and almost 0.2%, respectively, compared with three other popular predictors: DISOPRED, DISOclust, and OnD-CRF. The selected features may shed some light on the formation mechanism of disordered structures and provide guidelines for experimental validation.
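The mRMR ranking step described in the abstract can be illustrated with a minimal sketch. It assumes the mutual-information difference form of the criterion (relevance minus mean redundancy) and feature values that are already discretized; neither detail is stated in this record. The names `mutual_information` and `mrmr_rank` and the toy features are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mrmr_rank(features, labels):
    """Greedy mRMR ranking; `features` maps a feature name to a list of discrete values."""
    relevance = {name: mutual_information(col, labels) for name, col in features.items()}
    ranked = []
    remaining = sorted(features)  # sorted so that ties are broken deterministically

    while remaining:
        def score(name):
            # Relevance to the labels minus mean redundancy with features already ranked.
            if not ranked:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in ranked
            ) / len(ranked)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


# Toy data (assumption, not from the study): f2 duplicates f1, f3 is complementary.
toy_labels = [0, 0, 1, 1, 2, 2, 3, 3]
toy_features = {
    "f1": [0, 0, 0, 0, 1, 1, 1, 1],
    "f2_copy_of_f1": [0, 0, 0, 0, 1, 1, 1, 1],
    "f3": [0, 0, 1, 1, 0, 0, 1, 1],
}
toy_ranking = mrmr_rank(toy_features, toy_labels)
print(toy_ranking)  # ['f1', 'f3', 'f2_copy_of_f1']
```

In this toy case the duplicated feature is pushed to the end of the list because it adds no information beyond `f1`, which is the behavior the redundancy term is meant to produce.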
Year: 2013 PMID: 23710446 PMCID: PMC3654632 DOI: 10.1155/2013/414327
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. The IFS curve showing the Matthews correlation coefficient (MCC) against the number of features. Details are given in Online Supporting Information S5. With the top 128 features, the MCC on the training set by 10-fold cross-validation peaks at 0.3895.
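The IFS procedure behind such a curve can be sketched as follows: score every top-k prefix of the ranked feature list and keep the prefix with the highest score. This is a minimal illustration, not the authors' implementation. The function name `incremental_feature_selection` and the `evaluate` callback are hypothetical; in the study the score would be the 10-fold cross-validation MCC of a Random Forest model, which is replaced here by a toy stand-in.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
) -> Tuple[List[str], float, List[float]]:
    """Score the top-k prefix of `ranked` for k = 1..N and return the best prefix.

    Returns (best_subset, best_score, curve), where curve[k - 1] is the score
    obtained with the top k features.
    """
    curve = [evaluate(ranked[:k]) for k in range(1, len(ranked) + 1)]
    best_index = max(range(len(curve)), key=curve.__getitem__)
    return list(ranked[: best_index + 1]), curve[best_index], curve


def toy_score(subset: Sequence[str]) -> float:
    # Hypothetical stand-in for a cross-validated MCC:
    # rises until 3 features are used, then slowly decays.
    k = len(subset)
    return 0.1 * k if k <= 3 else 0.3 - 0.02 * (k - 3)


toy_ranked = ["f1", "f2", "f3", "f4", "f5"]
best_subset, best_score, curve = incremental_feature_selection(toy_ranked, toy_score)
print(best_subset, len(curve))
```

Plotting `curve` against k gives an IFS curve of the kind shown in Figure 1; the peak position determines the size of the optimal feature subset.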
Figure 2. The distribution of feature types and amino acid sites in the optimal feature subset. The histograms show the number of features of each type and at each site in the optimal feature subset. (a) There are 92 PSSM features and 36 secondary structure features. (b) The site distribution of the features in the optimal feature set.
Figure 3. The distribution of amino acid compositions and sites for the PSSM conservation features. The histograms show the types and site distributions of the PSSM features in the optimal feature set. (a) The effects on prediction of mutations to the 20 different amino acids. (b) The site distribution of the PSSM features in the optimal feature set.
Figure 4. The distribution of secondary structure types and amino acid sites for the secondary structure features. The histograms show the types and site distributions of the secondary structure features in the final optimal feature set. (a) The effects on prediction of the three types of secondary structure: coil, strand, and helix. (b) The site distribution of the secondary structure features in the optimal feature set.
Evaluation of prediction results on the independent test set by different methods.
| Method | Accuracy | Matthews correlation coefficient | Sensitivity | Specificity |
|---|---|---|---|---|
| Before scanning | 0.7028 | 0.2791 | 0.7189 | 0.6281 |
| After scanning | 0.7508 | 0.3304 | 0.7806 | 0.6118 |
| DISOPRED | 0.7173 | 0.3285 | 0.7239 | 0.6864 |
| DISOclust | 0.6650 | 0.3105 | 0.6453 | 0.7570 |
| OnD-CRF | 0.6562 | 0.3228 | 0.6265 | 0.7941 |
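For reference, the four measures reported in the table can be computed from a binary confusion matrix as sketched below. These are the standard definitions; treating disordered residues as the positive class is an assumption, since this record does not say which class is positive. The function name `binary_metrics` and the example counts are hypothetical.

```python
from math import sqrt


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, MCC, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "mcc": mcc,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# Made-up counts for illustration only (not taken from the study).
example = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print({name: round(value, 4) for name, value in example.items()})
```

Because MCC uses all four cells of the confusion matrix, it is less affected by class imbalance than accuracy, which is one reason both are usually reported side by side for residue-level predictors.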