Xin Ma, Jing Guo, Xiao Sun.
Abstract
DNA-binding proteins are fundamentally important in cellular processes. Several computational methods have been developed in recent years to improve the prediction of DNA-binding proteins. However, insufficient work has been done on predicting DNA-binding proteins from protein sequence information. In this paper, a novel predictor, DNABP (DNA-binding proteins), was designed to predict DNA-binding proteins using a random forest (RF) classifier with a hybrid feature. The hybrid feature contains two types of novel sequence features, which reflect the conservation of the physicochemical properties of the amino acids, and the binding propensity of DNA-binding residues together with the non-binding propensity of non-binding residues. Comparisons against each individual feature demonstrated that these two novel features contributed most to the improvement in predictive ability. Furthermore, to improve the prediction performance of the DNABP model, feature selection using the minimum redundancy maximum relevance (mRMR) method combined with incremental feature selection (IFS) was carried out during model construction. The results showed that the DNABP model achieved 86.90% accuracy, 83.76% sensitivity, 90.03% specificity and a Matthews correlation coefficient of 0.727. The high prediction accuracy and the performance comparisons with previous research suggest that DNABP could be a useful approach for identifying DNA-binding proteins from sequence information. The DNABP web server is freely available at http://www.cbi.seu.edu.cn/DNABP/.
Year: 2016 PMID: 27907159 PMCID: PMC5132331 DOI: 10.1371/journal.pone.0167345
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Workflow of DNABP.
Comparison of the performances of various features using the RF algorithm based on Mainset with five-fold cross-validation
| Feature | ACC | SE | SP | MCC |
|---|---|---|---|---|
*The RF-based method with the best parameters (ntree = 1000, mtry = 20)
Fig 2. The IFS curve showing MCC values plotted against the number of features.
The maximum MCC value was 0.727 when the top 64 features were selected.
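Incremental feature selection can be sketched as follows, assuming the features have already been ranked (for example by mRMR) and that some scoring function returns a performance value such as cross-validated MCC for a given feature subset. The function name, the `evaluate` callback and the toy scores are hypothetical placeholders, not the authors' implementation.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Score the top-k ranked features for k = 1..n and keep the best k.

    ranked_features: feature names ordered from most to least relevant.
    evaluate: callable taking a list of features and returning a score.
    Returns (selected_features, best_score, curve), where curve holds
    the (k, score) points of the IFS curve.
    """
    best_k, best_score = 0, float("-inf")
    curve = []
    for k in range(1, len(ranked_features) + 1):
        score = evaluate(ranked_features[:k])
        curve.append((k, score))
        if score > best_score:  # ties keep the smaller subset
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score, curve


# Toy stand-in for a real evaluation: the score depends only on subset size.
toy_scores = {1: 0.40, 2: 0.55, 3: 0.61, 4: 0.58, 5: 0.60}
ranked = ["f1", "f2", "f3", "f4", "f5"]
selected, best, curve = incremental_feature_selection(
    ranked, lambda subset: toy_scores[len(subset)]
)
```

In this toy run the curve peaks at three features, in the same way that the curve described above peaks at the top 64 features.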
The optimal 64 features for the prediction of DNA-binding proteins
| Rank | Feature | p-value |
|---|---|---|
The performance of DNABP, enDNA-Prot, iDNA-Prot|dis and nDNA-Prot based on the Testset
| Method | ACC | SE | SP | MCC |
|---|---|---|---|---|
Comparison of the performances of DNABP and enDNA-Prot based on various test datasets
| Model | Test dataset | ACC | SE | SP | MCC |
|---|---|---|---|---|---|
*The results are obtained from reference [11]
Fig 3. (a) Feature distribution for the 64 optimal features. (b) The selection proportion of each type of feature.
Fig 4. (a) Physicochemical property distribution of the 38 PSSM-PP features that were selected in the optimal feature set. (b) The type of amino acid distribution used to construct the 38 PSSM-PP features that were selected in the optimal feature set.
Fig 5. (a) Physicochemical property distribution used to construct the 23 PHY features that were selected in the optimal feature set. (b) Distribution of the three descriptors used to construct the 23 PHY features that were selected in the optimal feature set.
Comparison of the performances of various datasets using the RF algorithm based on 292 features with five-fold cross-validation
| Dataset | ACC | SE | SP | MCC |
|---|---|---|---|---|
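Five-fold cross-validation, as used for the comparisons above, partitions the samples into five folds and holds each fold out once for testing. The sketch below shows a plain, unstratified split. Whether the authors stratified their folds by class is not stated here, so this is an assumption, and the function name and seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducible folds
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Ten samples split into five folds of two samples each.
splits = list(k_fold_indices(10, k=5))
```

Each sample appears in exactly one test fold, so metrics averaged over the five folds use every sample once for evaluation.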