Li Song, Dapeng Li, Xiangxiang Zeng, Yunfeng Wu, Li Guo, Quan Zou.
Abstract
BACKGROUND: DNA-binding proteins are vital for the study of cellular processes. In recent genome engineering studies, identifying proteins with specific functions has become increasingly important and must be done rapidly and efficiently. Several approaches have been developed in previous years to improve the identification of DNA-binding proteins, but the currently available resources are insufficient to identify these proteins accurately. As a result, previous research has been limited by the relatively unbalanced accuracy and the low identification success of current methods.
Year: 2014 PMID: 25196432 PMCID: PMC4165999 DOI: 10.1186/1471-2105-15-298
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Flowchart of the 188D feature extraction method. The number of loop iterations equals the number of physicochemical properties.
Figure 2. Framework of the ensemble classifier imDC. n is the number of minority samples and m is the number of majority samples. The loop body runs iterNum times.
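The Figure 2 caption names only three quantities: n minority samples, m majority samples, and a loop that runs iterNum times. As a rough illustration of how such a loop can be organized, the sketch below trains one base learner per iteration on all n minority samples plus n randomly drawn majority samples, then combines the learners by majority vote. This is a generic under-sampling ensemble and should not be read as the authors' imDC procedure; the nearest-centroid base learner, the voting rule, and every function name are assumptions made for illustration.

```python
import random


def centroid(points):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def train_nearest_centroid(pos, neg):
    """Hypothetical stand-in base learner: predict the class with the closer centroid."""
    c_pos, c_neg = centroid(pos), centroid(neg)

    def predict(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
        return 1 if d_pos <= d_neg else 0

    return predict


def train_undersampling_ensemble(minority, majority, iter_num, seed=0):
    """Train iter_num base learners, each on n minority and n sampled majority points."""
    n, m = len(minority), len(majority)
    if n > m:
        raise ValueError("expected the minority class to be the smaller one")
    rng = random.Random(seed)
    models = []
    for _ in range(iter_num):
        subset = rng.sample(majority, n)  # balanced draw: n versus n
        models.append(train_nearest_centroid(minority, subset))
    return models


def predict_majority_vote(models, x):
    """Label 1 if strictly more than half of the base learners vote 1, else 0."""
    votes = sum(model(x) for model in models)
    return 1 if 2 * votes > len(models) else 0


# Toy data (made up for this sketch): 3 minority points, 20 majority points.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
majority = [[-1.0 + 0.1 * i, -1.0] for i in range(20)]
models = train_undersampling_ensemble(minority, majority, iter_num=5)
print(predict_majority_vote(models, [1.0, 1.0]))
print(predict_majority_vote(models, [-1.0, -1.0]))
```

Each base learner sees a balanced training set, which is one common way to keep the majority class from dominating; an odd iter_num avoids tied votes.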
The original dataset and the datasets following the threshold removal
| DNA-binding | Non DNA-binding | Total |
|---|---|---|
| 1,353 | 9,361 | 10,714 |
| 1,216 | 8,536 | 9,752 |
| 1,219 | 8,611 | 9,830 |
| 1,220 | 8,653 | 9,873 |
| 1,221 | 8,670 | 9,891 |
| 1,221 | 8,676 | 9,897 |
| 1,223 | 8,685 | 9,908 |
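To make the class imbalance in these counts concrete, the short sketch below recomputes each row's total and the ratio of non-DNA-binding to DNA-binding sequences. The numbers are copied from the table in row order; the function name `imbalance_ratio` is only an illustrative choice.

```python
# (DNA-binding, non-DNA-binding, total) copied from the table, in row order.
rows = [
    (1353, 9361, 10714),
    (1216, 8536, 9752),
    (1219, 8611, 9830),
    (1220, 8653, 9873),
    (1221, 8670, 9891),
    (1221, 8676, 9897),
    (1223, 8685, 9908),
]


def imbalance_ratio(positives, negatives):
    """Number of majority (non-binding) samples per minority (binding) sample."""
    return negatives / positives


for positives, negatives, total in rows:
    # Every row's total should equal the sum of its two class counts.
    assert positives + negatives == total
    ratio = imbalance_ratio(positives, negatives)
    share = positives / total
    print(f"{positives:>5} {negatives:>5}  ratio={ratio:.2f}  positive share={share:.1%}")
```

In every row there are roughly seven non-binding sequences for each DNA-binding one, so a classifier can reach high accuracy while rarely predicting the positive class.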
Figure 3. Comparison of the accuracy of the ensemble classifier imDC and the other classifiers at each threshold.
Comparison of the F-measure of the ensemble classifier imDC and the other classifiers using each of the thresholds
| KNN | J48 | RF | SVM | Bagging | imDC |
|---|---|---|---|---|---|
| 0.774 | 0.808 | 0.820 | 0.813 | 0.820 | |
| 0.779 | 0.816 | 0.823 | 0.815 | 0.821 | |
| 0.775 | 0.818 | 0.825 | 0.815 | 0.824 | |
| 0.774 | 0.819 | 0.822 | 0.815 | 0.825 | |
| 0.774 | 0.814 | 0.823 | 0.815 | 0.823 | |
| 0.779 | 0.817 | 0.823 | 0.815 | 0.824 | |
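For readers less familiar with the metric, the sketch below shows how an F-measure is computed from confusion counts and why it is reported alongside accuracy on imbalanced data. It computes the plain positive-class F-measure; the record does not state which averaging the table values use, so this is not a reconstruction of those numbers. The class counts in the example are those of the original dataset (1,353 DNA-binding, 9,361 non-DNA-binding), and the all-negative classifier is a hypothetical baseline.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure (harmonic mean) for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical baseline: label every sequence of the original dataset as non-binding.
positives, negatives = 1353, 9361
tp, fp, fn, tn = 0, 0, positives, negatives

accuracy = (tp + tn) / (positives + negatives)
precision, recall, f_measure = precision_recall_f(tp, fp, fn)

print(f"accuracy={accuracy:.3f}")    # high, because negatives dominate
print(f"f_measure={f_measure:.3f}")  # zero, because no positive is ever found
```

The baseline is correct on every non-binding sequence yet finds no DNA-binding protein at all, which accuracy alone would hide.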
Comparison of the AUC value of the ensemble classifier imDC and the other classifiers using each of the thresholds
| KNN | J48 | RF | SVM | Bagging | imDC |
|---|---|---|---|---|---|
| 0.543 | 0.539 | 0.624 | 0.496 | 0.688 | |
| 0.544 | 0.575 | 0.615 | 0.496 | 0.679 | |
| 0.537 | 0.585 | 0.631 | 0.495 | 0.690 | |
| 0.533 | 0.578 | 0.621 | 0.496 | 0.674 | |
| 0.533 | 0.574 | 0.618 | 0.496 | 0.669 | |
| 0.549 | 0.579 | 0.617 | 0.495 | 0.679 | |
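As a reading aid for the AUC values, the sketch below computes AUC directly from its ranking definition: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy scores are illustrative assumptions and are unrelated to the study's data.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))


# A scorer that separates the classes perfectly.
print(auc_from_scores([0.9, 0.8], [0.2, 0.1]))
# A scorer that ranks three of the four pairs correctly.
print(auc_from_scores([0.9, 0.3], [0.5, 0.1]))
# A scorer that gives every sample the same score.
print(auc_from_scores([0.5, 0.5], [0.5, 0.5]))
```

A scorer that cannot distinguish the classes lands at 0.5, so table values close to 0.5, such as 0.496, indicate ranking that is about as good as chance.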
Figure 4. Comparison of the F-measure of the ensemble classifier imDC and the other classifiers at the 0.4 threshold.
Figure 5. Comparison between SVM on a balanced dataset and imDC on the unbalanced dataset.
Figure 6. Accuracy of several feature extraction methods at different thresholds.
Figure 7. Accuracy on different datasets using the same ensemble classifier.
The rank of features in the mRMR feature selection
| Order | Fea | Name | Score |
|---|---|---|---|
| | 23 | Fea23 | 0.017 |
| | 41 | Fea41 | -0.001 |
| | 83 | Fea83 | -0.002 |
| | 9 | Fea9 | -0.002 |
| | 36 | Fea36 | -0.003 |
| | 2 | Fea2 | -0.004 |
| | 7 | Fea7 | -0.003 |
| | 155 | Fea155 | -0.005 |
| | 16 | Fea16 | -0.006 |
| | 115 | Fea115 | -0.005 |
Figure 8. An indicator variation diagram of the different features after selection.
A comparison of the three predictor methods
| Precision (positive) | Precision (negative) | Accuracy |
|---|---|---|
| 20% | 80% | 68% |
| 0 | 95% | 76% |
| 50% | 95% | 86% |