Chuanxin Zou, Jiayu Gong, Honglin Li.
Abstract
BACKGROUND: DNA-binding proteins (DNA-BPs) play a pivotal role in both eukaryotic and prokaryotic proteomes. Several computational methods for predicting DNA-BPs have been proposed in the literature, and many informative features and properties have been used and shown to have a significant impact on this problem. However, the ultimate goal of bioinformatics is to predict DNA-BPs directly from primary sequence.
Year: 2013 PMID: 23497329 PMCID: PMC3602657 DOI: 10.1186/1471-2105-14-90
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The overall workflow of the present method. First, the input amino acid sequence is represented numerically by four kinds of features. Second, these feature values are transformed into feature descriptor matrices at three different levels. Third, a first round of evaluation is performed on the original descriptor pool and individual SVM models are obtained. Finally, the mRMR-IFS feature selection method and an ensemble learning approach are applied for the final evaluation of the optimal SVM model.
Figure 2. The count of three kinds of dipeptide composition.
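As a rough illustration of what a dipeptide composition descriptor looks like, the sketch below computes the standard adjacent-pair composition: the frequency of each of the 400 ordered residue pairs in a sequence. The three variants named in the caption are not recoverable from this record and are not modelled here; the function name `dipeptide_composition` and the example sequence are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def dipeptide_composition(sequence):
    """Return the frequency of each of the 400 adjacent residue pairs."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues such as X
            counts[pair] += 1
            total += 1
    # Normalise counts to frequencies; an empty or too-short sequence gives all zeros
    return {p: (c / total if total else 0.0) for p, c in counts.items()}

# Hypothetical toy sequence, used only to exercise the function
dpc = dipeptide_composition("MKRKAAKR")
```

The result is a fixed-length vector of 400 values regardless of sequence length, which is what makes this kind of descriptor usable as SVM input.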
List of the AAIndex indices used in this paper
| 39 | CHOP780202 | Normalized frequency of beta-sheet (Chou-Fasman, 1978b) |
| 56 | CIDH920103 | Normalized hydrophobicity scales for alpha+beta-proteins (Cid et al., 1992) |
| 58 | CIDH920105 | Normalized average hydrophobicity scales (Cid et al., 1992) |
| 86 | FAUJ880109 | Number of hydrogen bond donors (Fauchere et al., 1988) |
| 88 | FAUJ880111 | Positive charge (Fauchere et al., 1988) |
| 95 | FINA910104 | Helix termination parameter at position j+1 (Finkelstein et al., 1991) |
| 100 | GEIM800104 | Alpha-helix indices for alpha/beta-proteins (Geisow-Roberts, 1980) |
| 102 | GEIM800106 | Beta-strand indices for beta-proteins (Geisow-Roberts, 1980) |
| 139 | KANM800102 | Average relative probability of beta-sheet (Kanehisa-Tsong, 1980) |
| 146 | KLEP840101 | Net charge (Klein et al., 1984) |
| 147 | KRIW710101 | Side chain interaction parameter (Krigbaum-Rubin, 1971) |
| 167 | LIFS790101 | Conformational preference for all beta-strands (Lifson-Sander, 1979) |
| 178 | MEEJ800101 | Retention coefficient in HPLC, pH7.4 (Meek, 1980) |
| 214 | OOBM770102 | Short and medium range non-bonded energy per atom (Oobatake-Ooi, 1977) |
| 229 | PALJ810107 | Normalized frequency of alpha-helix in all-alpha class (Palau et al., 1981) |
| 280 | QIAN880123 | Weights for beta-sheet at the window position of 3 (Qian-Sejnowski, 1988) |
| 299 | RACS770103 | Side chain orientational preference (Rackovsky-Scheraga, 1977) |
| 321 | RADA880108 | Mean polarity (Radzicka-Wolfenden, 1988) |
| 356 | ROSM880102 | Side chain hydropathy, corrected for solvation (Roseman, 1988) |
| 365 | SWER830101 | Optimal matching hydrophobicity (Sweet-Eisenberg, 1983) |
| 399 | ZIMJ680102 | Bulkiness (Zimmerman et al., 1968) |
| 401 | ZIMJ680104 | Isoelectric point (Zimmerman et al., 1968) |
| 422 | AURR980120 | Normalized positional residue frequency at helix termini C4’ (Aurora-Rose, 1998) |
| 431 | MUNV940103 | Free energy in beta-strand conformation (Munoz-Serrano, 1994) |
| 449 | NADH010104 | Hydropathy scale based on self-information values in the two-state model (20% accessibility) (Naderi-Manesh et al., 2001) |
| 451 | NADH010106 | Hydropathy scale based on self-information values in the two-state model (36% accessibility) (Naderi-Manesh et al., 2001) |
| 512 | GUYH850105 | Apparent partition energies calculated from Chothia index (Guy, 1985) |
| 528 | MIYS990104 | Optimized relative partition energies - method C (Miyazawa-Jernigan, 1999) |
Figure 3. Definitions of the N-terminal, middle, and C-terminal parts depending on sequence length for the SAA method.
Indices used to evaluate the prediction method
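| Acc | $\frac{TP + TN}{TP + TN + FP + FN}$ |
| AUC | area under the receiver operating characteristic curve |
| F-score | $2 \cdot \frac{\text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}}$, with precision $= \frac{TP}{TP + FP}$ |
| Sen | $\frac{TP}{TP + FN}$ |
| Sp | $\frac{TN}{TN + FP}$ |
| MCC | $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ |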
TP (true positive); TN (true negative); FP (false positive); FN (false negative).
Acc, overall accuracy; AUC, area under the ROC curve; Sen, sensitivity; Sp, specificity; F-score, 2×precision×sensitivity/(precision+sensitivity); MCC, Matthews correlation coefficient.
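The threshold-based indices above can be computed directly from the four confusion-matrix counts. The following is a minimal sketch using the standard definitions of these measures; the function name `classification_metrics` and the example counts are illustrative assumptions, not values from the paper. AUC is omitted because it needs continuous scores rather than counts.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute Acc, Sen, Sp, F-score and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)            # sensitivity (recall on positives)
    sp = tn / (tn + fp)             # specificity (recall on negatives)
    precision = tp / (tp + fp)
    f_score = 2 * precision * sen / (precision + sen)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: report MCC as 0 when the denominator vanishes
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sen": sen, "Sp": sp, "F-score": f_score, "MCC": mcc}

# Hypothetical counts, used only to exercise the function
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
```

The sketch does not guard against empty classes (for example `tp + fn == 0`), which would need explicit handling in real use.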
Figure 4. The performance of different AC features with various values over DNAdset and DNAaset.
The performance of different kinds of feature descriptors
| OAAC | 0.872 | 0.941 | 0.856 | 0.865 | 0.852 | 0.716 | 0.726 | 0.794 | 0.742 | 0.799 | 0.650 | 0.451 |
| DPC | 0.872 | 0.925 | 0.838 | 0.865 | 0.809 | 0.672 | 0.717 | 0.784 | 0.725 | 0.753 | 0.682 | 0.436 |
| SAAC | 0.846 | 0.904 | 0.826 | 0.842 | 0.813 | 0.651 | 0.697 | 0.740 | 0.701 | 0.743 | 0.624 | 0.369 |
| AAIndex-OCTD | 0.845 | 0.905 | 0.824 | 0.828 | 0.825 | 0.651 | 0.743 | 0.782 | 0.729 | 0.766 | 0.664 | 0.452 |
| AAIndex-AC | 0.688 | 0.745 | 0.707 | 0.729 | 0.680 | 0.410 | 0.683 | 0.734 | 0.705 | 0.785 | 0.559 | 0.353 |
| AAIndex-SAA | 0.870 | 0.915 | 0.840 | 0.869 | 0.811 | 0.678 | 0.708 | 0.747 | 0.732 | 0.808 | 0.601 | 0.417 |
| PSSM-OCTD | 0.729 | 0.776 | 0.721 | 0.728 | 0.724 | 0.452 | 0.741 | 0.811 | 0.742 | 0.745 | 0.738 | 0.483 |
| PSSM-AC | 0.786 | 0.827 | 0.762 | 0.771 | 0.752 | 0.523 | 0.742 | 0.816 | 0.734 | 0.725 | 0.751 | 0.477 |
| PSSM-SAA | 0.872 | 0.932 | 0.872 | 0.903 | 0.839 | 0.741 | 0.761 | 0.840 | 0.773 | 0.797 | 0.737 | 0.535 |
| S&F-OCTD | 0.723 | 0.801 | 0.737 | 0.714 | 0.779 | 0.493 | 0.719 | 0.770 | 0.711 | 0.726 | 0.694 | 0.411 |
| S&F-AC | 0.745 | 0.799 | 0.729 | 0.756 | 0.690 | 0.446 | 0.717 | 0.771 | 0.701 | 0.690 | 0.723 | 0.413 |
| S&F-SAA | 0.712 | 0.734 | 0.649 | 0.627 | 0.736 | 0.371 | 0.711 | 0.768 | 0.703 | 0.710 | 0.692 | 0.402 |
Figure 5. The IFS curves of DNAdset, DNArset and DNAaset.
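An IFS curve of this kind is typically obtained by scoring every top-k prefix of a ranked feature list and plotting score against k. The sketch below shows that loop under stated assumptions: the ranking is taken as given (in the paper it would come from mRMR), and `toy_evaluate` is a hypothetical stand-in for the cross-validated SVM performance, which this record does not specify.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Score each top-k prefix of a ranked feature list and return the best one."""
    curve = []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        curve.append((k, evaluate(subset)))  # one point of the IFS curve
    # The first k reaching the maximum score is kept (smallest such subset)
    best_k, best_score = max(curve, key=lambda point: point[1])
    return ranked_features[:best_k], best_score, curve

# Hypothetical per-feature contributions, standing in for a real classifier
toy_gain = {"f1": 0.20, "f2": 0.10, "f3": -0.05, "f4": -0.10}

def toy_evaluate(subset):
    return 0.5 + sum(toy_gain[name] for name in subset)

selected, score, ifs_curve = incremental_feature_selection(
    ["f1", "f2", "f3", "f4"], toy_evaluate
)
```

In this toy run the curve peaks at two features, so `selected` holds the first two entries of the ranking; with a real classifier each call to `evaluate` would retrain and cross-validate the model.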
The performance of feature-selection method and ensemble learning
| mRMR-IFS | 0.940 | 0.973 | 0.940 | 0.964 | 0.917 | 0.881 | 0.789 | 0.864 | 0.793 | 0.819 | 0.766 | 0.575 |
| Ensemble-voting | 0.898 | N/A | 0.900 | 0.905 | 0.892 | 0.797 | 0.789 | N/A | 0.792 | 0.801 | 0.778 | 0.579 |
| Ensemble-stacking | 0.907 | 0.965 | 0.910 | 0.935 | 0.878 | 0.819 | 0.811 | 0.885 | 0.808 | 0.814 | 0.799 | 0.614 |
N/A – not available.
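For the voting ensemble, a minimal sketch of majority voting over hard class labels is given below. It assumes each individual model emits a 0/1 label per sample; the function name `majority_vote` and the toy predictions are hypothetical. Because plain voting yields labels rather than continuous scores, no ROC curve can be drawn from its output, which may be why AUC is listed as N/A in that row.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine hard labels from several models by majority vote per sample."""
    combined = []
    for votes in zip(*predictions_per_model):  # one tuple of votes per sample
        # With an even number of models, ties go to the label seen first
        label, _ = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined

# Hypothetical outputs of three models on four samples (1 = DNA-BP, 0 = non-DNA-BP)
model_outputs = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
voted = majority_vote(model_outputs)
```

Stacking differs in that a second-level learner is trained on the individual models' outputs, so it can produce a continuous score and therefore an AUC.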
Comparison of the predicted results by our method and some web-servers on DNAiset
| 0.717 | 0.642 | 0.863 | 0.656 | 0.473 | |
| 0.842 | 0.739 | 0.763 | 0.875 | 0.627 | |
| 0.875 | 0.798 | 0.837 | 0.891 | 0.709 | |
| 0.890 | 0.828 | 0.900 | 0.886 | 0.753 |
Figure 6. Distribution of the number of features of each type (12 types in total) in the optimal feature set.