Jiansheng Wu, Hongde Liu, Xueye Duan, Yan Ding, Hongtao Wu, Yunfei Bai, Xiao Sun.
Abstract
MOTIVATION: In this work, we aim to develop a computational approach for predicting DNA-binding sites in proteins from amino acid sequences. To avoid overfitting with this method, all available DNA-binding proteins from the Protein Data Bank (PDB) are used to construct the models. The random forest (RF) algorithm is used because it is fast and has robust performance for different parameter values. A novel hybrid feature is presented which incorporates evolutionary information of the amino acid sequence, secondary structure (SS) information and orthogonal binary vector (OBV) information, which reflects the characteristics of the 20 kinds of amino acids for two physical-chemical properties (dipoles and volumes of the side chains). The numbers of binding and non-binding residues in proteins are highly unbalanced, so a novel scheme is proposed to deal with the problem of imbalanced datasets by downsizing the majority class.
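The idea of downsizing the majority class can be illustrated with plain random undersampling. This is only a minimal sketch of the generic technique, not the authors' scheme, whose details are not given in the abstract. The function name `downsample_majority`, the `ratio` parameter, and the convention that label 1 marks a binding residue are assumptions made for illustration.

```python
import random


def downsample_majority(samples, labels, ratio=1.0, seed=0):
    """Randomly drop majority-class samples (label 0, assumed non-binding).

    `ratio` is a hypothetical knob: the number of negatives kept per positive.
    All positives (label 1, assumed binding) are kept.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]

    # Never ask for more negatives than exist.
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept_neg = rng.sample(neg, n_keep)

    # Restore the original ordering of the retained samples.
    keep = sorted(pos + kept_neg)
    return [samples[i] for i in keep], [labels[i] for i in keep]
```

With three positives and seventeen negatives, the default `ratio=1.0` would keep all three positives and a random three of the negatives.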
Year: 2008 PMID: 19008251 PMCID: PMC2638931 DOI: 10.1093/bioinformatics/btn583
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
The prediction performance of the RF model based on various features. The prediction system was evaluated by nested cross-validation, and the threshold for sample classification is 1/6.
| Features | ACC±SD(%) | SE±SD(%) | PR±SD(%) | SP±SD(%) | MCC |
|---|---|---|---|---|---|
| A | 88.93±0.19 | 76.29±1.40 | 63.39±0.79 | 91.46±0.22 | 0.633 |
| A+B | 89.69±0.41 | 76.58±1.94 | 66.51±1.28 | 92.31±0.35 | 0.652 |
| A+C | 90.46±0.44 | 75.64±0.65 | 69.63±2.36 | 93.41±0.59 | 0.668 |
| A+B+C | 91.41±0.49 | 76.57±1.69 | 73.16±2.78 | 94.38±0.76 | 0.70 |
A: PSSMs; B: SS; C: OBVs of amino acids.
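The column headings above are standard binary-classification measures. The sketch below shows how accuracy (ACC), sensitivity (SE), precision (PR), specificity (SP) and the Matthews correlation coefficient (MCC) follow from confusion-matrix counts, using their textbook definitions. The helper `classify` assumes that the 1/6 threshold is applied to a per-residue score (for example, the fraction of trees voting "binding") with a greater-than-or-equal comparison; both points are assumptions, since the record does not spell them out.

```python
import math


def classify(scores, threshold=1 / 6):
    """Label a residue as binding (1) when its score reaches the threshold.

    The ">=" comparison and the meaning of `scores` are assumptions.
    """
    return [1 if s >= threshold else 0 for s in scores]


def _ratio(num, den):
    """Return num/den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute ACC, SE, PR, SP (as fractions) and MCC from confusion counts."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": _ratio(tp + tn, tp + fp + tn + fn),
        "SE": _ratio(tp, tp + fn),   # sensitivity (recall)
        "PR": _ratio(tp, tp + fp),   # precision
        "SP": _ratio(tn, tn + fp),   # specificity
        "MCC": _ratio(tp * tn - fp * fn, mcc_den),
    }
```

Because non-binding residues dominate, ACC and SP can be high even when PR is modest, which is why MCC is reported alongside them.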
Fig. 1. The expected prediction accuracy and the fraction of sequences with each RI by RF. For example, 34.2% of all samples have RI=10 and of these samples 97.63% are predicted correctly.
Fig. 2. Performance comparisons of ROC graphs with other methods. (A) Both classifiers were performed on the same dataset DBP-374 with the same features ‘PSSMs+SS+OBVs’. RF: performance from the features (A+B+C) of Table 2; SVM: an SVM-based method evaluated by nested cross-validation. (B) All classifiers were tested on the same testing dataset TS75. The predictors have the following AUC values: BindN 0.782, Ho et al. 0.843 and our RF model 0.855.
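The AUC values quoted in the caption summarize each ROC curve as a single number. As a minimal sketch, the AUC can be computed directly as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (ties count one half). The function name `roc_auc` and the 0/1 label convention are assumptions; this is the generic rank-based definition, not code from the paper.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Quadratic in the number of samples; fine for illustration only.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of binding from non-binding residues.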
Performance comparisons with the methods in DP-Bind. All classifiers were tested on the same testing dataset TS75, and 4.5 Å was designated as the cutoff distance in the definition of a binding residue.
| Classifiers | ACC(%) | SE(%) | PR(%) | SP(%) | MCC |
|---|---|---|---|---|---|
| SVM | 75.31 | 68.40 | 23.05 | 76.04 | 0.290 |
| KLR | 78.19 | 67.22 | 25.45 | 79.34 | 0.315 |
| PLR | 76.49 | 64.06 | 23.23 | 77.79 | 0.279 |
| MAJ | 77.98 | 67.85 | 25.35 | 79.04 | 0.316 |
| RF | 80.47 | 67.16 | 27.69 | 81.84 | 0.341 |
SVM, KLR and PLR denote the corresponding predictors in DP-Bind; MAJ is the majority consensus prediction of the three predictors.
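The MAJ row combines the three DP-Bind predictors by majority consensus. A minimal sketch of such a vote over per-residue 0/1 predictions is shown below; the function name `majority_consensus` and the example prediction lists are hypothetical, and DP-Bind's own implementation may differ.

```python
def majority_consensus(*predictions):
    """Per-residue majority vote over several equal-length 0/1 prediction lists.

    A residue is called binding (1) when more than half of the predictors agree.
    """
    if not predictions:
        raise ValueError("at least one prediction list is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all prediction lists must have the same length")

    n = len(predictions)
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*predictions)]


# Hypothetical per-residue outputs of three predictors for a 4-residue protein.
svm = [1, 0, 1, 0]
klr = [1, 1, 0, 0]
plr = [0, 1, 1, 0]
consensus = majority_consensus(svm, klr, plr)
```

With an odd number of predictors, as here, a tie cannot occur, so no tie-breaking rule is needed.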