Die Chen, Hua Zhang, Zeqi Chen, Bo Xie, Ye Wang.
Abstract
The interaction between DNA and proteins is vital for the development of a living organism. Numerous previous studies on the in silico identification of DNA-binding proteins (DBPs) rely on features extracted from the alignment-based (pseudo) position-specific scoring matrix (PSSM), which limits their application because the PSSM is time-consuming to generate. Few researchers have applied pretrained language models at evolutionary scale to the identification of DBPs. To this end, we present a comparative study of alignment-based PSSM and pretrained evolutionary scale modeling (ESM) representations for DBP classification. The comparison is conducted by extracting information from the PSSM and ESM representations using four unified averaging operations and by applying various feature selection (FS) methods. Experimental results demonstrate that the pretrained ESM representation outperforms the PSSM-derived features under a fair comparison. The pretrained feature representation deserves wide application in in silico DBP identification as well as in other function annotation problems. Finally, we also confirm that an ensemble scheme that aggregates various trained FS models can significantly improve the classification performance for DBPs.
Year: 2022 PMID: 35799660 PMCID: PMC9256349 DOI: 10.1155/2022/5847242
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.809
Summary of the features designed in this comparative study, where d = 20 for the PSSMS and PSSMR representations and d = 1280 for the ESM representation.
| Index | Description of the feature category | Abbreviation | #Features |
|---|---|---|---|
| 1 | Average representation over all residue-level feature vectors | Avg | |
| 2 | Average representation over | Avg | 5 |
| 3 | Average representation over residues with a specific amino acid type | Avg | 20 |
| 4 | Average representation over correlations between two residues given sequence distance | Avg | 3 |
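
Two of the averaging operations in the table (categories 1 and 3) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function names, the use of a zero vector for amino acid types that do not occur in a sequence, and the reading of the #Features column as multiples of d. That last reading is at least consistent with the feature counts reported in the later tables, since (1 + 5 + 20 + 3) × d gives 580 for d = 20 and 37120 for d = 1280.

```python
# Hypothetical helpers; not the authors' implementation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid types


def mean_vector(vectors, d):
    """Element-wise mean of a list of d-dimensional vectors."""
    if not vectors:
        # Assumption: an absent group contributes a zero vector.
        return [0.0] * d
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(d)]


def avg_all(residue_vectors):
    """Category 1: average over all residue-level feature vectors (d values)."""
    d = len(residue_vectors[0])
    return mean_vector(residue_vectors, d)


def avg_by_amino_acid(sequence, residue_vectors):
    """Category 3: one average per amino acid type (20 * d values)."""
    d = len(residue_vectors[0])
    features = []
    for aa in AMINO_ACIDS:
        group = [v for r, v in zip(sequence, residue_vectors) if r == aa]
        features.extend(mean_vector(group, d))
    return features


def total_feature_count(d, multipliers=(1, 5, 20, 3)):
    """Assumed total size when all four categories are concatenated."""
    return sum(multipliers) * d


# Toy protein of three residues with d = 2.
seq = "AAC"
vecs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
protein_avg = avg_all(vecs)
per_type = avg_by_amino_acid(seq, vecs)
```

Both functions map a variable-length sequence to a fixed-length vector, which is what allows PSSM and ESM representations to be fed to the same classifiers.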
Performance comparison of baseline classifiers based on the fivefold CV and the test set using the entire feature sets.
| Feature set | Classifier | Fivefold cross-validation on PDB1616 | | | | Blind test on PDB186 | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | ACC (%) | MCC | SP (%) | SN (%) | ACC (%) | MCC | SP (%) | SN (%) |
| PSSMR_Avg | GNB | 65.53 | 0.3108 | 67.08 | 63.99 | 61.83 | 0.2406 | 52.69 | 70.97 |
| | KNN | 66.09 | 0.3229 | 70.17 | 62.00 | 58.06 | 0.1613 | 56.99 | 59.14 |
| | DT | 60.89 | 0.2178 | 60.52 | 61.26 | 60.75 | 0.2157 | 56.99 | 64.52 |
| | LR | 69.74 | 0.3950 | | 68.32 | 63.98 | 0.2856 | 53.76 | 74.19 |
| | SVM | | | 64.98 | | | | 50.54 | |
| | RF | 69.37 | 0.3875 | 68.32 | 70.42 | 65.59 | 0.3154 | | 73.12 |
| | GBDT | 69.25 | 0.3849 | 69.06 | 69.43 | 65.05 | 0.3041 | | 72.04 |
| | XGB | 68.19 | 0.3640 | 66.71 | 69.68 | 60.22 | 0.2047 | 56.99 | 63.44 |
| PSSMS_Avg | GNB | 68.32 | 0.3666 | 66.34 | 70.30 | 68.28 | 0.3704 | 60.22 | 76.34 |
| | KNN | 68.13 | 0.3629 | 66.34 | 69.93 | 67.74 | 0.3552 | 65.59 | 69.89 |
| | DT | 62.93 | 0.2587 | 63.37 | 62.50 | 63.44 | 0.2689 | 62.37 | 64.52 |
| | LR | 70.92 | 0.4186 | 69.06 | 72.77 | 70.97 | 0.4242 | 63.44 | 78.49 |
| | SVM | | | 66.71 | | | | 59.14 | |
| | RF | 69.80 | 0.3962 | 68.19 | 71.41 | | 0.4494 | 62.37 | 81.72 |
| | GBDT | 71.66 | 0.4336 | | 74.01 | | 0.4418 | | 75.27 |
| | XGB | 69.06 | 0.3814 | 67.45 | 70.67 | 70.97 | 0.4242 | 63.44 | 78.49 |
| PSSMR_All | GNB | 64.85 | 0.2975 | 67.70 | 62.00 | 60.75 | 0.2154 | 63.44 | 58.06 |
| | KNN | 59.34 | 0.2078 | | 37.50 | 59.14 | 0.2025 | | 37.63 |
| | DT | 59.84 | 0.1968 | 59.28 | 60.40 | 57.53 | 0.1506 | 55.91 | 59.14 |
| | LR | 68.44 | 0.3692 | 70.67 | 66.21 | | 0.3443 | 65.59 | 68.82 |
| | SVM | | | 67.82 | | | | 61.29 | |
| | RF | 70.17 | 0.4044 | 66.83 | 73.51 | 62.37 | 0.2511 | 53.76 | 70.97 |
| | GBDT | 70.73 | 0.4146 | 70.67 | 70.79 | 64.52 | 0.2920 | 59.14 | 69.89 |
| | XGB | 69.18 | 0.3837 | 68.69 | 69.68 | 65.05 | 0.3051 | 56.99 | |
| PSSMS_All | GNB | 64.98 | 0.3083 | | 53.09 | 57.53 | 0.1610 | | 39.78 |
| | KNN | 62.56 | 0.2610 | 76.11 | 49.01 | 64.52 | 0.2920 | 69.89 | 59.14 |
| | DT | 63.06 | 0.2612 | 64.23 | 61.88 | 54.30 | 0.0866 | 48.39 | 60.22 |
| | LR | 69.74 | 0.3949 | 68.81 | 70.67 | 69.35 | 0.3873 | 67.74 | 70.97 |
| | SVM | | | 65.97 | | 73.12 | 0.4734 | 62.37 | |
| | RF | 71.60 | 0.4337 | 67.08 | 76.11 | 73.66 | 0.4812 | 64.52 | 82.80 |
| | GBDT | 70.30 | 0.4061 | 68.81 | 71.78 | | | 70.97 | 79.57 |
| | XGB | 70.24 | 0.4056 | 66.96 | 73.51 | 69.89 | 0.4012 | 63.44 | 76.34 |
| ESM_Avg | GNB | 71.35 | 0.4275 | 73.89 | 68.81 | 70.97 | 0.4209 | 75.27 | 66.67 |
| | KNN | 74.63 | 0.4927 | 73.64 | 75.62 | 72.58 | 0.4548 | 66.67 | 78.49 |
| | DT | 63.00 | 0.2599 | 63.49 | 62.50 | 61.29 | 0.2266 | 56.99 | 65.59 |
| | LR | 78.22 | 0.5646 | 76.86 | 79.58 | 78.49 | 0.5765 | 70.97 | 86.02 |
| | SVM | | | 72.52 | | | | 69.89 | |
| | RF | 74.32 | 0.4864 | | 74.13 | 75.27 | 0.5055 | | 74.19 |
| | GBDT | 76.67 | 0.5339 | | 78.84 | 74.73 | 0.4960 | 70.97 | 78.49 |
| | XGB | 75.43 | 0.5090 | 73.64 | 77.23 | 77.42 | 0.5495 | 74.19 | 80.65 |
| ESM_All | GNB | 65.78 | 0.3235 | 76.73 | 54.83 | 58.60 | 0.1811 | | 43.01 |
| | KNN | 64.17 | 0.2972 | | 49.13 | 66.13 | 0.3269 | 32.69 | 58.06 |
| | DT | 60.27 | 0.2056 | 62.38 | 58.17 | 56.45 | 0.1318 | 46.24 | 66.67 |
| | LR | 78.28 | 0.5658 | 76.86 | 79.70 | 77.96 | 0.5666 | 69.89 | 86.02 |
| | SVM | | | 71.53 | | | | 67.74 | |
| | RF | 72.59 | 0.4520 | 70.79 | 74.38 | 73.12 | 0.4641 | 68.82 | 77.42 |
| | GBDT | 77.72 | 0.5550 | 75.50 | 79.95 | 75.27 | 0.5083 | 69.89 | 80.65 |
| | XGB | 77.41 | 0.5487 | 75.37 | 79.46 | 75.27 | 0.5112 | 67.74 | 82.80 |
Note. The number highlighted in bold is the best result corresponding to one feature set. An underlined number represents the optimal result over all feature sets.
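
The four reported metrics follow from a binary confusion matrix, as in the minimal sketch below. The function name and the example counts are illustrative assumptions: the counts (93 DBPs and 93 non-DBPs) are not stated in this section and were chosen only because they reproduce the GNB row for PSSMR_Avg on the blind test.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return ACC, SP, SN (in percent) and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity: recall on DBPs
    sp = tn / (tn + fp)                      # specificity: recall on non-DBPs
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": 100 * acc, "MCC": mcc, "SP": 100 * sp, "SN": 100 * sn}


# Assumed counts: 66 of 93 positives and 49 of 93 negatives classified correctly.
m = binary_metrics(tp=66, fn=27, tn=49, fp=44)
```

On a class-balanced set, ACC equals the mean of SP and SN, which gives a quick consistency check for individual rows of the table.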
Figure 1. Comparison of importance-based feature selection methods (MIC, Chi2, variance, LR, LinSVM, and RF), evaluated by fivefold cross-validation on four feature sets: (a) PSSMR_All, (b) PSSMS_All, (c) ESM_Avg, and (d) ESM_All.
Figure 2. ACC scores of six feature ranking methods using fewer than 10% of the features in ESM_All.
Figure 3. Performance comparison using equal numbers of features for the two optimal feature selection methods: (a) LR and (b) LinSVM.
Results of embedded FS methods using three regularizers of the linear model, based on fivefold CV (5CV) on PDB1616.
| Feature set | FS method | #Features | Fivefold cross-validation on PDB1616 | | | |
|---|---|---|---|---|---|---|
| | | | ACC (%) | MCC | SP (%) | SN (%) |
| PSSMR_All | NFS | 580 | 71.47 | 0.4306 | 67.82 | 75.12 |
| | ElasticNet | 188 | 72.77 | 0.4563 | 69.68 | 75.87 |
| | Lasso | 61 | 72.83 | 0.4577 | 69.43 | 76.24 |
| | LassoLars | 58 | 73.51 | 0.4707 | 71.53 | 75.50 |
| PSSMS_All | NFS | 580 | 72.77 | 0.4597 | 65.97 | 79.58 |
| | ElasticNet | 207 | 74.01 | 0.4847 | 67.20 | 80.82 |
| | Lasso | 54 | 73.82 | 0.4794 | 68.32 | 79.33 |
| | LassoLars | 38 | 72.77 | 0.4586 | 66.96 | 78.59 |
| ESM_Avg | NFS | 1280 | 79.27 | 0.5908 | 72.52 | 86.01 |
| | ElasticNet | 430 | 81.93 | 0.6442 | 75.37 | 88.49 |
| | Lasso | 142 | 83.11 | 0.6656 | 77.97 | 88.24 |
| | LassoLars | 151 | 82.43 | 0.6514 | 77.72 | 87.13 |
| ESM_All | NFS | 37120 | 78.90 | 0.5843 | 71.53 | 86.26 |
| | ElasticNet | 884 | 86.14 | 0.7267 | 80.94 | 91.34 |
| | Lasso | 367 | | | | |
| | LassoLars | 250 | 86.70 | 0.7353 | 83.66 | 89.73 |
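
Embedded feature selection with an L1 regularizer keeps the features whose coefficients remain nonzero after fitting, which is why the Lasso-type rows above use far fewer features than NFS. The following is a minimal, dependency-free sketch of that idea using coordinate descent on the Lasso objective. The function names, the toy data, and the value of `alpha` are illustrative assumptions; this section does not state which solver or hyperparameters the authors used.

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; values within [-alpha, alpha] become 0."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z else 0.0
    return w


def selected_features(w, tol=1e-12):
    """Indices of features that survive the L1 penalty."""
    return [j for j, coef in enumerate(w) if abs(coef) > tol]


# Toy data: three mutually orthogonal features; the target depends only on feature 0.
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y = [2, -2, 2, -2]
w = lasso_coordinate_descent(X, y, alpha=0.5)
kept = selected_features(w)
```

A larger `alpha` drives more coefficients to zero and therefore selects fewer features.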
Results of RFECV feature selection.
| Feature set | FS method | #Features | 5CV on PDB1616 | | | | Test on PDB186 | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | | ACC (%) | MCC (×100) | SP (%) | SN (%) | ACC (%) | MCC (×100) | SP (%) | SN (%) |
| PSSMR_All | NFS | 580 | 71.47 | 43.06 | 67.82 | 75.12 | 67.20 | 34.65 | 61.29 | 73.12 |
| | LinSVM20 | 116 | | | 71.91 | | | | 62.37 | |
| | LinSVM20_RFE | 110 | | 48.43 | | 76.24 | 68.28 | 36.73 | | 73.12 |
| | LR20 | 116 | 73.64 | 47.35 | 70.92 | 76.36 | 65.59 | 31.45 | 59.14 | 72.04 |
| | LR20_RFE | 114 | 73.58 | 47.22 | 70.92 | 76.24 | 66.13 | 32.58 | 59.14 | 73.12 |
| PSSMS_All | NFS | 580 | 72.77 | 45.97 | 65.97 | 79.58 | 73.12 | 47.34 | 62.37 | |
| | LinSVM30 | 174 | 75.56 | 51.36 | | 80.45 | 73.66 | 48.12 | 64.52 | 82.80 |
| | LinSVM30_RFE | 149 | 75.00 | 50.34 | 69.18 | 80.82 | 71.51 | 43.94 | 61.29 | 81.72 |
| | LR30 | 174 | | 51.93 | 69.18 | 82.30 | | | | |
| | LR30_RFE | 157 | | | 68.69 | | 73.66 | 48.33 | 63.44 | |
| ESM_Avg | NFS | 1280 | 79.27 | 59.08 | 72.52 | 86.01 | 79.03 | 59.06 | 69.89 | 88.17 |
| | LinSVM20 | 256 | 81.31 | 62.92 | 76.49 | 86.14 | 78.49 | 57.85 | 69.89 | 87.10 |
| | LinSVM20_RFE | 253 | 81.06 | 62.38 | 76.61 | 85.52 | 78.49 | 57.85 | 69.89 | 87.10 |
| | LR20 | 256 | | | | | 78.49 | 57.65 | | 86.02 |
| | LR20_RFE | 243 | 82.49 | 65.45 | 76.49 | 88.49 | | | 69.89 | |
| ESM_All | NFS | 37120 | 78.90 | 58.43 | 71.53 | 86.26 | 79.57 | 60.87 | 67.74 | 91.40 |
| | LinSVM5 | 1856 | 85.64 | 71.86 | 79.33 | 91.96 | 79.03 | 60.68 | 64.52 | |
| | LinSVM5_RFE | 1581 | 87.07 | 74.61 | 81.44 | 92.70 | 78.49 | 59.36 | 64.52 | 92.47 |
| | LR4 | 1485 | 90.22 | 80.75 | 85.89 | 94.55 | 80.11 | 61.81 | | 91.40 |
| | LR4_RFE | 1392 | | | | | | | | 92.47 |
Note. The number highlighted in bold is the best result corresponding to one feature set. An underlined number represents the optimal result over all feature sets.
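
The #Features values of the non-RFE rows are consistent with reading the numeric suffix of each method name as the percentage of top-ranked features that is retained (for example, LinSVM20 keeps 20% of 580 features, that is 116). This reading is an assumption inferred from the reported counts, not something stated in this section. A minimal sketch, with hypothetical function names and importance scores:

```python
def n_selected(n_features, percent):
    """Number of features kept when retaining the top `percent` percent."""
    return round(n_features * percent / 100)


def top_percent(scores, percent):
    """Indices of the top-ranked features by absolute importance score."""
    k = n_selected(len(scores), percent)
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return sorted(order[:k])


# Hypothetical importance scores (e.g. linear-model coefficients) for five features.
kept = top_percent([0.1, -0.9, 0.5, 0.05, 0.3], percent=40)
```

Under this reading, the `_RFE` rows start from the retained subset and remove further features, which matches their slightly smaller feature counts.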
Performance comparison of the proposed ensemble FSEiDBP with other predictors validated on the independent dataset PDB186.
| Methods | ACC (%) | MCC | SP (%) | SN (%) |
|---|---|---|---|---|
| DNA-Threader | 59.70 | 0.2790 | | 23.70 |
| DNAbinder | 60.80 | 0.2160 | 64.50 | 57.00 |
| DNA-Prot | 61.80 | 0.2400 | 53.80 | 69.90 |
| iDNA-Prot | 67.20 | 0.3440 | 66.70 | 67.70 |
| DNABIND | 67.70 | 0.3550 | 68.80 | 66.70 |
| Kmer1+ACC | 71.00 | 0.4310 | 59.10 | 82.80 |
| iDNAPro-PseAAC | 71.50 | 0.4420 | 60.20 | 82.80 |
| Wang's method | 76.30 | 0.5570 | 60.20 | 92.50 |
| DBPPred | 76.90 | 0.5380 | 74.20 | 79.60 |
| DPP-PseAAC | 77.40 | 0.5500 | 70.90 | 83.00 |
| Local-DPP | 79.00 | 0.6250 | 65.60 | 92.50 |
| iDBP-DEP | 80.10 | 0.6250 | 66.70 | |
| iDNAProt-ES | 80.64 | 0.6130 | 80.00 | 81.30 |
| FSEiDBP | | | 76.34 | 90.32 |
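
The abstract describes the ensemble as aggregating several trained FS models, but this section does not specify the aggregation rule. One common choice, shown here purely as an assumed illustration, is soft voting: average the DBP probabilities predicted by each model and threshold the mean. The function name, the threshold, and the probabilities are hypothetical.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average per-model probabilities and threshold them into class labels.

    prob_lists holds one list per model; each inner list gives the predicted
    probability of the positive (DBP) class for every sample.
    """
    n_models = len(prob_lists)
    n_samples = len(prob_lists[0])
    averaged = [
        sum(model_probs[i] for model_probs in prob_lists) / n_models
        for i in range(n_samples)
    ]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical outputs of three FS-specific models on three proteins.
probs = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.3],
    [0.8, 0.1, 0.5],
]
averaged, labels = soft_vote(probs)
```

Majority voting on hard labels would be an equally plausible alternative; the two can disagree on borderline samples such as the third protein above.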