Cong Shen, Yijie Ding, Jijun Tang, Jian Song, Fei Guo.
Abstract
DNA–protein interactions play pivotal roles in diverse biological processes and are essential for cell metabolism. Identifying them computationally is a prudent way to reduce the cost of in vitro and in vivo experiments. A variety of state-of-the-art methods have been developed to improve the accuracy of DNA–protein binding site prediction. Nevertheless, structure-based approaches are limited when 3D information is unavailable, and predictive performance can still be improved. In this paper, we propose a competitive method, the Multi-scale Local Average Blocks (MLAB) algorithm, to address this issue. Unlike structure-based approaches, MLAB not only extracts local evolutionary information from primary sequences but also uses predicted solvent accessibility. Moreover, the predictor of DNA–protein binding sites is built as an ensemble of weighted sparse representation models with random under-sampling. To evaluate MLAB, we conduct comprehensive experiments on DNA–protein binding site prediction. MLAB achieves an MCC of 0.392, 0.315, 0.439 and 0.245 on the PDNA-543, PDNA-41, PDNA-316 and PDNA-52 datasets, respectively, and compares favorably with other leading methods: its MCC is higher by at least 0.053, 0.015 and 0.064 on the PDNA-543, PDNA-41 and PDNA-316 datasets, respectively.
Keywords: DNA–protein binding sites; ensemble classifier; feature extraction; random sub-sampling; sparse representation model
Year: 2017 PMID: 29182548 PMCID: PMC6149935 DOI: 10.3390/molecules22122079
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. Schematic diagram of PSSM (Position Specific Scoring Matrix)-MLAB (Multi-scale Local Average Blocks) feature extraction.
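To make the idea named in Figure 1 concrete, the following is a minimal sketch of averaging a PSSM window over contiguous row blocks at several scales and concatenating the results. It rests on assumptions: the function names, the block scales (1, 2, 4) and the equal-split rule are hypothetical choices for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def local_average_blocks(window: Sequence[Sequence[float]], n_blocks: int) -> List[float]:
    """Split the rows of a PSSM window into n_blocks contiguous blocks and
    average every column inside each block (hypothetical helper)."""
    n_rows = len(window)
    if not 1 <= n_blocks <= n_rows:
        raise ValueError("n_blocks must be between 1 and the number of rows")
    n_cols = len(window[0])
    features: List[float] = []
    for b in range(n_blocks):
        # Integer arithmetic gives near-equal, non-empty blocks.
        start = b * n_rows // n_blocks
        stop = (b + 1) * n_rows // n_blocks
        block = window[start:stop]
        for c in range(n_cols):
            features.append(sum(row[c] for row in block) / len(block))
    return features


def multi_scale_features(window: Sequence[Sequence[float]],
                         scales: Sequence[int] = (1, 2, 4)) -> List[float]:
    """Concatenate block averages computed at each scale (assumed scales)."""
    features: List[float] = []
    for n_blocks in scales:
        features.extend(local_average_blocks(window, n_blocks))
    return features
```

With a window of 9 residues and 20 PSSM columns, the assumed scales (1, 2, 4) yield (1 + 2 + 4) × 20 = 140 features per residue.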
Figure 2. Overview of the ensemble classifier.
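An ensemble with random under-sampling, as outlined in Figure 2, can be sketched as follows. This is only an illustration under stated assumptions: the base learner is a simple nearest-centroid scorer used as a stand-in (it is not WSRC), the ensemble score is assumed to be a plain average of base scores, and all names are hypothetical.

```python
import random
from typing import Callable, List, Sequence

Sample = Sequence[float]
Scorer = Callable[[Sample], float]


def centroid_scorer(pos: Sequence[Sample], neg: Sequence[Sample]) -> Scorer:
    """Stand-in base classifier (NOT WSRC): score in [0, 1], higher means
    the sample lies closer to the positive-class centroid."""
    def centroid(rows: Sequence[Sample]) -> List[float]:
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_pos, c_neg = centroid(pos), centroid(neg)

    def dist2(x: Sample, c: Sample) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))

    def score(x: Sample) -> float:
        d_pos, d_neg = dist2(x, c_pos), dist2(x, c_neg)
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total

    return score


def train_rus_ensemble(pos: Sequence[Sample], neg: Sequence[Sample],
                       n_models: int = 5, seed: int = 0,
                       fit: Callable[..., Scorer] = centroid_scorer) -> List[Scorer]:
    """Train n_models base classifiers, each on all of the smaller class and
    an equally sized random subset of the larger (negative) class."""
    rng = random.Random(seed)
    k = min(len(pos), len(neg))
    return [fit(list(pos), rng.sample(list(neg), k)) for _ in range(n_models)]


def ensemble_score(models: Sequence[Scorer], x: Sample) -> float:
    """Average the base scores (assumed combination rule)."""
    return sum(m(x) for m in models) / len(models)
```

Each base model sees a balanced subset, so the majority (non-binding) class cannot dominate training, while different random subsets let the ensemble cover more of the negative samples.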
Four different datasets of DNA–protein binding sites.
| Dataset | No. of Sequences | No. of Binding | No. of Non-Binding | Ratio |
|---|---|---|---|---|
| PDNA (Protein and DNA)-543 | 543 | 9549 | | |
| PDNA-41 | 41 | 734 | | |
| PDNA-335 | 335 | 6461 | | |
| PDNA-52 | 52 | 973 | | |
| PDNA-316 | 316 | 5609 | | |

No. of Binding is the number of positive samples. No. of Non-Binding is the number of negative samples. Ratio = No. of Non-Binding / No. of Binding.
Figure 3. The MCC (Matthews Correlation Coefficient) of PSSM (Position Specific Scoring Matrix)-MLAB (Multi-scale Local Average Blocks) with different sliding-window sizes and numbers of base classifiers (WSRC (Weighted Sparse Representation based Classifier), Equation (16)).
The performance comparison of different features through ten-fold cross-validation by EC-RUS (Ensemble Classifier with Random Under-Sampling) (WSRC (Weighted Sparse Representation based Classifier), Equation (16)) on PDNA-543 dataset.
| Feature | Sensitivity (SN) | Specificity (SP) | Accuracy | Precision | MCC | AUC |
|---|---|---|---|---|---|---|
| PSSM (SN ≈ SP) | 0.7738 | 0.7570 | 0.7581 | 0.1844 | 0.294 | 0.843 |
| PSSM (FPR ≈ 5%) | 0.4377 | 0.9500 | 0.9160 | 0.3832 | 0.364 | 0.843 |
| PSSM + PSA (SN ≈ SP) | 0.7850 | 0.7590 | 0.7607 | 0.1874 | 0.302 | 0.851 |
| PSSM + PSA (FPR ≈ 5%) | 0.4541 | 0.9494 | 0.9166 | 0.3886 | 0.375 | 0.851 |
| PSSM-MLAB (SN ≈ SP) | 0.7744 | 0.7599 | 0.7609 | 0.1864 | 0.297 | 0.848 |
| PSSM-MLAB (FPR ≈ 5%) | 0.4516 | | 0.9178 | 0.3955 | 0.378 | 0.848 |
| PSSM-MLAB + PSA (SN ≈ SP) | | 0.7629 | 0.7646 | 0.1907 | 0.307 | |
| PSSM-MLAB + PSA (FPR ≈ 5%) | 0.4762 | 0.9492 | | | | |
Figure 4. The area under the receiver operating characteristic curve and the area under the precision–recall curve of PSSM (Position Specific Scoring Matrix), PSSM + PSA (Predicted Solvent Accessibility), PSSM-MLAB (Multi-scale Local Average Blocks) and PSSM-MLAB + PSA, obtained with EC-RUS (Ensemble Classifier with Random Under-Sampling) (WSRC (Weighted Sparse Representation based Classifier), Equation (16)) on the PDNA (Protein and DNA)-543 dataset over a ten-fold cross-validation test. (a) Receiver operating characteristic curves; (b) precision–recall curves.
Figure 5. Results for different score functions on PDNA-543. Types 1, 2 and 3 represent Equations (16)–(18), respectively.
Comparison with the TargetDNA on PDNA-543 dataset by ten-fold cross-validation.
| Methods | Sensitivity (SN) | Specificity (SP) | Accuracy | Precision | MCC | AUC |
|---|---|---|---|---|---|---|
| TargetDNA (SN ≈ SP) | 0.7698 | 0.7705 | 0.7704 | 0.1918 | 0.304 | 0.845 |
| TargetDNA (FPR ≈ 5%) | 0.4060 | | 0.9140 | 0.3647 | 0.339 | 0.845 |
| Our method (SN ≈ SP) | | 0.7629 | 0.7646 | 0.1907 | 0.307 | |
| Our method (FPR ≈ 5%) | 0.4762 | 0.9492 | | | | |
Results excerpted from [13].
Comparison with some state-of-the-art works on the Independent PDNA-41 dataset.
| Methods | MCC | Sensitivity (SN) | Specificity (SP) | Accuracy | Precision |
|---|---|---|---|---|---|
| BindN | 0.143 | 0.4564 | 0.8090 | 0.7915 | 0.1112 |
| ProteDNA | 0.160 | 0.0477 | | | |
| BindN+ (FPR ≈ 5%) | 0.178 | 0.2411 | 0.9511 | 0.9158 | 0.2051 |
| BindN+ (SP ≈ 85%) | 0.213 | 0.5081 | 0.8541 | 0.8369 | 0.1542 |
| MetaDBSite | 0.221 | 0.3420 | 0.9335 | 0.9041 | 0.2122 |
| DP-Bind | 0.241 | 0.6172 | 0.8243 | 0.8140 | 0.1553 |
| DNABind | 0.264 | 0.7016 | 0.8028 | 0.7978 | 0.1570 |
| TargetDNA (SN ≈ SP) | 0.269 | 0.6022 | 0.8579 | 0.8452 | 0.1816 |
| TargetDNA (FPR ≈ 5%) | 0.300 | 0.4550 | 0.9327 | 0.9089 | 0.2613 |
| EC-RUS (WSRC) (SN ≈ SP) | 0.193 | 0.6104 | 0.7725 | 0.7644 | 0.1231 |
| EC-RUS (WSRC) (FPR ≈ 5%) | | 0.2725 | 0.9731 | 0.9458 | 0.4292 |
| EC-RUS (SVM) (SN ≈ SP) | 0.261 | 0.6975 | 0.8032 | 0.7972 | 0.1567 |
| EC-RUS (SVM) (FPR ≈ 5%) | 0.302 | 0.3787 | 0.9577 | 0.9281 | 0.3092 |
| EC-RUS (RF) (SN ≈ SP) | 0.234 | 0.6785 | 0.7818 | 0.7767 | 0.1401 |
| EC-RUS (RF) (FPR ≈ 5%) | 0.261 | 0.3351 | 0.9524 | 0.9217 | 0.2691 |
| EC-RUS (L1-LR) (SN ≈ SP) | 0.228 | 0.6199 | 0.8084 | 0.7991 | 0.1449 |
| EC-RUS (L1-LR) (FPR ≈ 5%) | 0.246 | 0.3120 | 0.9541 | 0.9221 | 0.2623 |
| EC-RUS (SBL) (SN ≈ SP) | 0.219 | | 0.7434 | 0.7416 | 0.1263 |
| EC-RUS (SBL) (FPR ≈ 5%) | 0.247 | 0.3202 | 0.9521 | 0.9206 | 0.2591 |
*: Results excerpted from [13]. The feature is PSSM-MLAB + PSA. In addition, the EC-RUS model is built with different base classifiers.
Figure 6. Results for different probability thresholds on the independent test set PDNA-41. Rate p/n is the ratio of the number of predicted binding sites to the number of predicted non-binding sites.
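The rate p/n described in the caption can be computed for any probability threshold as in the sketch below. The function names are hypothetical, and treating a residue as a predicted binding site when its probability is at or above the threshold (`>=`) is an assumed convention.

```python
from typing import List, Sequence, Tuple


def predicted_rate(probabilities: Sequence[float], threshold: float) -> float:
    """Return (# predicted binding) / (# predicted non-binding) at a threshold.
    A residue counts as binding when probability >= threshold (assumption)."""
    n_pos = sum(1 for p in probabilities if p >= threshold)
    n_neg = len(probabilities) - n_pos
    return float("inf") if n_neg == 0 else n_pos / n_neg


def sweep(probabilities: Sequence[float],
          thresholds: Sequence[float]) -> List[Tuple[float, float]]:
    """Pair each threshold with its rate p/n."""
    return [(t, predicted_rate(probabilities, t)) for t in thresholds]
```

Raising the threshold moves residues from the predicted-binding group to the predicted non-binding group, so the rate p/n can only stay the same or decrease.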
Comparison of the prediction performance between the proposed method and some state-of-the-art works on PDNA-316 dataset.
| Methods | Sensitivity (SN) | Specificity (SP) | Accuracy | MCC |
|---|---|---|---|---|
| DBS-PRED | 0.5300 | 0.7600 | 0.7500 | 0.170 |
| BindN | 0.5400 | 0.8000 | 0.7800 | 0.210 |
| DNABindR | 0.6600 | 0.7400 | 0.7300 | 0.230 |
| DISIS | 0.1900 | | | 0.250 |
| DP-Bind | 0.6900 | 0.7900 | 0.7800 | 0.290 |
| BindN-RF | 0.6700 | 0.8300 | 0.8200 | 0.320 |
| MetaDBSite | 0.7700 | 0.7700 | 0.7700 | 0.320 |
| TargetDNA (SN ≈ SP) | 0.7796 | 0.7803 | 0.7802 | 0.339 |
| TargetDNA (FPR ≈ 5%) | 0.4302 | 0.9500 | 0.9099 | 0.375 |
| EC-RUS (WSRC) (SN ≈ SP) | | 0.7818 | 0.7837 | 0.356 |
| EC-RUS (WSRC) (FPR ≈ 5%) | 0.5108 | 0.9499 | 0.9161 | |
| EC-RUS (SVM) (SN ≈ SP) | 0.8011 | 0.7969 | 0.7973 | 0.369 |
| EC-RUS (SVM) (FPR ≈ 5%) | 0.4935 | 0.9500 | 0.9150 | 0.426 |
| EC-RUS (RF) (SN ≈ SP) | 0.7989 | 0.7542 | 0.7576 | 0.326 |
| EC-RUS (RF) (FPR ≈ 5%) | 0.4521 | 0.9502 | 0.9118 | 0.394 |
| EC-RUS (L1-LR) (SN ≈ SP) | 0.7347 | 0.7659 | 0.7635 | 0.300 |
| EC-RUS (L1-LR) (FPR ≈ 5%) | 0.3523 | 0.9498 | 0.9037 | 0.319 |
| EC-RUS (SBL) (SN ≈ SP) | 0.7453 | 0.7540 | 0.7533 | 0.295 |
| EC-RUS (SBL) (FPR ≈ 5%) | 0.3562 | 0.9480 | 0.9023 | 0.317 |
*: Results excerpted from [12,13]. The feature is PSSM-MLAB + PSA. In addition, the EC-RUS model is built with different base classifiers.
Comparison with some state-of-the-art works on the PDNA-52 dataset when the threshold is chosen to maximize MCC.
| Methods | Sensitivity (SN) | Specificity (SP) | Accuracy | MCC | AUC |
|---|---|---|---|---|---|
| TargetS | 0.413 | | | | |
| MetaDBSite | 0.580 | 0.764 | 0.752 | 0.192 | - |
| DNABR | 0.407 | 0.873 | 0.846 | 0.185 | - |
| alignment-based | 0.266 | 0.943 | 0.905 | 0.190 | - |
| EC-RUS (WSRC) | 0.467 | 0.913 | 0.896 | 0.245 | 0.808 |
| EC-RUS (SVM) | 0.528 | 0.835 | 0.823 | 0.185 | 0.756 |
| EC-RUS (RF) | 0.561 | 0.773 | 0.764 | 0.152 | 0.741 |
| EC-RUS (L1-LR) | 0.594 | 0.811 | 0.803 | 0.201 | 0.787 |
| EC-RUS (SBL) | | 0.782 | 0.776 | 0.192 | 0.786 |
*: Results excerpted from [57]. The feature is PSSM-MLAB + PSA. In addition, the EC-RUS model is built with different base classifiers.
Figure 7. Results for different probability thresholds on the independent test set PDNA-52. Rate p/n is the ratio of the number of predicted binding sites to the number of predicted non-binding sites.
The statistical significance of the difference between other methods (including MetaDBSite and TargetDNA) and our method.
| Methods | |
|---|---|
| Our method-MetaDBSite | 0.2667 |
| Our method-TargetDNA | 0.4610 |
Figure 8. Representative protein–DNA complexes: upper, 4X0P-D (PDB ID: 4X0P, chain D); lower, 5BMZ-CD (PDB ID: 5BMZ, chains C and D).
Comparison with DP (DNA Protein)-Bind on 4X0P-D and 5BMZ-D.
| PDB (Protein Data Bank) ID | Method | TP | TN | FP | FN |
|---|---|---|---|---|---|
| 4X0P-D | our method | 24 | 559 | 34 | 8 |
| | DP-Bind | 29 | 439 | 154 | 3 |
| 5BMZ-D | our method | 14 | 110 | 9 | 3 |
| | DP-Bind | 10 | 103 | 16 | 7 |
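The four counts per row can be turned into the metrics reported throughout the paper. The sketch below assumes the count columns are, in order, true positives, true negatives, false positives and false negatives; this reading is consistent with the rows, since the first and fourth counts sum to the same total for both methods on each chain (32 for 4X0P-D, 17 for 5BMZ-D), as do the second and third. The function name and the zero-denominator convention are hypothetical.

```python
import math
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity, accuracy, precision and MCC from confusion
    counts. A ratio with a zero denominator is reported as 0.0 (assumption)."""
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Counts for "our method" on 4X0P-D, read as TP, TN, FP, FN (assumed order).
metrics_4x0p = binary_metrics(tp=24, tn=559, fp=34, fn=8)
```

Because non-binding residues far outnumber binding residues, accuracy is dominated by specificity, which is why MCC is the more informative single number for comparing methods here.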
The running time (seconds) of EC-RUS on PDNA-41 and PDNA-52 independent testing sets.
| Classifier | PDNA-41 | PDNA-52 |
|---|---|---|
| EC-RUS (WSRC) | 9227 | 14,407 |
| EC-RUS (L1-LR) | 705 | 232 |
| EC-RUS (RF) | 3778 | 1632 |
| EC-RUS (SBL) | 136,241 | 40,121 |
| EC-RUS (SVM) | 27,210 | 2043 |