Jian Zhang, Haiting Chai, Guifu Yang, Zhiqiang Ma.
Abstract
BACKGROUND: Bioluminescent proteins (BLPs) occur widely in living organisms. Because BLPs emit light, they can serve as biomarkers that are easily detected in biomedical research, for example in gene expression analysis and studies of signal transduction pathways. Accurate identification of BLPs is therefore important for disease diagnosis and biomedical engineering. In this paper, we propose a novel, accurate sequence-based method named PredBLP (Prediction of BioLuminescent Proteins) to predict BLPs.
Keywords: Bioluminescent proteins; Feature analysis; Lineage-specific; Sequence-derived
Year: 2017 PMID: 28583090 PMCID: PMC5460367 DOI: 10.1186/s12859-017-1709-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The pseudo-code of the calculation of motifs
Fig. 2 The flowchart of the proposed method
Fig. 3 The relative amino acid composition of BLPs against non-BLPs on four datasets
Fig. 4 The relative dipeptide composition of BLPs against that of non-BLPs in four datasets. The x-axis indicates the amino acids which are cleaved on the C-terminal side, while the y-axis indicates those on the N-terminal side. Detailed values are provided in Additional file 1: Tables A3-A6
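The amino acid and dipeptide compositions behind Figs. 3 and 4 can be illustrated with a minimal sketch. It assumes the usual definitions (the fraction of each of the 20 standard residues, and the fraction of each of the 400 ordered adjacent residue pairs); the function names are hypothetical, and how the paper derives the "relative" composition of BLPs against non-BLPs is not reproduced here.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Fraction of each standard residue in seq (20 values).

    Assumption: non-standard symbols (e.g. 'X') are ignored.
    """
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return {a: 0.0 for a in AMINO_ACIDS}
    return {a: counts[a] / total for a in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Fraction of each ordered pair of adjacent residues (400 values)."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    total = sum(pairs[k] for k in keys)
    if total == 0:
        return {k: 0.0 for k in keys}
    return {k: pairs[k] / total for k in keys}
```

For example, `amino_acid_composition("AACD")["A"]` evaluates to 0.5, and `dipeptide_composition("AACD")` assigns one third each to `AA`, `AC` and `CD`.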
Selected top 10 motifs according to the descending order of DIG values
| General motif | DIG | Bacteria motif | DIG | Eukaryota motif | DIG | Archaea motif | DIG |
|---|---|---|---|---|---|---|---|
| EHH | 0.669 | EHH | 0.693 | G-T-G-P | 0.617 | DGW | 0.809 |
| EH-H | 0.627 | LS-GR | 0.675 | SG-T-G | 0.574 | G-GW | 0.800 |
| L-S-GR | 0.608 | EH-H | 0.644 | GM-E | 0.563 | A-TLD | 0.779 |
| E-HH | 0.607 | L-S-GR | 0.629 | G-M-E | 0.518 | A-A-T-D | 0.745 |
| LS-G-R | 0.588 | E-HH | 0.624 | FVE | 0.505 | D-W-P | 0.741 |
| L-G-GR | 0.579 | L-G-GR | 0.621 | TGD | 0.494 | A-T-LD | 0.733 |
| E-H-H | 0.561 | LS-G-R | 0.608 | FD-I | 0.489 | A-TL-D | 0.726 |
| S-G-G-R | 0.536 | S-G-G-R | 0.587 | D-GY | 0.479 | D-GW | 0.726 |
| A-A-T-R | 0.519 | E-H-H | 0.565 | F-YG | 0.461 | GFD | 0.704 |
| L-S-G-R | 0.515 | A-A-T-R | 0.542 | F-M-G | 0.460 | DG-W | 0.704 |
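Checking whether a sequence contains one of the motifs listed above could look like the following sketch. It rests on an assumption that this record does not confirm: that each `-` in a motif stands for exactly one arbitrary residue between the fixed residues. The helper names are hypothetical, and the DIG score itself is not computed.

```python
import re


def motif_to_regex(motif):
    """Compile a gapped motif such as 'EH-H' into a regular expression.

    ASSUMPTION: '-' matches exactly one arbitrary residue.
    """
    pattern = "".join("." if ch == "-" else re.escape(ch) for ch in motif)
    return re.compile(pattern)


def contains_motif(seq, motif):
    """True if the motif occurs anywhere in the sequence."""
    return motif_to_regex(motif).search(seq) is not None
```

Under that assumption, `contains_motif("AEHQHK", "EH-H")` is true, while `contains_motif("AEHHK", "EH-H")` is false because no residue sits between the two histidines.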
Fig. 5 Basal levels of selected physicochemical and biological properties in four datasets. Midline, box boundaries, and whiskers indicate median, quartiles, and 10th and 90th percentiles. The x-axis shows the normalized values, and the y-axis lists the twelve properties. In this work, a physicochemical property is empirically regarded as discriminatory provided that the overlap of the two boxes is less than 80% of either box
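The box-overlap rule in the caption of Fig. 5 can be sketched as follows. This is one reading of the rule, not the authors' code: the boxes are taken as the interquartile ranges of the BLP and non-BLP values, and "less than 80% of either box" is interpreted as requiring the overlap to cover less than 80% of each box. The function names and the quartile method (the default of `statistics.quantiles`) are assumptions.

```python
from statistics import quantiles


def box(values):
    """Interquartile box (Q1, Q3) of a sample with at least two values."""
    q1, _, q3 = quantiles(values, n=4)
    return q1, q3


def overlap_fraction(a, b):
    """Share of box a = (lo, hi) that is covered by box b."""
    lo, hi = a
    if hi == lo:  # degenerate box: either inside b or not
        return 1.0 if b[0] <= lo <= b[1] else 0.0
    return max(0.0, min(hi, b[1]) - max(lo, b[0])) / (hi - lo)


def is_discriminatory(pos_values, neg_values, threshold=0.8):
    """ASSUMPTION: overlap must stay below the threshold for both boxes."""
    bp, bn = box(pos_values), box(neg_values)
    return (overlap_fraction(bp, bn) < threshold
            and overlap_fraction(bn, bp) < threshold)
```

With this reading, two samples whose boxes do not touch are discriminatory, and two identical samples are not.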
The experimental results of various individual and combinative features on the training set for general BLPs
| Type | Feature | Sensitivity | Specificity | Accuracy | MCC | AUC |
|---|---|---|---|---|---|---|
| Individual | AACᵃ | 0.729 ± 0.029 | 0.806 ± 0.023 | 0.767 ± 0.020 | 0.537 ± 0.039 | 0.802 ± 0.012 |
| | DCᵇ | 0.791 ± 0.017 | 0.857 ± 0.028 | 0.824 ± 0.014 | 0.650 ± 0.029 | 0.830 ± 0.016 |
| | MTFᶜ | 0.313 ± 0.017 | 0.942 ± 0.012 | 0.628 ± 0.008 | 0.328 ± 0.018 | 0.653 ± 0.010 |
| | PCPᵈ | 0.452 ± 0.010 | 0.910 ± 0.026 | 0.681 ± 0.010 | 0.408 ± 0.029 | 0.763 ± 0.014 |
| Combinative | AAC + DC | 0.799 ± 0.015 | 0.862 ± 0.026 | 0.830 ± 0.008 | 0.663 ± 0.018 | 0.841 ± 0.012 |
| | AAC + MTF | 0.764 ± 0.016 | 0.801 ± 0.021 | 0.783 ± 0.005 | 0.566 ± 0.011 | 0.810 ± 0.007 |
| | AAC + PCP | 0.728 ± 0.013 | 0.809 ± 0.013 | 0.768 ± 0.007 | 0.538 ± 0.015 | 0.813 ± 0.008 |
| | DC + MTF | 0.799 ± 0.014 | 0.854 ± 0.008 | 0.826 ± 0.008 | 0.653 ± 0.015 | 0.836 ± 0.005 |
| | DC + PCP | 0.775 ± 0.014 | 0.878 ± 0.020 | 0.827 ± 0.004 | 0.658 ± 0.009 | 0.841 ± 0.006 |
| | MTF + PCP | 0.477 ± 0.010 | 0.917 ± 0.016 | 0.697 ± 0.004 | 0.440 ± 0.014 | 0.764 ± 0.020 |
| | AAC + DC + MTF | 0.772 ± 0.007 | 0.888 ± 0.011 | 0.830 ± 0.008 | 0.665 ± 0.016 | 0.842 ± 0.006 |
| | AAC + DC + PCP | 0.780 ± 0.007 | 0.880 ± 0.016 | 0.830 ± 0.009 | 0.663 ± 0.019 | 0.845 ± 0.009 |
| | AAC + MTF + PCP | 0.742 ± 0.011 | 0.793 ± 0.004 | 0.767 ± 0.005 | 0.536 ± 0.010 | 0.816 ± 0.004 |
| | DC + MTF + PCP | 0.775 ± 0.014 | 0.886 ± 0.025 | 0.830 ± 0.011 | 0.665 ± 0.023 | 0.845 ± 0.014 |
| | AAC + DC + MTF + PCP | 0.770 ± 0.010 | 0.894 ± 0.014 | 0.836 ± 0.004 | 0.676 ± 0.010 | 0.850 ± 0.006 |
The results are reported by maximizing the MCC value of prediction on the corresponding dataset over five-fold cross-validation. Feature abbreviations: a, amino acid composition (AAC); b, dipeptide composition (DC); c, motifs (MTF); d, physicochemical properties (PCP)
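The threshold-dependent measures reported in these tables (sensitivity, specificity, accuracy and the Matthews correlation coefficient, MCC) follow from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; the function name is hypothetical, and it assumes both classes are present so that no denominator for sensitivity or specificity is zero. AUC is not covered because it requires ranked prediction scores.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

Because MCC uses all four counts, it can stay low on a heavily imbalanced test set even when specificity and accuracy are high.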
The performance of optimum feature subsets on four training sets using five-fold cross-validation
| Lineage | Number of features | Sensitivity | Specificity | Accuracy | MCC | AUC |
|---|---|---|---|---|---|---|
| General | 199 | 0.732 ± 0.010 | 0.949 ± 0.022 | 0.841 ± 0.006 | 0.698 ± 0.018 | 0.883 ± 0.007 |
| Bacteria | 174 | 0.832 ± 0.012 | 0.943 ± 0.016 | 0.888 ± 0.006 | 0.780 ± 0.013 | 0.920 ± 0.010 |
| Eukaryota | 204 | 0.667 ± 0.053 | 0.833 ± 0.053 | 0.750 ± 0.026 | 0.510 ± 0.054 | 0.806 ± 0.015 |
| Archaea | 129 | 0.825 ± 0.061 | 0.900 ± 0.094 | 0.863 ± 0.047 | 0.733 ± 0.095 | 0.917 ± 0.019 |
The results are reported by maximizing the MCC value of prediction on the corresponding dataset
Fig. 6 Venn diagrams of the overlap (green zone) between the discriminatory (orange pie) and selected useful features (blue pie) in the optimal subset for each type of features. D indicates the discriminatory features and s stands for selected useful features
Comparison of lineage-specific models with traditional universal models on three training sets using five-fold cross-validation
| Lineage | Model | Sensitivity | Specificity | Accuracy | MCC | AUC |
|---|---|---|---|---|---|---|
| Bacteria | PredBLP-U | 0.790 ± 0.010 | 0.918 ± 0.014 | 0.854 ± 0.003 | 0.714 ± 0.007 | 0.872 ± 0.006 |
| | PredBLP-B | 0.832 ± 0.012 | 0.943 ± 0.016 | 0.888 ± 0.006 | 0.780 ± 0.013 | 0.920 ± 0.010 |
| Eukaryota | PredBLP-U | 0.417 ± 0.053 | 0.883 ± 0.041 | 0.650 ± 0.033 | 0.340 ± 0.075 | 0.670 ± 0.017 |
| | PredBLP-E | 0.667 ± 0.053 | 0.833 ± 0.053 | 0.750 ± 0.026 | 0.510 ± 0.054 | 0.806 ± 0.015 |
| Archaea | PredBLP-U | 0.750 ± 0.079 | 0.875 ± 0.079 | 0.813 ± 0.040 | 0.637 ± 0.081 | 0.868 ± 0.016 |
| | PredBLP-A | 0.825 ± 0.061 | 0.900 ± 0.094 | 0.863 ± 0.047 | 0.733 ± 0.095 | 0.917 ± 0.019 |
The results are reported by maximizing the MCC values of prediction on the corresponding dataset over five-fold cross-validation. PredBLP-U stands for the universal model of the proposed PredBLP predictor. PredBLP-B, PredBLP-E and PredBLP-A indicate three lineage-specific models (i.e. bacteria-, eukaryota- and archaea- specific model) respectively
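The "mean ± deviation" entries obtained over five-fold cross-validation can be illustrated with a small sketch. It is a generic, non-stratified k-fold split written with the standard library only; the function names, the fixed seed, and the use of the sample standard deviation for the ± term are assumptions, since this record does not state how the spread was computed.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def summarize(scores):
    """Format per-fold scores as 'mean ± sample standard deviation'."""
    return f"{mean(scores):.3f} ± {stdev(scores):.3f}"
```

A model would be trained on each `train` list and scored on the matching `test` list, and the five scores per measure would then be passed to `summarize`.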
Comparison of the proposed PredBLP-U with previous methods on Kandaswamy’s training dataset
| Method | Sensitivity | Specificity | Accuracy | MCC | AUC |
|---|---|---|---|---|---|
| BLProt | 0.745 | 0.842 | 0.801 | 0.590 | 0.870 |
| BLPre | 0.793 | 0.910 | 0.852 | N/A | 0.920 |
| Fan’s method | 0.883 | 0.927 | 0.905 | 0.810 | 0.950 |
| SCMBLP | 0.897 | 0.920 | 0.908 | N/A | N/A |
| BLKnn | 0.749 | 0.955 | 0.852 | 0.719 | N/A |
| Nath’s method | 0.964 | 0.942 | 0.954 | N/A | 0.991 |
| PredBLP-U | 0.912 ± 0.014 | 0.962 ± 0.017 | 0.937 ± 0.009 | 0.875 ± 0.018 | 0.968 ± 0.009 |
Comparison of PredBLP with other methods on the independent testing dataset
| Lineage | Predictor | Sensitivity | Specificity | Accuracy | MCC | AUC | P-value |
|---|---|---|---|---|---|---|---|
| General | BLProt | 0.348 ± 0.022 | 0.903 ± 0.007 | 0.888 ± 0.007 | 0.132 ± 0.006 | 0.672 ± 0.010 | 0.002 |
| | SCMBLP | 0.471 ± 0.019 | 0.868 ± 0.008 | 0.858 ± 0.007 | 0.157 ± 0.004 | N/A | 0.002 |
| | PredBLP-U | 0.611 ± 0.013 | 0.921 ± 0.005 | 0.913 ± 0.004 | 0.294 ± 0.007 | 0.784 ± 0.007 | N/A |
| Bacteria | BLProt | 0.584 ± 0.020 | 0.769 ± 0.011 | 0.788 ± 0.010 | 0.166 ± 0.008 | 0.674 ± 0.008 | 0.002 |
| | SCMBLP | 0.569 ± 0.021 | 0.840 ± 0.013 | 0.831 ± 0.012 | 0.194 ± 0.005 | N/A | 0.002 |
| | PredBLP-U | 0.606 ± 0.015 | 0.909 ± 0.010 | 0.899 ± 0.009 | 0.299 ± 0.013 | 0.773 ± 0.009 | 0.002 |
| | PredBLP-B | 0.638 ± 0.017 | 0.927 ± 0.008 | 0.917 ± 0.007 | 0.352 ± 0.012 | 0.817 | N/A |
| Eukaryota | BLProt | 0.417 ± 0.037 | 0.966 ± 0.010 | 0.960 ± 0.010 | 0.212 ± 0.018 | 0.719 ± 0.016 | 0.002 |
| | SCMBLP | 0.667 ± 0.053 | 0.914 ± 0.014 | 0.912 ± 0.013 | 0.209 ± 0.009 | N/A | 0.002 |
| | PredBLP-U | 0.642 ± 0.038 | 0.954 ± 0.007 | 0.951 ± 0.006 | 0.279 ± 0.011 | 0.765 ± 0.007 | 0.004 |
| | PredBLP-E | 0.750 ± 0.037 | 0.946 ± 0.006 | 0.944 ± 0.005 | 0.301 ± 0.010 | 0.836 ± 0.006 | N/A |
| Archaea | BLProt | 0.583 ± 0.057 | 0.842 ± 0.016 | 0.838 ± 0.015 | 0.120 ± 0.010 | 0.666 ± 0.007 | 0.002 |
| | SCMBLP | 0.550 ± 0.061 | 0.883 ± 0.013 | 0.878 ± 0.012 | 0.154 ± 0.019 | N/A | 0.002 |
| | PredBLP-U | 0.775 ± 0.050 | 0.893 ± 0.014 | 0.891 ± 0.013 | 0.244 ± 0.012 | 0.751 ± 0.010 | 0.002 |
| | PredBLP-A | 0.750 ± 0.056 | 0.922 ± 0.012 | 0.920 ± 0.011 | 0.279 ± 0.012 | 0.789 ± 0.010 | N/A |
Comparison of PredBLP with other methods on newly deposited BLPs
| Lineage | Number of newly deposited BLPs | Predictor | Fraction of correctly identified BLPs | P-value |
|---|---|---|---|---|
| General | 3741 | BLProt | 0.621 ± 0.013 | 0.002 |
| | | SCMBLP | 0.792 ± 0.012 | 0.002 |
| | | PredBLP-U | 0.889 ± 0.016 | N/A |
| Bacteria | 3614 | BLProt | 0.625 ± 0.022 | 0.002 |
| | | SCMBLP | 0.795 ± 0.016 | 0.002 |
| | | PredBLP-U | 0.887 ± 0.016 | 0.037 |
| | | PredBLP-B | 0.912 ± 0.015 | N/A |
| Eukaryota | 106 | BLProt | 0.841 ± 0.041 | 0.002 |
| | | SCMBLP | 0.908 ± 0.032 | 0.002 |
| | | PredBLP-U | 0.651 ± 0.031 | 0.002 |
| | | PredBLP-E | 0.983 ± 0.013 | N/A |
| Archaea | 21 | BLProt | 0.497 ± 0.046 | 0.002 |
| | | SCMBLP | 0.954 ± 0.024 | 0.031 |
| | | PredBLP-U | 0.980 ± 0.029 | 0.625 |
| | | PredBLP-A | 0.993 ± 0.024 | N/A |