Jian Zhang, Haiting Chai, Song Guo, Huaping Guo, Yanling Li.
Abstract
Secreted proteins are widespread in living organisms and cells. Because secreted proteins are easily detected in body fluids such as urine and saliva during clinical diagnosis, they play important roles as biomarkers for disease diagnosis and in vaccine production. In this study, we propose a novel predictor for accurate high-throughput identification of mammalian secreted proteins based on sequence-derived features. We combine amino acid composition, sequence motifs, and physicochemical properties to encode the collected proteins. Detailed feature analyses support the effectiveness of the considered features. Based on the differences among secreted proteins from various species, we introduce a species-specific scheme, which is expected to further explore the intrinsic attributes of specific secreted proteins. Experiments on benchmark datasets demonstrate the effectiveness of our proposed method, and a test on an independent testing dataset indicates good generalization capability. Compared with the traditional universal model, we experimentally demonstrate that the species-specific scheme significantly improves prediction performance. We use our method to make predictions on the unreviewed human proteome and find 272 potential secreted proteins with predicted probabilities higher than 99%. A user-friendly web server named iMSPs (identification of Mammalian Secreted Proteins), which implements our proposed method, is freely available for academic use at: http://www.inforstation.com/webservers/iMSP/.
Keywords: high-throughput; human proteome; secreted proteins; species-specific
Year: 2018 PMID: 29903999 PMCID: PMC6099666 DOI: 10.3390/molecules23061448
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. The relative amino acid composition of secreted proteins and non-secreted proteins in various datasets. The amino acids are sorted according to their enrichment in secreted proteins.
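The comparison summarized in Figure 1 rests on per-residue frequencies. A minimal sketch of how such a composition vector could be computed is given below. The function names and the signed-difference enrichment measure are illustrative assumptions, not the authors' implementation, and the exact definition of "relative" composition used in the figure is not stated here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in one sequence."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def mean_composition(sequences):
    """Average the per-sequence compositions over a group of proteins."""
    comps = [amino_acid_composition(s) for s in sequences]
    return {aa: sum(c[aa] for c in comps) / len(comps) for aa in AMINO_ACIDS}


def rank_by_enrichment(secreted, non_secreted):
    """Sort residues by (secreted mean - non-secreted mean), largest first.

    Assumption: enrichment is taken as a simple signed difference.
    """
    pos = mean_composition(secreted)
    neg = mean_composition(non_secreted)
    diffs = [(aa, pos[aa] - neg[aa]) for aa in AMINO_ACIDS]
    return sorted(diffs, key=lambda item: item[1], reverse=True)
```

The 20-dimensional dictionary returned by `amino_acid_composition` is the kind of vector that could serve as the AAC feature for a single protein.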
Figure 2. Physicochemical properties of secreted proteins and non-secreted proteins in (A) Secreted proteins (SPs)-all, (B) SPs-H, (C) SPs-M, (D) SPs-B, (E) SPs-C and (F) SPs-O. Physicochemical properties (PCP) represent hydrophobicity (PCP. [1]), polarity (PCP. [2]), solvation free energy (PCP. [3]), graph shape index (PCP. [4]), transfer free energy (PCP. [5]), correlation coefficient in regression analysis (PCP. [6]), residue accessible surface area (PCP. [7]), partition coefficient (PCP. [8]), entropy of formulation (PCP. [9]) and protein kinase A (PCP. [10]), respectively.
The top 20 informative motifs in various datasets. ‘-’ denotes any of the 20 amino acids.
| SPs-all | | SPs-H | | SPs-M | | SPs-B | | SPs-C | | SPs-O | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MTF | RDI | MTF | RDI | MTF | RDI | MTF | RDI | MTF | RDI | MTF | RDI |
| LLLL | 0.035 | LLLL | 0.039 | LLLL | 0.037 | LLLL | 0.045 | C-CR | 0.053 | G-CP | 0.064 |
| LL-LLL | 0.034 | LL-LLL | 0.038 | LL-LLL | 0.035 | G-CP | 0.035 | G-CP | 0.047 | C-VP | 0.057 |
| LLL-LL | 0.032 | LLL-LL | 0.036 | C-CP | 0.027 | CP-G | 0.033 | CG-C | 0.047 | GC-P | 0.052 |
| CP-G | 0.022 | LAL-L | 0.027 | C-QG | 0.027 | C-PG | 0.032 | C-AG | 0.047 | CS-C | 0.051 |
| LAL-L | 0.022 | L-LLA | 0.024 | CP-G | 0.026 | C-CL | 0.032 | KGD | 0.046 | SC-C | 0.051 |
| G-TC | 0.021 | LL-LA | 0.024 | C-NG | 0.024 | L-LLA | 0.032 | CP-Q | 0.043 | C-SC | 0.049 |
| C-PG | 0.020 | L-LLG | 0.024 | G-TC | 0.024 | S-SC | 0.032 | CC-P | 0.043 | C-CR | 0.049 |
| LLL-A | 0.020 | LL-LG | 0.024 | CQ-G | 0.024 | CS-S | 0.031 | GR-C | 0.042 | SC-P | 0.047 |
| LL-LA | 0.020 | LLL-A | 0.023 | C-PG | 0.024 | AC-P | 0.031 | CG-R | 0.041 | C-SG | 0.046 |
| L-LLA | 0.020 | LLL-G | 0.023 | GG-C | 0.022 | LL-LA | 0.030 | C-CL | 0.040 | SG-C | 0.046 |
| GT-C | 0.019 | C-PG | 0.022 | CA-G | 0.022 | CA-P | 0.030 | SC-C | 0.040 | CG-C | 0.046 |
| GS-C | 0.018 | LLA-L | 0.022 | C-SC | 0.022 | LLL-A | 0.030 | CC-R | 0.040 | GC-G | 0.045 |
| L-LW | 0.018 | CP-G | 0.022 | C-GG | 0.021 | LCL | 0.029 | CV-P | 0.039 | CC-P | 0.044 |
| ALL-L | 0.017 | ALL-L | 0.021 | GE-C | 0.021 | G-SC | 0.029 | CA-G | 0.039 | LLLL | 0.044 |
| LLA-L | 0.017 | L-LAL | 0.021 | GK-C | 0.020 | SC-S | 0.029 | PQG | 0.037 | KPG | 0.044 |
| LL-AL | 0.017 | LL-AL | 0.020 | GT-C | 0.019 | C-SS | 0.028 | CS-C | 0.037 | C-PT | 0.043 |
| G-RC | 0.017 | LA-LL | 0.020 | G-SC | 0.019 | G-CS | 0.028 | C-SC | 0.037 | GDR | 0.043 |
| P-CP | 0.017 | AL-LL | 0.020 | C-PR | 0.019 | CG-G | 0.027 | RGP | 0.037 | CS-G | 0.042 |
| CP-P | 0.017 | G-TC | 0.020 | WL-L | 0.019 | AC-S | 0.027 | C-PT | 0.036 | C-GC | 0.041 |
| CA-P | 0.016 | L-LW | 0.019 | G-RC | 0.019 | A-CL | 0.027 | PGQ | 0.036 | S-SC | 0.039 |
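Gapped motifs such as `LL-LLL` or `C-CR` in the table above can be matched against a sequence by treating ‘-’ as a single-residue wildcard. The sketch below is one way to count occurrences under that reading. The helper names are hypothetical, and the RDI score reported in the table is not computed because its definition is not given here.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def motif_to_regex(motif):
    """Compile a gapped motif into a regex.

    Assumption: each '-' stands for exactly one position that may hold
    any of the 20 standard amino acids.
    """
    parts = []
    for ch in motif.upper():
        if ch == "-":
            parts.append("[" + AMINO_ACIDS + "]")
        elif ch in AMINO_ACIDS:
            parts.append(ch)
        else:
            raise ValueError("unexpected motif character: " + repr(ch))
    # A zero-width lookahead lets overlapping occurrences be counted.
    return re.compile("(?=" + "".join(parts) + ")")


def count_motif(sequence, motif):
    """Count (possibly overlapping) occurrences of a motif in a sequence."""
    pattern = motif_to_regex(motif)
    return sum(1 for _ in pattern.finditer(sequence.upper()))


def motif_features(sequence, motifs):
    """Encode a sequence as a list of 0/1 presence flags, one per motif."""
    return [1 if count_motif(sequence, m) > 0 else 0 for m in motifs]
```

For example, `motif_features(seq, ["LLLL", "LL-LLL", "CP-G"])` would give a three-element presence vector for `seq`. Using counts instead of presence flags is an equally plausible encoding.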
Figure 3. Example of leucine-rich motifs in mammalian secreted proteins. Panels (A–C) are captured from protein 3D structures 4GRW (human IL-23 with 3 Nanobodies), 1T8T (human 3-O-Sulfotransferase-3 with bound PAP), and 5NV6 (human transforming growth factor beta-induced protein), respectively.
The prediction performance of different features on six training datasets over five-fold cross-validation.
| Dataset | Feature | Sensitivity | Specificity | Accuracy | MCC | AUC |
|---|---|---|---|---|---|---|
| SPs-all | AAC | 0.695 | 0.734 | 0.714 | 0.429 | 0.773 |
| | MTF | 0.354 | 0.910 | 0.632 | 0.317 | 0.660 |
| | PCP | 0.707 | 0.702 | 0.705 | 0.410 | 0.754 |
| SPs-H | AAC | 0.697 | 0.719 | 0.708 | 0.416 | 0.736 |
| | MTF | 0.469 | 0.846 | 0.657 | 0.340 | 0.677 |
| | PCP | 0.670 | 0.755 | 0.712 | 0.426 | 0.746 |
| SPs-M | AAC | 0.685 | 0.734 | 0.709 | 0.419 | 0.754 |
| | MTF | 0.361 | 0.896 | 0.628 | 0.304 | 0.658 |
| | PCP | 0.652 | 0.722 | 0.687 | 0.374 | 0.732 |
| SPs-B | AAC | 0.663 | 0.781 | 0.722 | 0.447 | 0.765 |
| | MTF | 0.247 | 0.988 | 0.618 | 0.350 | 0.682 |
| | PCP | 0.401 | 0.953 | 0.676 | 0.424 | 0.731 |
| SPs-C | AAC | 0.612 | 0.791 | 0.701 | 0.410 | 0.762 |
| | MTF | 0.418 | 0.925 | 0.672 | 0.398 | 0.667 |
| | PCP | 0.463 | 0.900 | 0.682 | 0.404 | 0.759 |
| SPs-O | AAC | 0.677 | 0.797 | 0.737 | 0.477 | 0.744 |
| | MTF | 0.563 | 0.870 | 0.716 | 0.454 | 0.725 |
| | PCP | 0.490 | 0.807 | 0.648 | 0.313 | 0.693 |
The performance of the optimum feature subset on general mammalian secreted proteins and five species-specific secreted proteins over five-fold cross-validation.
| Dataset | Sensitivity | Specificity | Accuracy | MCC | AUC |
|---|---|---|---|---|---|
| SPs-all | 0.705 | 0.783 | 0.744 | 0.490 | 0.806 |
| SPs-H | 0.673 | 0.833 | 0.753 | 0.513 | 0.799 |
| SPs-M | 0.634 | 0.847 | 0.740 | 0.492 | 0.783 |
| SPs-B | 0.728 | 0.825 | 0.777 | 0.556 | 0.815 |
| SPs-C | 0.657 | 0.876 | 0.766 | 0.546 | 0.784 |
| SPs-O | 0.771 | 0.870 | 0.820 | 0.644 | 0.835 |
Comparison between species-specific and universal schemes on different species of training datasets over five-fold cross-validation.
| Dataset | Model | Sensitivity | Specificity | Accuracy | MCC |
|---|---|---|---|---|---|
| SPs-H | iMSP-H | 0.673 | 0.833 | 0.753 | 0.513 |
| | iMSP-U | 0.647 | 0.820 | 0.733 | 0.474 |
| SPs-M | iMSP-M | 0.634 | 0.847 | 0.740 | 0.492 |
| | iMSP-U | 0.652 | 0.789 | 0.721 | 0.446 |
| SPs-B | iMSP-B | 0.728 | 0.825 | 0.777 | 0.556 |
| | iMSP-U | 0.695 | 0.811 | 0.753 | 0.509 |
| SPs-C | iMSP-C | 0.657 | 0.876 | 0.766 | 0.546 |
| | iMSP-U | 0.473 | 0.841 | 0.657 | 0.337 |
| SPs-O | iMSP-O | 0.771 | 0.870 | 0.820 | 0.644 |
| | iMSP-U | 0.615 | 0.823 | 0.719 | 0.447 |
Figure 4. Comparison of predicted AUC values between species-specific and universal models.
The performance of different methods on six testing datasets.
| Dataset | Method | Sensitivity | Specificity | Accuracy | MCC | AUC |
|---|---|---|---|---|---|---|
| SPs-all | SecretomeP | 0.611 | 0.798 | 0.763 | 0.355 | 0.729 |
| | SRTpred | 0.652 | 0.824 | 0.792 | 0.419 | 0.781 |
| | iMSP-U | 0.590 | 0.865 | 0.814 | 0.427 | 0.802 |
| SPs-H | SecretomeP | 0.632 | 0.787 | 0.762 | 0.340 | 0.764 |
| | SRTpred | 0.678 | 0.802 | 0.782 | 0.392 | 0.770 |
| | iMSP-H | 0.631 | 0.866 | 0.829 | 0.443 | 0.821 |
| | iMSP-U | 0.538 | 0.908 | 0.850 | 0.441 | 0.817 |
| SPs-M | SecretomeP | 0.629 | 0.832 | 0.731 | 0.471 | 0.776 |
| | SRTpred | 0.707 | 0.793 | 0.751 | 0.503 | 0.785 |
| | iMSP-M | 0.742 | 0.776 | 0.759 | 0.519 | 0.809 |
| | iMSP-U | 0.703 | 0.802 | 0.753 | 0.507 | 0.803 |
| SPs-B | SecretomeP | 0.575 | 0.861 | 0.824 | 0.367 | 0.768 |
| | SRTpred | 0.670 | 0.857 | 0.833 | 0.431 | 0.787 |
| | iMSP-B | 0.547 | 0.901 | 0.856 | 0.411 | 0.795 |
| | iMSP-U | 0.679 | 0.766 | 0.755 | 0.327 | 0.763 |
| SPs-C | SecretomeP | 0.549 | 0.921 | 0.865 | 0.470 | 0.779 |
| | SRTpred | 0.686 | 0.866 | 0.839 | 0.478 | 0.782 |
| | iMSP-C | 0.412 | 0.962 | 0.880 | 0.457 | 0.789 |
| | iMSP-U | 0.667 | 0.670 | 0.670 | 0.247 | 0.718 |
| SPs-O | SecretomeP | 0.729 | 0.782 | 0.775 | 0.390 | 0.747 |
| | SRTpred | 0.792 | 0.842 | 0.835 | 0.509 | 0.820 |
| | iMSP-O | 0.646 | 0.913 | 0.876 | 0.521 | 0.841 |
| | iMSP-U | 0.521 | 0.805 | 0.766 | 0.264 | 0.716 |
Predicted probabilities to be potential secreted proteins in human proteome.
| | | | | | |
|---|---|---|---|---|---|
| iMSP- | 1155 (2.24%) | 5984 (11.63%) | 7803 (15.16%) | 7684 (14.93%) | 7100 (13.79%) |
| iMSP- | 1904 (3.70%) | 6213 (12.07%) | 7333 (14.25%) | 7219 (14.03%) | 6769 (13.15%) |

| | | | | | |
|---|---|---|---|---|---|
| iMSP- | 5551 (10.79%) | 4745 (9.22%) | 4028 (7.83%) | 3768 (7.32%) | 3651 (7.09%) |
| iMSP- | 5536 (10.76%) | 4848 (9.42%) | 4046 (7.86%) | 3993 (7.76%) | 3608 (7.01%) |
A breakdown of newly-compiled datasets used in this work.
| Dataset | Species | All Dataset (numP, numN) * | Training Dataset (numP, numN) * | Testing Dataset (numP, numN) * |
|---|---|---|---|---|
| SPs-all | | (2560, 4299) | (2048, 2048) | (512, 2251) |
| SPs-H | | (1986, 3714) | (1588, 1588) | (398, 2126) |
| SPs-M | | (1144, 1147) | (915, 915) | (229, 232) |
| SPs-B | | (529, 1148) | (423, 423) | (106, 725) |
| SPs-C | | (252, 492) | (201, 201) | (51, 291) |
| SPs-O | | (240, 490) | (192, 192) | (48, 298) |
* numP and numN represent the numbers of secreted proteins and non-secreted proteins respectively.
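The reported accuracy and MCC values can be cross-checked against the sensitivity, specificity, and class sizes in the tables. The sketch below assumes the standard confusion-matrix definitions of these metrics, which are not spelled out here, and uses hypothetical function names. It reconstructs approximate counts from the rounded rates, so small rounding differences are expected.

```python
from math import sqrt


def confusion_from_rates(sensitivity, specificity, num_pos, num_neg):
    """Recover approximate (TP, FN, TN, FP) from rates and class sizes."""
    tp = round(sensitivity * num_pos)
    tn = round(specificity * num_neg)
    return tp, num_pos - tp, tn, num_neg - tn


def accuracy(tp, fn, tn, fp):
    """Fraction of all proteins classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient (standard definition)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when any marginal sum is zero
    return (tp * tn - fp * fn) / denom


# iMSP-U on the SPs-all testing set: sensitivity 0.590, specificity 0.865,
# with numP = 512 and numN = 2251 taken from the dataset breakdown table.
counts = confusion_from_rates(0.590, 0.865, 512, 2251)
print(round(accuracy(*counts), 3), round(mcc(*counts), 3))
```

Under these assumptions the recomputed values agree with the accuracy (0.814) and MCC (0.427) listed for iMSP-U on SPs-all. Because the testing sets are imbalanced, accuracy is dominated by specificity, which is why MCC is the more informative of the two for comparing methods.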