Jian Zhang, Yu Zhang, Zhiqiang Ma.
Abstract
Early control and prevention of cancer contribute to effective interventions and cancer therapies. Secretory proteins, one of the richest sources of biomarkers, have proved important as molecular signposts of the physiological state of a cell. In this work, we propose a high-throughput proteomic technology platform to facilitate early cancer detection by means of biomarkers that are secreted into the bloodstream. We compile a new benchmark dataset of human secretory proteins in plasma. A series of sequence-derived features, which have been shown to be involved in the structure and function of secretory proteins, are collected to encode these proteins mathematically. To limit the influence of potentially irrelevant or redundant features, we introduce a discrete firefly optimization algorithm to perform feature selection. We evaluate and compare the proposed method, SCRIP (Secretory proteins in plasma), with state-of-the-art approaches on benchmark datasets and independent testing datasets. SCRIP achieves average AUC values of 0.876 and 0.844 in five-fold cross-validation and the independent test, respectively. We also test SCRIP on proteins in four types of cancer tissue and successfully detect 66 to 77% of potential cancer biomarkers.
Keywords: cancer biomarker; discrete firefly algorithm; human plasma; human proteome; secretory proteins
Year: 2019 PMID: 31244885 PMCID: PMC6563772 DOI: 10.3389/fgene.2019.00542
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Overall framework of the proposed SCRIP method.
The RAAC values for secretory proteins against the Swiss-Prot database and for secretory proteins against non-secretory proteins.
| AA type | Secretory proteins vs. Swiss-Prot | Secretory proteins vs. non-secretory proteins | AA type | Secretory proteins vs. Swiss-Prot | Secretory proteins vs. non-secretory proteins |
|---|---|---|---|---|---|
| A | M | 0.003 (0.766) | |||
| C | 0.098 (0.000) | N | 0.067 (0.000) | ||
| D | 0.020 (0.051) | P | |||
| E | Q | 0.088 (0.000) | |||
| F | 0.056 (0.000) | R | |||
| G | 0.002 (0.883) | S | |||
| H | 0.051 (0.000) | T | |||
| I | 0.096 (0.000) | V | 0.090 (0.000) | ||
| K | W | 0.068 (0.001) | |||
| L | Y | 0.042 (0.003) |
FIGURE 2. Statistics of super-secondary structure motifs in secretory proteins. (A) Fraction of residues located in each super-secondary structure motif. (B) Fraction of each super-secondary structure motif.
FIGURE 3. Relative difference of the evolutionary conservation matrix between secretory and non-secretory proteins. The 20 amino acid residues are shown at the top and right. Values higher than 5% indicate that the corresponding substitution is favored by secretory proteins compared with non-secretory proteins, and are colored blue. Red cells mark values lower than –5% and indicate the opposite. The amino acids are grouped using agglomerative clustering with complete linkage.
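
The grouping step named in the Figure 3 caption can be illustrated with a minimal sketch of agglomerative clustering with complete linkage. The distance matrix `toy` and the helper `complete_linkage` are hypothetical; the source does not say which distance between amino acids was used or how many groups were formed.

```python
from itertools import combinations

def complete_linkage(dist, n_clusters):
    """Merge clusters until n_clusters remain; dist is a symmetric matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        # Complete linkage: the distance between two clusters is the
        # largest pairwise distance between their members.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: max(
                dist[i][j] for i in clusters[p[0]] for j in clusters[p[1]]
            ),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy distances between four hypothetical items (not data from the paper).
toy = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
groups = complete_linkage(toy, 2)
print(groups)
```

Because complete linkage scores a merge by the largest pairwise distance, it tends to produce compact groups, which suits ordering the rows and columns of a heat map.
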
The predictive performance of different types of features on the training dataset using 5-fold cross-validation.
| Type of features | Statistic | Sensitivity | Specificity | Precision | Accuracy | MCC | F1 | AUC |
|---|---|---|---|---|---|---|---|---|
| RAAC | Average Stdev | 0.623 ± 0.019 | 0.573 ± 0.015 | 0.593 ± 0.009 | 0.598 ± 0.010 | 0.196 ± 0.020 | 0.608 ± 0.012 | 0.645 ± 0.010 |
| SS | Average Stdev | 0.614 ± 0.013 | 0.573 ± 0.015 | 0.590 ± 0.011 | 0.593 ± 0.012 | 0.187 ± 0.023 | 0.601 ± 0.012 | 0.647 ± 0.019 |
| EC | Average Stdev | 0.684 ± 0.021 | 0.691 ± 0.018 | 0.689 ± 0.016 | 0.687 ± 0.016 | 0.375 ± 0.031 | 0.686 ± 0.017 | 0.749 ± 0.011 |
| PP | Average Stdev | 0.562 ± 0.018 | 0.660 ± 0.027 | 0.624 ± 0.023 | 0.611 ± 0.019 | 0.224 ± 0.039 | 0.591 ± 0.019 | 0.652 ± 0.011 |
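
The column headings in these tables can be read as the usual confusion-matrix quantities. The sketch below assumes the standard definitions, since this record does not spell them out; the function name `binary_metrics` and the example counts are illustrative only.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }

# Hypothetical balanced example: 100 positives and 100 negatives.
m = binary_metrics(tp=70, fp=30, tn=70, fn=30)
print(m)
```

On a balanced set like this one, MCC equals sensitivity plus specificity minus one, which is why MCC values near 0.4 accompany sensitivities and specificities near 0.7 in the tables.
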
The predictive performance of combinations of different types of features on the training dataset using 5-fold cross-validation.
| Type of features | Statistic | Sensitivity | Specificity | Precision | Accuracy | MCC | F1 | AUC |
|---|---|---|---|---|---|---|---|---|
| RAAC+SS | Average stdev | 0.630 ± 0.016 | 0.581 ± 0.015 | 0.601 ± 0.009 | 0.606 ± 0.009 | 0.212 ± 0.018 | 0.615 ± 0.010 | 0.647 ± 0.011 |
| RAAC+EC | Average stdev | 0.690 ± 0.020 | 0.698 ± 0.016 | 0.695 ± 0.014 | 0.694 ± 0.014 | 0.387 ± 0.027 | 0.692 ± 0.015 | 0.752 ± 0.009 |
| RAAC+PP | Average stdev | 0.625 ± 0.010 | 0.663 ± 0.019 | 0.650 ± 0.013 | 0.644 ± 0.010 | 0.289 ± 0.021 | 0.637 ± 0.009 | 0.656 ± 0.008 |
| RAAC+SS+EC | Average stdev | 0.692 ± 0.019 | 0.702 ± 0.018 | 0.699 ± 0.017 | 0.697 ± 0.017 | 0.394 ± 0.034 | 0.696 ± 0.017 | 0.652 ± 0.013 |
| RAAC+SS+PP | Average stdev | 0.645 ± 0.018 | 0.653 ± 0.009 | 0.650 ± 0.007 | 0.649 ± 0.009 | 0.298 ± 0.017 | 0.647 ± 0.012 | 0.651 ± 0.013 |
| RAAC+EC+PP | Average stdev | 0.698 ± 0.014 | 0.706 ± 0.012 | 0.703 ± 0.010 | 0.702 ± 0.010 | 0.404 ± 0.020 | 0.701 ± 0.011 | 0.757 ± 0.006 |
| RAAC+SS+EC+PP | Average stdev | 0.707 ± 0.017 | 0.716 ± 0.010 | 0.713 ± 0.008 | 0.711 ± 0.009 | 0.423 ± 0.018 | 0.710 ± 0.011 | 0.765 ± 0.010 |
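
The "average ± stdev" entries summarize scores from 5-fold cross-validation. The sketch below shows one way to split sample indices into five folds and to summarize a list of scores. It assumes the sample standard deviation and a simple shuffled split; the source does not state whether the spread is taken over folds or over repeated runs, and the scores used here are made up.

```python
import random
import statistics

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def summarize(scores):
    """Format scores as 'mean ± sample standard deviation'."""
    return f"{statistics.mean(scores):.3f} ± {statistics.stdev(scores):.3f}"

folds = k_fold_indices(23)
# Each fold serves once as the held-out set; the rest form the training set.
for held_out in folds:
    train = [i for f in folds if f is not held_out for i in f]
    assert len(train) + len(held_out) == 23

summary = summarize([0.70, 0.71, 0.72, 0.73, 0.74])  # hypothetical scores
print(summary)
```
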
FIGURE 4. Comparison of predictive accuracy and fitness of three swarm optimization algorithms. (A) Discrete firefly algorithm, (B) discrete particle swarm optimization, and (C) genetic algorithm.
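
To make the feature-selection idea concrete, here is a minimal sketch of one generic binary firefly variant. It is not the authors' exact update rule, which this record does not give: the attraction formula, the parameters `beta0`, `gamma` and `alpha`, and the toy fitness function are all assumptions for illustration. In a real pipeline the fitness would be something like cross-validated accuracy on the selected features.

```python
import math
import random

def discrete_firefly(fitness, n_features, n_fireflies=10, n_iter=30,
                     beta0=1.0, gamma=0.01, alpha=0.02, seed=0):
    """Search for a boolean feature mask that maximizes fitness(mask)."""
    rng = random.Random(seed)
    swarm = [[rng.random() < 0.5 for _ in range(n_features)]
             for _ in range(n_fireflies)]
    light = [fitness(f) for f in swarm]
    best_i = max(range(n_fireflies), key=light.__getitem__)
    best, best_fit = swarm[best_i][:], light[best_i]
    for _ in range(n_iter):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if light[j] <= light[i]:
                    continue  # only move toward brighter fireflies
                # Attractiveness decays with Hamming distance r.
                r = sum(a != b for a, b in zip(swarm[i], swarm[j]))
                beta = beta0 * math.exp(-gamma * r * r)
                for k in range(n_features):
                    if swarm[i][k] != swarm[j][k] and rng.random() < beta:
                        swarm[i][k] = swarm[j][k]      # copy bit from brighter
                    if rng.random() < alpha:
                        swarm[i][k] = not swarm[i][k]  # random exploration
                light[i] = fitness(swarm[i])
                if light[i] > best_fit:
                    best, best_fit = swarm[i][:], light[i]
    return best, best_fit

def toy_fitness(mask):
    """Hypothetical objective: first 5 features help, the rest cost 0.1 each."""
    return sum(mask[:5]) - 0.1 * sum(mask[5:])

mask, score = discrete_firefly(toy_fitness, n_features=20)
print(sum(mask), score)
```

The penalty term in `toy_fitness` stands in for the pressure to drop redundant features; the tables report both accuracy and the number of selected features for each strategy.
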
Comparison of different strategies of feature selection on benchmark training datasets.
| Strategy | Statistic | Sensitivity | Specificity | Precision | Accuracy | MCC | F1 | AUC | Number of features |
|---|---|---|---|---|---|---|---|---|---|
| Combination of all features | Average ± stdev | 0.707 ± 0.017 | 0.716 ± 0.010 | 0.713 ± 0.008 | 0.711 ± 0.009 | 0.423 ± 0.018 | 0.710 ± 0.011 | 0.765 ± 0.010 | 470 (stdev N/A) |
| | P-value | 6.9e-10 | 5.2e-10 | 5.6e-11 | 1.4e-11 | 1.4e-11 | 2.4e-11 | 4.7e-19 | N/A |
| LASSO | Average ± stdev | 0.734 ± 0.016 | 0.750 ± 0.004 | 0.746 ± 0.006 | 0.742 ± 0.009 | 0.484 ± 0.017 | 0.740 ± 0.011 | 0.784 ± 0.006 | 74 ± 11 |
| | P-value | 5.6e-05 | 1.6e-05 | 2.2e-06 | 1.2e-06 | 1.1e-06 | 2.2e-06 | 2.3e-18 | 1.8e-20 |
| GA | Average ± stdev | 0.752 ± 0.024 | 0.755 ± 0.009 | 0.754 ± 0.009 | 0.753 ± 0.013 | 0.507 ± 0.025 | 0.753 ± 0.016 | 0.813 ± 0.006 | 233 ± 11 |
| | P-value | 0.16 | 5.2e-04 | 4.8e-04 | 3.3e-03 | 3.2e-03 | 0.01 | 9.4e-16 | 3.2e-05 |
| DPSO | Average ± stdev | 0.752 ± 0.013 | 0.752 ± 0.005 | 0.753 ± 0.006 | 0.751 ± 0.008 | 0.504 ± 0.016 | 0.752 ± 0.009 | 0.819 ± 0.004 | 280 ± 6 |
| | P-value | 0.0248 | 4.7e-05 | 5.7e-05 | 1.7e-04 | 1.7e-04 | 4.7e-04 | 1.4e-16 | 5.8e-10 |
| DFA | Average ± stdev | 0.763 ± 0.007 | 0.777 ± 0.014 | 0.774 ± 0.012 | 0.770 ± 0.009 | 0.540 ± 0.018 | 0.768 ± 0.008 | 0.876 ± 0.005 | 254 ± 4 |
| | P-value | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
Comparison of SCRIP with other state-of-the-art predictors on benchmark testing datasets.
| Predictor | Statistic | Sensitivity | Specificity | Precision | Accuracy | MCC | F1 | AUC |
|---|---|---|---|---|---|---|---|---|
| SecretomeP | Average ± stdev | 0.700 ± 0.025 | 0.726 ± 0.033 | 0.719 ± 0.025 | 0.713 ± 0.021 | 0.426 ± 0.042 | 0.709 ± 0.020 | 0.709 ± 0.011 |
| | P-value | 1.1e-08 | 1.0e-06 | 4.5e-08 | 4.4e-09 | 4.4e-09 | 2.5e-09 | 9.6e-20 |
| SRTpred | Average ± stdev | 0.710 ± 0.021 | 0.721 ± 0.026 | 0.718 ± 0.017 | 0.715 ± 0.012 | 0.431 ± 0.024 | 0.714 ± 0.012 | 0.714 ± 0.018 |
| | P-value | 1.0e-07 | 2.8e-05 | 2.0e-06 | 2.1e-07 | 2.2e-07 | 9.1e-08 | 1.0e-14 |
| iMSP- | Average ± stdev | 0.718 ± 0.031 | 0.730 ± 0.026 | 0.727 ± 0.026 | 0.724 ± 0.026 | 0.449 ± 0.052 | 0.723 ± 0.027 | 0.795 ± 0.009 |
| | P-value | 1.2e-07 | 6.6e-05 | 8.3e-06 | 1.2e-06 | 1.2e-06 | 4.3e-07 | 4.4e-13 |
| iMSP- | Average ± stdev | 0.733 ± 0.027 | 0.735 ± 0.025 | 0.735 ± 0.019 | 0.734 ± 0.018 | 0.469 ± 0.036 | 0.734 ± 0.019 | 0.817 ± 0.012 |
| | P-value | 3.9e-06 | 1.4e-04 | 3.5e-05 | 8.9e-06 | 9.2e-06 | 5.1e-06 | 1.4e-10 |
| SCRIP | Average ± stdev | 0.754 ± 0.027 | 0.765 ± 0.036 | 0.763 ± 0.029 | 0.759 ± 0.024 | 0.519 ± 0.047 | 0.758 ± 0.023 | 0.844 ± 0.010 |
| | P-value | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
Comparison of state-of-the-art predictors with the proposed method on iMSP’s testing dataset.
| Predictor | Sensitivity | Specificity | Accuracy | MCC | AUC |
|---|---|---|---|---|---|
| SecretomeP | 0.632 | 0.787 | 0.762 | 0.340 | 0.764 |
| SRTpred | 0.678 | 0.802 | 0.782 | 0.392 | 0.770 |
| iMSP- | 0.631 | 0.866 | 0.829 | 0.443 | 0.821 |
| iMSP- | 0.538 | 0.908 | 0.850 | 0.441 | 0.817 |
| SCRIP | 0.716 | 0.884 | 0.858 | 0.537 | 0.865 |
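
The AUC column can be interpreted as the probability that a randomly chosen positive protein receives a higher score than a randomly chosen negative one. A minimal sketch of that rank-based computation follows; the labels and scores are invented, and the source does not say how its AUC values were computed.

```python
def auc_score(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predictions: 1 = secretory, 0 = non-secretory.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = auc_score(labels, scores)
print(round(auc, 3))
```

Unlike sensitivity and specificity, this quantity does not depend on a decision threshold, so it is the easiest column to compare across predictors that were tuned differently.
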
Application of SCRIP to cancer biomarker identification.
| Types of Cancer | Sensitivity | Specificity | Precision | Accuracy | MCC | F1 | AUC |
|---|---|---|---|---|---|---|---|
| Breast Cancer | 0.769 | 0.718 | 0.057 | 0.719 | 0.156 | 0.107 | 0.776 |
| Gastric Cancer | 0.733 | 0.820 | 0.193 | 0.815 | 0.311 | 0.306 | 0.804 |
| Lung Cancer | 0.733 | 0.666 | 0.045 | 0.667 | 0.120 | 0.085 | 0.792 |
| Pancreatic Cancer | 0.667 | 0.691 | 0.135 | 0.689 | 0.190 | 0.224 | 0.811 |
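
The low precision values in this table, alongside sensitivity and specificity near 0.7, are what one would expect when true biomarkers are a small fraction of the tested proteins. The sketch below uses the standard relation between precision, sensitivity, specificity and the fraction of positives to back out that fraction for the breast cancer row. The result is derived from rounded table values under those standard definitions and is not a figure stated in the source; both function names are illustrative.

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    """Precision implied by sensitivity, specificity and positive fraction."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp)

def implied_prevalence(sensitivity, specificity, precision):
    """Solve the relation above for the fraction of positives."""
    fpr = 1 - specificity
    return precision * fpr / (sensitivity * (1 - precision) + precision * fpr)

# Breast cancer row: sensitivity 0.769, specificity 0.718, precision 0.057.
p = implied_prevalence(0.769, 0.718, 0.057)
print(f"implied fraction of positives: {p:.3f}")
```

Under these assumptions the implied fraction of positives is roughly 2%, so even a modest false positive rate produces far more false positives than true positives.
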