Kai-Yao Huang, Yi-Jhan Tseng, Hui-Ju Kao, Chia-Hung Chen, Hsiao-Hsiang Yang, Shun-Long Weng.
Abstract
Anticancer peptides (ACPs) are bioactive peptides that could serve as a novel type of anticancer drug, with several advantages over chemistry-based drugs, including high specificity, strong tumor penetration capacity, and low toxicity to normal cells. As the number of experimentally verified bioactive peptides has increased significantly, various in silico approaches have become imperative for investigating the characteristics of ACPs. However, methods for investigating the differences in physicochemical properties of ACPs are lacking. In this study, we compared the N- and C-terminal amino acid composition of each peptide and found three major subtypes of ACPs, defined by the distribution of positively charged residues. We were therefore motivated to develop, for the first time, a two-step machine learning model for identifying the subtypes of ACPs, which assigns the input data to the corresponding group before applying the classifier. Further, to improve predictive power, hybrid feature sets were considered for prediction. Evaluation by five-fold cross-validation showed that the two-step model trained with sequence-based features and physicochemical properties was most effective in discriminating between ACPs and non-ACPs. The two-step model trained with the hybrid features performed well, with a sensitivity of 86.75%, a specificity of 85.75%, an accuracy of 86.08%, and a Matthews correlation coefficient (MCC) of 0.703. Furthermore, the model performed consistently on the independent testing set, with a sensitivity of 77.6%, a specificity of 94.74%, an accuracy of 88.99%, and an MCC of 0.75. Finally, the two-step model has been implemented as a web-based tool, iDACP, which is freely available at http://mer.hc.mmh.org.tw/iDACP/.
Year: 2021 PMID: 34193950 PMCID: PMC8245499 DOI: 10.1038/s41598-021-93124-9
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Analytical flowchart of iDACP, including data collection, data preprocessing and type grouping, feature investigation, feature set combination, model construction and evaluation, and independent testing.
Data statistics of the training and testing datasets after removal of homologous sequences using the CD-HIT program.
| Filtering step / dataset | Number of ACPs | Number of non-ACPs |
|---|---|---|
| Raw data | 1354 | 2250 |
| Sequence length > 10 aa | 1256 | 2250 |
| Sequence identity < 90% | 992 | 1980 |
| Training dataset | 800 | 1600 |
| Independent testing dataset | 192 | 380 |
aa: amino acid; ACPs: anti-cancer peptides; non-ACPs: non-anti-cancer peptides.
Figure 2. Composition of the twenty amino acids in ACPs and non-ACPs.
Figure 3. Composition of the twenty amino acids in the N- and C-terminal regions of ACPs.
Figure 4. Frequency differences of the 20 × 20 amino acid pairs between ACPs and non-ACPs.
Figure 5. Comparison of the physicochemical property profiles between ACPs and non-ACPs.
Five-fold cross-validation results of the models trained with a single feature.
| Feature | Sen. (%) | Spec. (%) | Acc. (%) | BAcc. (%) | MCC |
|---|---|---|---|---|---|
| AAC | 86.23 ± 0.56 | 87.25 ± 0.28 | 86.91 ± 0.23 | 86.74 ± 0.28 | 0.72 ± 0.01 |
| DPC | 85.38 ± 0.36 | 84.63 ± 0.50 | 84.88 ± 0.35 | 85.00 ± 0.30 | 0.68 ± 0.01 |
| CKSAAP, k = 1 | 85.98 ± 0.54 | 86.31 ± 0.46 | 86.20 ± 0.44 | 86.14 ± 0.45 | 0.70 ± 0.01 |
| CKSAAP, k = 2 | 85.50 ± 0.27 | 86.34 ± 0.41 | 86.06 ± 0.26 | 85.92 ± 0.21 | 0.70 ± 0.00 |
| CKSAAP, k = 3 | 86.63 ± 0.13 | 86.55 ± 0.26 | 86.58 ± 0.20 | 86.59 ± 0.17 | 0.71 ± 0.00 |
| PCP | 71.53 ± 0.95 | 71.05 ± 0.47 | 71.21 ± 0.60 | 71.29 ± 0.68 | 0.41 ± 0.01 |
Sen.: sensitivity; Spec.: specificity; Acc.: accuracy; BAcc.: balanced accuracy; MCC: Matthews correlation coefficient. Values are given as mean ± standard deviation of all measurements.
Figure 6. Position-specific amino acid composition of the N- and C-terminal regions in the different subtypes of ACPs.
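The sequence-based feature names in the table above follow common usage: AAC is amino acid composition, DPC is dipeptide composition, and CKSAAP is the composition of k-spaced amino acid pairs. The following is a minimal sketch of how such encodings are typically computed, not the authors' implementation; the function names, the residue ordering, and the example peptide `KKLLKK` are assumptions made for illustration.

```python
from itertools import product

# Assumption: the 20 standard residues in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def aac(seq):
    """Amino acid composition: fraction of each standard residue (20 values)."""
    seq = seq.upper()
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def cksaap(seq, k):
    """Composition of k-spaced pairs: frequency of (seq[i], seq[i+k+1]) (400 values)."""
    seq = seq.upper()
    windows = len(seq) - k - 1  # number of residue pairs separated by k positions
    if windows <= 0:
        return [0.0] * len(PAIRS)
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(windows):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return [counts[p] / windows for p in PAIRS]


def dpc(seq):
    """Dipeptide composition, i.e. adjacent pairs (the k = 0 case of cksaap)."""
    return cksaap(seq, 0)


if __name__ == "__main__":
    peptide = "KKLLKK"  # hypothetical example, not taken from the dataset
    print(aac(peptide)[AMINO_ACIDS.index("K")])
    print(dpc(peptide)[PAIRS.index("KK")])
    print(cksaap(peptide, 1)[PAIRS.index("KL")])
```

A hybrid feature set such as AAC + DPC would then simply be the concatenation of the individual vectors for a peptide.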
Data statistics for each type of ACP in the training and testing datasets.
| Dataset | ACPs | Non-ACPs |
|---|---|---|
| Training dataset | 800 | 1600 |
| Group C+ | 394 | 428 |
| Group N+ | 158 | 508 |
| Group other | 248 | 664 |
| Testing dataset | 192 | 380 |
| Group C+ | 94 | 103 |
| Group N+ | 39 | 131 |
| Group other | 59 | 146 |
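The two-step idea (first assign a peptide to a group, then apply that group's classifier) can be sketched as below. The exact grouping rule is not given here beyond its being based on the distribution of positively charged residues, so the rule in this sketch is purely hypothetical: it compares the number of lysine and arginine residues in a terminal window of assumed length 5. The names `assign_group`, `two_step_predict`, and the dummy classifiers are illustrative as well.

```python
# Assumption: lysine (K) and arginine (R) count as positively charged;
# histidine (H) could be added depending on the definition used.
POSITIVE = set("KR")


def assign_group(seq, window=5):
    """Step 1 (hypothetical rule): label a peptide 'C+', 'N+' or 'other'."""
    seq = seq.upper()
    n_pos = sum(res in POSITIVE for res in seq[:window])   # N-terminal window
    c_pos = sum(res in POSITIVE for res in seq[-window:])  # C-terminal window
    if c_pos > n_pos:
        return "C+"
    if n_pos > c_pos:
        return "N+"
    return "other"


def two_step_predict(seq, classifiers, window=5):
    """Step 2: route the peptide to the classifier trained for its group."""
    group = assign_group(seq, window)
    return group, classifiers[group](seq)


if __name__ == "__main__":
    # Dummy stand-ins for three group-specific models (assumed interface:
    # a callable that takes a sequence and returns a label).
    classifiers = {
        "C+": lambda s: "ACP",
        "N+": lambda s: "non-ACP",
        "other": lambda s: "non-ACP",
    }
    print(two_step_predict("GGGGGALRKK", classifiers))
```

In practice each entry of `classifiers` would wrap a model trained only on the corresponding rows of the table above, which is why the group sizes matter for training.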
Figure 7. Physicochemical property profiles of the N- and C-termini in the different subtypes of ACPs.
Five-fold cross-validation results of the two-step models trained with a single feature.
| Feature | Sen. (%) | Spec. (%) | Acc. (%) | BAcc. (%) | MCC |
|---|---|---|---|---|---|
| AAC | 85.45 ± 0.74 | 84.09 ± 0.78 | 84.54 ± 0.41 | 84.77 ± 0.32 | 0.67 ± 0.01 |
| DPC | 83.83 ± 0.60 | 80.24 ± 0.95 | 81.43 ± 0.61 | 82.03 ± 0.50 | 0.61 ± 0.01 |
| CKSAAP, k = 1 | 84.65 ± 0.92 | 82.33 ± 0.49 | 83.10 ± 0.40 | 83.49 ± 0.48 | 0.64 ± 0.01 |
| CKSAAP, k = 2 | 83.95 ± 1.02 | 82.39 ± 0.44 | 82.91 ± 0.51 | 83.17 ± 0.61 | 0.64 ± 0.01 |
| CKSAAP, k = 3 | 84.75 ± 0.72 | 82.68 ± 0.76 | 83.37 ± 0.45 | 83.71 ± 0.38 | 0.65 ± 0.01 |
| PCP | 73.55 ± 1.82 | 69.25 ± 1.99 | 70.68 ± 1.13 | 71.40 ± 0.93 | 0.41 ± 0.02 |
Sen.: sensitivity; Spec.: specificity; Acc.: accuracy; BAcc.: balanced accuracy; MCC: Matthews correlation coefficient. Values are given as mean ± standard deviation of all measurements.
Five-fold cross-validation results of the two-step models trained with the hybrid feature sets.
| Feature | Sen. (%) | Spec. (%) | Acc. (%) | BAcc. (%) | MCC |
|---|---|---|---|---|---|
| AAC + DPC | 85.85 ± 0.66 | 85.45 ± 0.25 | 85.58 ± 0.36 | 85.65 ± 0.43 | 0.69 ± 0.01 |
| AAC + DPC + PCP | 86.05 ± 0.63 | 85.78 ± 0.25 | 85.87 ± 0.27 | 85.91 ± 0.34 | 0.70 ± 0.01 |
| AAC + DPC + CKSAAP | 86.18 ± 0.49 | 84.28 ± 0.29 | 84.91 ± 0.29 | 85.23 ± 0.33 | 0.68 ± 0.01 |
| AAC + DPC + CKSAAP + PCP | 86.03 ± 0.58 | 84.79 ± 0.27 | 85.20 ± 0.33 | 85.41 ± 0.38 | 0.68 ± 0.01 |
Sen.: sensitivity; Spec.: specificity; Acc.: accuracy; BAcc.: balanced accuracy; MCC: Matthews correlation coefficient. Values are given as mean ± standard deviation of all measurements.
Comparison of independent testing results between our method and the available prediction tools.
| Tools | Sensitivity (%) | Specificity (%) | Accuracy (%) | B. Accuracy (%) | MCC |
|---|---|---|---|---|---|
| iDACP | 77.60 | 94.74 | 88.99 | 86.17 | 0.75 |
| ACPred | 75.97 | 84.21 | 81.84 | 80.09 | 0.58 |
| ACPred-FL | 57.79 | 25.79 | 35.02 | 41.79 | −0.16 |
| Anti-CP | 100 | 0.29 | 34.03 | 50.15 | 0.03 |
| iACP | 65.10 | 75.53 | 72.03 | 70.32 | 0.40 |
| mACPpred | 71.35 | 94.47 | 86.71 | 82.91 | 0.70 |
B. Accuracy: balanced accuracy; MCC: Matthews correlation coefficient.
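For readers who want to see how the five reported measures relate, here is a minimal sketch using their standard definitions from a binary confusion matrix. The counts in the usage example (TP = 149, FN = 43, TN = 360, FP = 20) are not reported in the source; they are an assumption, back-calculated from the iDACP percentages and the 192 ACPs and 380 non-ACPs of the independent testing dataset.

```python
from math import sqrt


def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Assumed counts, inferred from the reported percentages (not from the source).
    m = binary_metrics(tp=149, fn=43, tn=360, fp=20)
    for name, value in m.items():
        shown = round(value, 2) if name == "mcc" else round(value * 100, 2)
        print(name, shown)
```

With these assumed counts the rounded values agree with the iDACP row (77.6, 94.74, 88.99, 86.17, and 0.75). Because the test set has roughly twice as many non-ACPs as ACPs, balanced accuracy and MCC are more informative here than plain accuracy.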