Nalini Schaduangrat, Chanin Nantasenamat, Virapong Prachayasittikul, Watshara Shoombuatong.
Abstract
Anticancer peptides (ACPs) have emerged as a new class of therapeutic agent for cancer treatment due to their lower toxicity as well as greater efficacy, selectivity and specificity when compared to conventional small molecule drugs. However, the experimental identification of ACPs remains a time-consuming and expensive endeavor. Therefore, it is desirable to develop and improve upon existing computational models for predicting and characterizing ACPs. In this study, we present ACPred, an interpretable bioinformatics tool for the prediction and characterization of the anticancer activities of peptides. ACPred was developed using machine learning models (support vector machine and random forest) and various classes of peptide features. A jackknife cross-validation test showed that ACPred achieves an overall accuracy of 95.61% in identifying ACPs. In addition, analysis revealed the following distinguishing characteristics of ACPs: (i) hydrophobic residues enhance the cationic properties of α-helical ACPs, resulting in better cell penetration; (ii) the amphipathic nature of the α-helical structure plays a crucial role in its mechanism of cytotoxicity; and (iii) the formation of disulfide bridges on β-sheets is vital for structural maintenance, which correlates with the ability to kill cancer cells. Finally, for the convenience of experimental scientists, the ACPred web server was established and made freely available online.
Keywords: anticancer peptide; classification; machine learning; random forest; support vector machine; therapeutic peptides
Year: 2019 PMID: 31121946 PMCID: PMC6571645 DOI: 10.3390/molecules24101973
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. Overview of the structural diversity of three classes of anticancer peptides. Each structure is labeled by its common name followed by the Protein Data Bank identifier (PDB ID) in parentheses on the subsequent line. In cases where the PDB ID was not available, the SWISS-MODEL server (available at: https://swissmodel.expasy.org/) was used to construct the structure.
Summary of existing methods for predicting anticancer peptides.
| Method (Year) | Classifier a | Sequence Features b | Interpretable | Web Server |
|---|---|---|---|---|
| AntiCP (2013) | SVM | AAC, DPC, BP | No | ✓ |
| Hajisharifi et al. (2014) | SVM | PseAAC, LAK | No | |
| ACPP (2015) | SVM | PRM | No | ✓ c |
| iACP (2016) | SVM | g-gap DPC | | ✓ |
| Li and Wang (2016) | SVM | AAC, RACC, acACS | No | |
| iACP-GAEnsC (2017) | Ensemble method | Pse-g-gap DPC, Am-PseAAC, RACC | No | |
| MLACP (2017) | RF | AAC, ATC, DPC, PCP | | ✓ |
| SAP (2018) | SVM | g-gap DPC | No | |
| TargetACP (2018) | SVM | CPSR, SAAC, PsePSSM | No | |
| ACPred (this study) | SVM | AAC, DPC, PCP, PseAAC, Am-PseAAC | | ✓ |
a RF: random forest, SVM: support vector machine. b AAC: amino acid composition, ATC: atomic composition, acACS: auto covariance of the average chemical shift, Am-PseAAC: amphiphilic pseudo amino acid composition, BP: binary profile, CPSR: composite protein sequence representation, DPC: dipeptide composition, g-gap DPC: g-gap dipeptide composition, LAK: local alignment kernel, PCP: physicochemical properties, PseAAC: pseudo amino acid composition, Pse-g-gap DPC: pseudo g-gap dipeptide composition, PsePSSM: pseudo position-specific scoring matrix, PRM: protein relatedness measure, RACC: reduced amino acid composition, SAAC: split amino acid composition. c The web server is not accessible.
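The two simplest descriptors in the table, AAC and DPC, can be illustrated with a short sketch. It assumes the usual definitions (AAC as the fraction of each of the 20 standard residues, DPC as the fraction of each of the 400 adjacent residue pairs); the function names and the example sequence are hypothetical and are not taken from the ACPred code.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(seq):
    """Amino acid composition: fraction of each of the 20 residues."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}

def dpc(seq):
    """Dipeptide composition: fraction of each of the 400 adjacent pairs."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence needs at least two residues")
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    return {a + b: counts[a + b] / len(pairs)
            for a, b in product(AMINO_ACIDS, repeat=2)}

# Illustrative 14-residue sequence (an assumption, not from the benchmark set).
peptide = "KLAKLAKKLAKLAK"
aac_vector = aac(peptide)   # 20 values that sum to 1
dpc_vector = dpc(peptide)   # 400 values that sum to 1
```

Each peptide thus becomes a fixed-length numeric vector (20 values for AAC, 400 for DPC) regardless of its length, which is what allows classifiers such as SVM or RF to be trained on sequences of different sizes.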
Figure 2. Schematic framework of ACPred.
Summary of the modelability index (MODI) derived from various types of peptide features on the benchmark dataset.
| Feature | MODI |
|---|---|
| AAC | 0.868 |
| DPC | 0.836 |
| PCP | 0.722 |
| PseAAC | 0.859 |
| Am-PseAAC | 0.859 |
| AAC + PseAAC | 0.859 |
| AAC + Am-PseAAC | 0.859 |
| PseAAC + Am-PseAAC | 0.856 |
| AAC + PseAAC + Am-PseAAC | 0.856 |
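For readers unfamiliar with MODI, the sketch below shows how such a value can be computed. It assumes the commonly used definition of the modelability index: for each class, the fraction of samples whose nearest neighbour (Euclidean distance) belongs to the same class, averaged over classes. The function name and the toy data are hypothetical, and the distance metric is an assumption rather than a detail stated in this record.

```python
import math

def modi(X, y):
    """Mean over classes of the fraction of samples whose nearest
    neighbour (excluding the sample itself) has the same class label."""
    same, total = {}, {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        best_j, best_d = None, math.inf
        for j, xj in enumerate(X):
            if i == j:
                continue
            d = math.dist(xi, xj)
            if d < best_d:
                best_j, best_d = j, d
        total[yi] = total.get(yi, 0) + 1
        same[yi] = same.get(yi, 0) + (1 if y[best_j] == yi else 0)
    return sum(same[c] / total[c] for c in total) / len(total)

# Toy one-dimensional feature vectors (hypothetical, not peptide data).
X = [(0.0,), (1.0,), (2.5,), (3.0,), (10.0,), (11.5,)]
y = [0, 0, 0, 1, 1, 1]
score = modi(X, y)  # two of three samples per class have a same-class neighbour
```

A value close to 1 indicates that the classes are well separated in the chosen feature space, which is why the index is used to compare descriptor sets before any model is trained.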
Performance comparison of SVM and RF with various types of sequence features over five-fold cross-validation.
| Feature | Classifier | Ac (%) | Sn (%) | Sp (%) | MCC | auROC |
|---|---|---|---|---|---|---|
| AAC | SVM | 92.69 | 83.94 | 98.54 | 0.850 | 0.977 |
| AAC | RF | 91.23 | 92.80 | 90.32 | 0.817 | 0.958 |
| DPC | SVM | 83.92 | 100.00 | 78.85 | 0.687 | 0.942 |
| DPC | RF | 87.14 | 91.89 | 84.85 | 0.733 | 0.944 |
| PCP | SVM | 84.80 | 63.50 | 99.02 | 0.698 | 0.938 |
| PCP | RF | 83.63 | 90.10 | 80.91 | 0.661 | 0.872 |
| PseAAC | SVM | 92.98 | 84.67 | 98.54 | 0.856 | 0.990 |
| PseAAC | RF | 92.11 | 93.65 | 91.20 | 0.835 | 0.959 |
| Am-PseAAC | SVM | 95.03 | 87.59 | 100.00 | 0.899 | 0.995 |
| Am-PseAAC | RF | 92.40 | 100.00 | 88.75 | 0.848 | 0.974 |
| AAC + PseAAC | SVM | 93.57 | 85.40 | 99.02 | 0.869 | 0.991 |
| AAC + PseAAC | RF | 92.98 | 96.69 | 90.95 | 0.855 | 0.964 |
| AAC + Am-PseAAC | SVM | 95.32 | 89.05 | 99.51 | 0.904 | 0.994 |
| AAC + Am-PseAAC | RF | 93.28 | 97.50 | 90.99 | 0.862 | 0.969 |
| PseAAC + Am-PseAAC | SVM | 94.44 | 86.86 | 99.51 | 0.887 | 0.994 |
| PseAAC + Am-PseAAC | RF | 92.69 | 98.28 | 89.82 | 0.851 | 0.967 |
| AAC + PseAAC + Am-PseAAC | SVM | 94.15 | 86.13 | 99.51 | 0.881 | 0.993 |
| AAC + PseAAC + Am-PseAAC | RF | 92.98 | 98.29 | 90.22 | 0.857 | 0.972 |
Performance comparison of SVM and RF with various types of sequence features over jackknife test.
| Feature | Classifier | Ac (%) | Sn (%) | Sp (%) | MCC | auROC |
|---|---|---|---|---|---|---|
| AAC | SVM | 92.98 | 83.94 | 99.02 | 0.857 | 0.978 |
| AAC | RF | 91.23 | 92.80 | 90.32 | 0.817 | 0.959 |
| DPC | SVM | 85.09 | 98.86 | 80.32 | 0.706 | 0.941 |
| DPC | RF | 86.84 | 89.66 | 85.40 | 0.725 | 0.947 |
| PCP | SVM | 84.80 | 63.50 | 99.02 | 0.698 | 0.937 |
| PCP | RF | 83.63 | 88.57 | 81.44 | 0.659 | 0.868 |
| PseAAC | SVM | 92.98 | 84.67 | 98.54 | 0.856 | 0.990 |
| PseAAC | RF | 93.28 | 97.50 | 90.99 | 0.862 | 0.960 |
| Am-PseAAC | SVM | 95.03 | 87.59 | 100.00 | 0.899 | 0.995 |
| Am-PseAAC | RF | 92.40 | 99.12 | 89.08 | 0.847 | 0.969 |
| AAC + PseAAC | SVM | 93.57 | 85.40 | 99.02 | 0.869 | 0.990 |
| AAC + PseAAC | RF | 93.28 | 98.31 | 90.63 | 0.863 | 0.962 |
| AAC + Am-PseAAC | SVM | 95.61 | 89.78 | 99.51 | 0.910 | 0.994 |
| AAC + Am-PseAAC | RF | 93.57 | 98.32 | 91.03 | 0.869 | 0.967 |
| PseAAC + Am-PseAAC | SVM | 93.86 | 85.40 | 99.51 | 0.875 | 0.992 |
| PseAAC + Am-PseAAC | RF | 93.57 | 99.15 | 90.67 | 0.870 | 0.959 |
| AAC + PseAAC + Am-PseAAC | SVM | 94.74 | 87.59 | 99.51 | 0.893 | 0.994 |
| AAC + PseAAC + Am-PseAAC | RF | 92.98 | 99.13 | 89.87 | 0.858 | 0.973 |
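Performance tables of this kind are usually summarized with accuracy (Ac), sensitivity (Sn), specificity (Sp) and the Matthews correlation coefficient (MCC). The following is a minimal sketch assuming the standard confusion-matrix definitions of these measures; the function name is hypothetical and the counts are made up for illustration, not taken from the study.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Return Ac, Sn and Sp in percent and MCC in [-1, 1]."""
    ac = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn)   # fraction of true ACPs that are recovered
    sp = tn / (tn + fp)   # fraction of true non-ACPs that are recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Ac": 100 * ac, "Sn": 100 * sn, "Sp": 100 * sp, "MCC": mcc}

# Hypothetical counts for a 100-positive / 200-negative test set.
result = binary_metrics(tp=90, fn=10, tn=180, fp=20)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it is less flattering when the classes are imbalanced, as they are in a dataset with more non-ACPs than ACPs.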
Figure 3. ROC curves of the RF (left) and SVM (right) models as assessed by 5-fold cross-validation (top) and by the jackknife test, also known as leave-one-out cross-validation (bottom).
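The jackknife test mentioned in the caption holds out one sample at a time, trains on the rest, and predicts the held-out sample. A minimal sketch of that loop follows; it uses a nearest-centroid classifier purely as a stand-in (the study used SVM and RF), and all names and data are hypothetical.

```python
import math

def centroid_predict(train_X, train_y, x):
    """Stand-in classifier: assign x to the class with the closest mean vector."""
    sums, counts = {}, {}
    for xi, yi in zip(train_X, train_y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for k, v in enumerate(xi):
            acc[k] += v
        counts[yi] = counts.get(yi, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
    return min(centroids, key=lambda c: math.dist(centroids[c], x))

def jackknife_accuracy(X, y, predict=centroid_predict):
    """Leave-one-out: each sample is predicted by a model that never saw it."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)

# Toy one-dimensional data (hypothetical); the last point sits nearer class 0.
X = [(0.0,), (1.0,), (2.0,), (8.0,), (9.0,), (4.0,)]
y = [0, 0, 0, 1, 1, 1]
loo_accuracy = jackknife_accuracy(X, y)
```

Because every sample is used for testing exactly once, the procedure gives a single deterministic estimate for a given dataset, at the cost of training one model per sample.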
Performance comparison of the proposed ACPred model with existing methods.
| Method a | Ac (%) | Sn (%) | Sp (%) | MCC |
|---|---|---|---|---|
| Hajisharifi et al. b | 92.68 | 89.70 | 85.18 | 0.78 |
| iACP b | 95.06 | 89.86 | 98.54 | 0.90 |
| iACP-GAEnsC b | 96.45 | 95.36 | 97.54 | 0.91 |
| TargetACP b | 96.22 | 94.20 | 97.57 | 0.92 |
| TargetACP c | 98.78 | 99.02 | 98.54 | 0.97 |
| ACPred b | 95.61 | 89.78 | 99.51 | 0.91 |
| ACPred-modified c | 97.56 | 96.08 | 99.02 | 0.95 |
a Results for the existing methods are taken from the TargetACP study. b Results obtained on the benchmark dataset consisting of 138 ACPs and 205 non-ACPs. c Results obtained on the balanced dataset consisting of 205 ACPs and 205 non-ACPs, generated by applying the SMOTE technique to the benchmark dataset.
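Footnote c refers to SMOTE, which balances a dataset by creating synthetic minority-class samples through interpolation between a real sample and one of its nearest minority-class neighbours. The sketch below is the generic algorithm under stated assumptions, not the authors' configuration: the function name, `k`, and `seed` are hypothetical choices. Going from 138 to 205 ACPs would correspond to `n_new = 67`.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: math.dist(x, m))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy two-dimensional minority class (hypothetical feature vectors).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=3, k=2)
```

Every synthetic point lies on a line segment between two real minority samples, so the method fills in the minority region rather than duplicating existing peptides.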
Amino acid compositions (%) of anticancer and non-anticancer peptides along with their difference and MDGI values. The rank of each amino acid among the 20 amino acids is shown in parentheses for the AAC difference and MDGI.
| Amino Acid | ACP (%) | Non-ACP (%) | Difference | p-value | MDGI |
|---|---|---|---|---|---|
| A-Ala | 7.623 | 11.005 | −3.383 (7) | <0.05 | 6.41 (11) |
| C-Cys | 3.906 | 8.015 | −4.109 (5) | <0.05 | 19.71 (2) |
| D-Asp | 2.417 | 3.418 | −1.002 (15) | <0.05 | 3.85 (15) |
| E-Glu | 1.707 | 3.523 | −1.816 (12) | <0.05 | 6.88 (9) |
| F-Phe | 6.823 | 2.41 | 4.413 (2) | <0.05 | 6.12 (12) |
| G-Gly | 1.975 | 4.123 | −2.148 (11) | <0.05 | 6.56 (10) |
| H-His | 1.536 | 5.798 | −4.262 (4) | <0.05 | 7.80 (7) |
| I-Ile | 10.072 | 6.98 | 3.092 (9) | <0.05 | 10.13 (5) |
| K-Lys | 2.542 | 1.651 | 0.892 (17) | 0.086 | 29.54 (1) |
| L-Leu | 8.099 | 3.739 | 4.36 (3) | <0.05 | 7.84 (6) |
| M-Met | 9.831 | 13.888 | −4.057 (6) | <0.05 | 4.34 (14) |
| N-Asn | 11.497 | 3.964 | 7.533 (1) | <0.05 | 4.98 (13) |
| P-Pro | 0.905 | 2.224 | −1.319 (13) | <0.05 | 7.37 (8) |
| Q-Gln | 5.385 | 2.711 | 2.674 (10) | <0.05 | 15.89 (3) |
| R-Arg | 4.211 | 7.495 | −3.283 (8) | <0.05 | 11.42 (4) |
| S-Ser | 6.537 | 5.832 | 0.705 (19) | 0.245 | 3.50 (16) |
| T-Thr | 3.781 | 4.704 | −0.923 (16) | 0.098 | 3.13 (17) |
| V-Val | 2.258 | 1.560 | 0.698 (20) | 0.083 | 2.96 (18) |
| W-Trp | 2.244 | 1.423 | 0.821 (18) | <0.05 | 2.51 (20) |
| Y-Tyr | 6.65 | 5.539 | 1.111 (14) | 0.091 | 2.90 (19) |
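The MDGI column ranks residues by how much splitting on them reduces Gini impurity in the random forest. The sketch below computes that reduction for a single threshold split on a single feature, assuming the standard Gini impurity definition; a random forest aggregates such decreases over every node and tree that uses the feature, and the exact aggregation and scaling behind the table are not stated in this record. The function names and data are hypothetical.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_decrease(values, labels, threshold):
    """Impurity of the parent node minus the size-weighted impurity of the
    two children produced by the split `value <= threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# Hypothetical lysine fractions for six peptides, split at K <= 0.094.
k_fraction = [0.02, 0.03, 0.05, 0.12, 0.15, 0.20]
labels = ["non-ACP", "non-ACP", "non-ACP", "ACP", "ACP", "ACP"]
decrease = gini_decrease(k_fraction, labels, threshold=0.094)
```

In this toy case the split separates the classes perfectly, so the decrease equals the full parent impurity of 0.5; features that produce large decreases across many trees receive high MDGI values.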
Figure 4. Heat map of the mean decrease of Gini index of dipeptide compositions.
Ten top-ranked physicochemical properties from the AAindex having the highest MDGI values.
| Rank | AAindex | Categorized Property | Description | MDGI |
|---|---|---|---|---|
| 1 | ARGP820101 | Hydrophobicity | Hydrophobicity index (Argos et al., 1982) | 1.51 |
| 2 | ARGP820102 | Hydrophobicity | Signal sequence helical potential (Argos et al., 1982) | 1.40 |
| 3 | BHAR880101 | Hydrophobicity | Average flexibility indices (Bhaskaran-Ponnuswamy, 1988) | 1.08 |
| 4 | ARGP820103 | Hydrophobicity | Membrane-buried preference parameters (Argos et al., 1982) | 1.04 |
| 5 | BEGF750102 | Beta propensity | Conformational parameter of beta-structure (Beghin-Dirkx, 1975) | 1.00 |
| 6 | BEGF750101 | Alpha and turn propensities | Conformational parameter of inner helix (Beghin-Dirkx, 1975) | 0.92 |
| 7 | BIGC670101 | Physicochemical properties | Residue volume (Bigelow, 1967) | 0.91 |
| 8 | BEGF750103 | Alpha and turn propensities | Conformational parameter of beta-turn (Beghin-Dirkx, 1975) | 0.87 |
| 9 | BIOV880102 | Hydrophobicity | Information value for accessibility; average fraction 23% (Biou et al., 1988) | 0.85 |
| 10 | ISOY800107 | Hydrophobicity | Normalized relative frequency of double bend (Isogai et al., 1980) | 0.83 |
Eight if-then rules for the prediction of anticancer peptides using random forest and amino acid composition.
| No. | Rule | Covered Samples | Misclassified Samples | Ac (%) |
|---|---|---|---|---|
| 1 | G > 0.041 and I > 0.0615 and L ≤ 0.1715 and K > 0.0385 and M ≤ 0.027 | 48 | 1 | 97.92 |
| 2 | R ≤ 0.0515 and Q ≤ 0.026 and K > 0.094 | 48 | 1 | 97.92 |
| 3 | C > 0.1145 and P ≤ 0.073 | 35 | 0 | 100.00 |
| 4 | L ≤ 0.093 and F > 0.0715 and S ≤ 0.152 | 33 | 4 | 87.88 |
| 5 | A ≤ 0.0145 and Q ≤ 0.026 | 26 | 4 | 84.62 |
| 6 | R ≤ 0.0665 and E ≤ 0.044 and H > 0.052 | 21 | 4 | 80.95 |
| 7 | I > 0.108 and K > 0.055 | 42 | 6 | 85.71 |
| 8 | E ≤ 0.0545 and G > 0.0365 and K > 0.108 and M ≤ 0.04 | 49 | 1 | 97.96 |
Twenty if-then rules for discriminating ACP from non-ACP using random forest and amino acid composition.
| % Covered Samples | Rule | Prediction Result |
|---|---|---|
| 14.33 | R > 0.017 and C ≤ 0.1145 and H ≤ 0.0645 and L > 0.1605 and K ≤ 0.1035 | non-ACP |
| 13.74 | G > 0.041 and I > 0.0615 and L ≤ 0.1715 and K > 0.0385 and M ≤ 0.027 | ACP |
| 9.65 | E > 0.058 and G ≤ 0.068 | non-ACP |
| 7.89 | R ≤ 0.0515 and Q ≤ 0.026 and K > 0.094 | ACP |
| 7.89 | C ≤ 0.1145 and H ≤ 0.0645 and K ≤ 0.1055 and W > 0.028 | non-ACP |
| 5.26 | C > 0.1145 and P ≤ 0.073 | ACP |
| 4.97 | N > 0.019 and K ≤ 0.1705 and M > 0.037 | non-ACP |
| 4.09 | Q > 0.098 and K ≤ 0.0575 | non-ACP |
| 3.22 | C > 0.015 and C ≤ 0.043 | non-ACP |
| 3.80 | L ≤ 0.093 and F > 0.0715 and S ≤ 0.152 | ACP |
| 6.43 | R > 0.013 and R ≤ 0.1345 and C ≤ 0.1205 and H ≤ 0.0465 and F ≤ 0.1345 | non-ACP |
| 2.63 | A ≤ 0.0145 and Q ≤ 0.026 | ACP |
| 3.22 | I ≤ 0.0335 and P > 0.074 | non-ACP |
| 2.34 | R ≤ 0.0665 and E ≤ 0.044 and H > 0.052 | ACP |
| 1.75 | N ≤ 0.0565 and H ≤ 0.0415 and K ≤ 0.0335 | non-ACP |
| 1.46 | I > 0.108 and K > 0.055 | ACP |
| 1.17 | E ≤ 0.0545 and G > 0.0365 and K > 0.108 and M ≤ 0.04 | ACP |
| 2.05 | I ≤ 0.0665 and F ≤ 0.006 | non-ACP |
| 2.05 | S ≤ 0.04 | non-ACP |
| 2.05 | Else | ACP |
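The rule table can be read as a decision list. The sketch below encodes the rules exactly as listed and applies them to an amino acid composition vector, under two assumptions that this record does not confirm: the rules are evaluated top to bottom with the first match deciding, and the compositions are fractions between 0 and 1 (consistent with thresholds such as 0.1145). The function names are hypothetical.

```python
# (rule, prediction) pairs in table order; "<=" stands for the table's "≤".
RULES = [
    ("R > 0.017 and C <= 0.1145 and H <= 0.0645 and L > 0.1605 and K <= 0.1035", "non-ACP"),
    ("G > 0.041 and I > 0.0615 and L <= 0.1715 and K > 0.0385 and M <= 0.027", "ACP"),
    ("E > 0.058 and G <= 0.068", "non-ACP"),
    ("R <= 0.0515 and Q <= 0.026 and K > 0.094", "ACP"),
    ("C <= 0.1145 and H <= 0.0645 and K <= 0.1055 and W > 0.028", "non-ACP"),
    ("C > 0.1145 and P <= 0.073", "ACP"),
    ("N > 0.019 and K <= 0.1705 and M > 0.037", "non-ACP"),
    ("Q > 0.098 and K <= 0.0575", "non-ACP"),
    ("C > 0.015 and C <= 0.043", "non-ACP"),
    ("L <= 0.093 and F > 0.0715 and S <= 0.152", "ACP"),
    ("R > 0.013 and R <= 0.1345 and C <= 0.1205 and H <= 0.0465 and F <= 0.1345", "non-ACP"),
    ("A <= 0.0145 and Q <= 0.026", "ACP"),
    ("I <= 0.0335 and P > 0.074", "non-ACP"),
    ("R <= 0.0665 and E <= 0.044 and H > 0.052", "ACP"),
    ("N <= 0.0565 and H <= 0.0415 and K <= 0.0335", "non-ACP"),
    ("I > 0.108 and K > 0.055", "ACP"),
    ("E <= 0.0545 and G > 0.0365 and K > 0.108 and M <= 0.04", "ACP"),
    ("I <= 0.0665 and F <= 0.006", "non-ACP"),
    ("S <= 0.04", "non-ACP"),
]

def matches(rule, composition):
    """True if every 'residue op threshold' condition in the rule holds."""
    for condition in rule.split(" and "):
        residue, op, threshold = condition.split()
        value = composition.get(residue, 0.0)  # absent residues count as 0
        if op == ">" and not value > float(threshold):
            return False
        if op == "<=" and not value <= float(threshold):
            return False
    return True

def classify(composition, default="ACP"):
    """Return the label of the first matching rule; 'Else' maps to default."""
    for rule, label in RULES:
        if matches(rule, composition):
            return label
    return default

# Composition of an illustrative lysine-rich peptide (hypothetical input).
lysine_rich = {"K": 6 / 14, "L": 4 / 14, "A": 4 / 14}
prediction = classify(lysine_rich)
```

With this input the first three rules fail and the fourth (low R, low Q, high K) fires, which illustrates how a single rule can be traced back to a concrete compositional property of the peptide.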
Figure 5. Screenshots of the ACPred web server before (A) and after (B) submission of sequence data for prediction.