Xian-Gan Chen, Wen Zhang, Xiaofei Yang, Chenhong Li, Hengling Chen.
Abstract
Anticancer peptides (ACPs) offer a promising avenue for cancer treatment, and predicting ACPs is important for discovering new anticancer drugs. Identifying ACPs experimentally is time-consuming and expensive, so computational methods for ACP identification are urgently needed. Many effective computational methods, especially machine learning-based ones, have been proposed for such predictions. Most current machine learning methods try to find suitable features or design effective feature learning techniques to represent ACPs accurately. However, the performance of these methods can be further improved when the number of samples is insufficient. In this article, we propose an ACP prediction model called ACP-DA (Data Augmentation), which augments the training data to compensate for insufficient samples and improve prediction performance. To better exploit the information in peptide sequences, each peptide is represented by integrating binary profile features and AAindex features, and the samples in the training set are then augmented in the feature space. The augmented samples are used to train the machine learning model, which is then used to predict ACPs. ACP-DA outperforms existing methods and achieves better ACP prediction performance than a method without data augmentation. The proposed method is available at http://github.com/chenxgscuec/ACPDA.
Keywords: anticancer peptide prediction; data augmentation; feature representation; machine learning; multilayer perceptron
Year: 2021 PMID: 34276801 PMCID: PMC8279753 DOI: 10.3389/fgene.2021.698477
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Flowchart of ACP-DA. Binary profile features (BPFs) and AAindex features after feature selection were concatenated to represent peptides, and the samples in the training set were augmented in the feature space. The augmented samples were used to train the multilayer perceptron (MLP) model, which was then used to predict anticancer peptides (ACPs).
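The feature representation step in this flowchart can be illustrated with a minimal sketch. It assumes that a binary profile feature is a one-hot encoding of the first `length` residues, with shorter peptides zero-padded. The names `binary_profile`, `property_profile`, `represent`, and `TOY_PROPERTIES`, the default `length=40`, and the example sequence are all hypothetical. The descriptor table holds placeholder numbers rather than real AAindex values, and the feature selection step is omitted.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Placeholder per-residue descriptors, NOT real AAindex values (assumption).
TOY_PROPERTIES = {aa: [i / 19.0] for i, aa in enumerate(AMINO_ACIDS)}


def binary_profile(sequence, length=40):
    """One-hot encode the first `length` residues; missing positions stay all-zero."""
    vector = []
    for position in range(length):
        residue = sequence[position] if position < len(sequence) else None
        vector.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vector


def property_profile(sequence, table, length=40):
    """Look up each residue's descriptors; pad missing or unknown residues with zeros."""
    width = len(next(iter(table.values())))
    vector = []
    for position in range(length):
        residue = sequence[position] if position < len(sequence) else None
        vector.extend(table.get(residue, [0.0] * width))
    return vector


def represent(sequence, table=TOY_PROPERTIES, length=40):
    """Concatenate both feature blocks into one vector."""
    return binary_profile(sequence, length) + property_profile(sequence, table, length)


# Arbitrary illustrative sequence, not taken from ACP740 or ACP240.
features = represent("ACDKKLLG")
print(len(features))  # 40 * 20 one-hot values plus 40 * 1 descriptor values
```

Fixing the window length gives every peptide a vector of the same size, which a fixed-input classifier such as an MLP needs.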
FIGURE 2. Sequence length statistics for all peptides in the ACP740 and ACP240 datasets.
Performance of ACP-DA with different parameters based on ACP740 (the best metrics are in bold).
| 40 | 100% | 81.89 | 80.59 | 83.23 | | |
| 40 | 200% | 82.02 | 83.46 | 80.89 | 64.56 | |
| 40 | 300% | 81.49 | 82.89 | 80.88 | 82.15 | 63.40 |
| 50 | 100% | 80.41 | 83.35 | 79.02 | 81.88 | 62.59 |
| 50 | 200% | 81.51 | 84.57 | 79.36 | 64.68 | |
| 50 | 300% | 80.27 | 77.23 | 73.35 | 61.17 | |
| 60 | 100% | 79.19 | 80.18 | 79.54 | 78.85 | 58.89 |
| 60 | 200% | 78.37 | 77.72 | 81.67 | 75.01 | 57.21 |
| 60 | 300% | 79.73 | 79.14 | 81.93 | 77.47 | 59.61 |
Performance of ACP-DA with different parameters based on ACP240 (the best metrics are in bold).
| 40 | 100% | 85.42 | 83.43 | 92.28 | 77.59 | 71.57 |
| 40 | 200% | 87.92 | 87.17 | 91.48 | 83.91 | 76.03 |
| 40 | 300% | 88.37 | | | | |
| 50 | 100% | 85.00 | 84.71 | 88.43 | 81.11 | 70.10 |
| 50 | 200% | 83.75 | 84.80 | 86.12 | 81.10 | 68.10 |
| 50 | 300% | 85.42 | 86.48 | 86.86 | 83.83 | 71.03 |
| 60 | 100% | 86.25 | 84.35 | 92.28 | 79.37 | 72.97 |
| 60 | 200% | 87.08 | 86.89 | 90.74 | 83.04 | 74.64 |
| 60 | 300% | 87.92 | 85.70 | 81.11 | 76.26 | |
FIGURE 3. Comparison of prediction models using BPFs, the AAindex, the k-mer sparse matrix (k-mer), and their concatenations based on ACP740 and ACP240.
FIGURE 4. Comparison of the prediction models with and without data augmentation based on (A) ACP740 and (B) ACP240.
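Augmenting samples in the feature space, as compared in this figure, can be sketched as follows. This record does not say how new samples are generated, so the sketch assumes each new sample is a training vector plus small Gaussian noise, keeping the label of the vector it was derived from. The function name `augment` and the parameters `ratio`, `scale`, and `seed` are hypothetical. Here `ratio=1.0` is taken to mean adding as many new samples as there are originals.

```python
import random


def augment(features, labels, ratio=1.0, scale=0.01, seed=0):
    """Return the original samples plus round(ratio * n) noisy copies.

    Gaussian perturbation is an assumption made for this sketch; the
    perturbation scheme used by ACP-DA is not described in this record.
    """
    rng = random.Random(seed)
    new_features = [list(row) for row in features]
    new_labels = list(labels)
    n_new = round(ratio * len(features))
    for _ in range(n_new):
        index = rng.randrange(len(features))  # pick a training sample at random
        noisy = [value + rng.gauss(0.0, scale) for value in features[index]]
        new_features.append(noisy)
        new_labels.append(labels[index])  # the copy inherits the source label
    return new_features, new_labels


# Toy training set: two 2-dimensional feature vectors with binary labels.
train_x = [[0.0, 1.0], [1.0, 0.0]]
train_y = [1, 0]
aug_x, aug_y = augment(train_x, train_y, ratio=2.0)
print(len(aug_x), len(aug_y))
```

Only the training set should be augmented; perturbing validation or test samples would bias the comparison between models trained with and without augmentation.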
FIGURE 5. Comparison of ACP-DA with existing methods on (A) ACP740 and (B) ACP240.