Gai-Fang Dong1, Lei Zheng2, Sheng-Hui Huang2, Jing Gao1, Yong-Chun Zuo2.
Abstract
Antimicrobial peptides (AMPs) are considered potential substitutes for antibiotics in the design of new anti-infective drugs. Several machine learning algorithms and web servers already exist for identifying AMPs and their functional activities, but there is still room for improvement in both the prediction algorithms and the feature extraction methods. Reduced amino acid (RAA) alphabets simplify protein complexity and help recognize structurally conserved regions. This article evaluates more than 5,000 reduced amino acid descriptors, generated from 74 types of reduced amino acid alphabet, in each of two stages in order to construct a two-stage classifier, Identification of Antimicrobial Peptides by Reduced Amino Acid Cluster (iAMP-RAAC): the first stage identifies AMPs, and the second stage identifies their functional activities. The first-stage AMP classifier achieves an accuracy of 97.21% on the training data set and 97.11% on the independent test data set. The second-stage classifier also performs well: at least three of the four metrics, sensitivity (SN), specificity (SP), accuracy (ACC), and Matthews correlation coefficient (MCC), exceed the results reported in the literature. Analysis of variance (ANOVA) with incremental feature selection (IFS) is then applied, which further improves the prediction performance of each stage. Finally, a user-friendly web server, iAMP-RAAC, is available at http://bioinfor.imu.edu.cn/iampraac.
Keywords: antimicrobial peptide; identification; reduced amino acid alphabet; supporting vector machine; two-stage classifier
Year: 2021 PMID: 33959153 PMCID: PMC8093877 DOI: 10.3389/fgene.2021.669328
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. The overall framework of the classifier. The training data set from DS1, or each of the seven training data sets from DS2, is processed separately through amino acid reduction, dipeptide feature extraction, support vector machine (SVM) model training, and model evaluation by 10-fold cross-validation. The feature file with the highest accuracy is then identified, together with its reduction type and cluster size. Next, either the features retained after feature selection or all features from the best feature file are used for model training. Finally, the independent test set is used to test the performance of the model, and the trained model is deployed in the web server to provide the two-stage prediction service.
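The dipeptide feature extraction step in this framework can be illustrated with a minimal sketch. It assumes plain adjacent-pair dipeptide composition over the cluster representatives, so an alphabet of n clusters yields n × n features (19 × 19 = 361 for a 19-cluster alphabet, which matches the feature count reported for iAMP-RAAC before feature selection). The exact descriptor variant used by the authors is not stated here, and the function name `dipeptide_composition` is hypothetical.

```python
from itertools import product


def dipeptide_composition(reduced_seq, alphabet):
    """Return dipeptide frequencies of `reduced_seq` over `alphabet`.

    `alphabet` holds one representative letter per cluster; the output
    has len(alphabet) ** 2 entries in a fixed, reproducible order.
    """
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(reduced_seq) - 1):
        pair = reduced_seq[i:i + 2]
        if pair in counts:  # skip pairs containing unexpected letters
            counts[pair] += 1
    total = max(len(reduced_seq) - 1, 1)  # avoid division by zero
    return [counts[p] / total for p in pairs]


# Example input: the reduction type 1, cluster size 5 sequence from the
# reduction table, whose cluster representatives are L, A, F, E and K.
alphabet = "LAFEK"
features = dipeptide_composition("AEEKALFLAEAKAAKAKL", alphabet)
print(len(features))  # 25 features for a 5-letter reduced alphabet
```

Normalizing by the number of adjacent pairs keeps the feature vectors comparable across peptides of different lengths, which is a common choice for composition features but is an assumption here.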
Number of AMPs for the seven AMP functional activities in the training and testing sets of DS1 and DS2.
| Activities | Positive samples (training/testing) | Negative samples (training/testing) |
| Anti-parasitic | 140/60 | 700/1,914 |
| Anti-viral | 1,400/601 | 2,451/1,374 |
| Anti-cancer | 219/94 | 1,095/1,881 |
| Targeting mammals | 215/93 | 1,075/1,882 |
| Anti-fungal | 1,912/820 | 1,261/1,155 |
| TGPB | 1,930/828 | 1,624/1,147 |
| TGNB | 1,931/828 | 1,635/1,147 |
Reduced descriptors when the reduction type is 1 and the cluster size ranges from 2 to 19.
| Cluster Size | Reduced amino acid cluster | Sequence after reduction |
| 2 | LVIMCAGSTPFYW-EDNQKRH | LEEELLLLLELELLELEL |
| 3 | LASGVTIPMC-EKRDNQH-FYW | LEEELLFLLELELLELEL |
| 4 | LVIMC-AGSTP-FYW-EDNQKRH | AEEEALFLAEAEAAEAEL |
| 5 | LVIMC-AGSTP-FYW-EDNQ-KRH | AEEKALFLAEAKAAKAKL |
| 6 | LVIM-AGST-PHC-FYW-EDNQ-KR | AEEKPLFLPEPKPPPPKL |
| 8 | LVIMC-AG-ST-P-FYW-EDNQ-KR-H | AEEKPLFLPEPKPPHPKL |
| 10 | LVIM-C-A-G-ST-P-FYW-EDNQ-KR-H | GEEKPLFLPEPKPPHPKL |
| 12 | LVIM-C-A-G-ST-P-FY-W-EQ-DN-KR-H | GDDKPLFLPEPKPPHPKL |
| 15 | LVIM-C-A-G-S-T-P-FY-W-E-D-N-Q-KR-H | GNNKPLFLPQPKPPHPKL |
| 18 | LM-VI-C-A-G-S-T-P-F-Y-W-E-D-N-Q-K-R-H | GNNRPVYVPQPRPPHPRV |
| 20 | L-V-I-M-C-A-G-S-T-P-F-Y-W-E-D-N-Q-K-R-H | GNNRPVYIPQPRPPHPRI (original sequence) |
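The reduction shown in this table can be reproduced with a short sketch. It assumes, as the table rows suggest, that each residue is replaced by the first letter of the cluster it belongs to; the function names `build_residue_map` and `reduce_sequence` are hypothetical and not taken from the iAMP-RAAC code.

```python
def build_residue_map(clusters):
    """Map every amino acid to the first residue of its cluster.

    `clusters` uses the table's notation, e.g. "LVIMC-AGSTP-FYW-EDNQ-KRH".
    """
    mapping = {}
    for group in clusters.split("-"):
        group = group.strip()
        for residue in group:
            mapping[residue] = group[0]
    return mapping


def reduce_sequence(sequence, clusters):
    """Rewrite `sequence` in the reduced alphabet defined by `clusters`."""
    mapping = build_residue_map(clusters)
    # Assumption: letters outside the listed clusters are kept unchanged.
    return "".join(mapping.get(res, res) for res in sequence.upper())


original = "GNNRPVYIPQPRPPHPRI"
print(reduce_sequence(original, "LVIMC-AGSTP-FYW-EDNQ-KRH"))
# AEEKALFLAEAKAAKAKL (the cluster size 5 row)
```

Applying the same function with the cluster size 2 string `LVIMCAGSTPFYW-EDNQKRH` gives `LEEELLLLLELELLELEL`, in agreement with the first row of the table.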
Performance comparison of iAMP-RAAC and three other methods on the DS1 training set, based on 10-fold cross-validation. BFS, before feature selection; AFS, after feature selection.
| Method | SN (%) | SP (%) | ACC (%), BFS/AFS | MCC (%) | Number of features, BFS/AFS |
| iAMP-RAAC | 84.30 | 98.94 | 97.21/97.23 | 82.84 | 361/336 |
| AMPfun | 94.88 | 95.11 | 95.09/− | 77.06 | 9,367/2,452 |
| SVM | 94.33 | 94.29 | 94.3/− | 74.47 | −/− |
| DT | 83.40 | 98.26 | 96.87/− | 81.47 | −/− |
FIGURE 2. Heat map of ACC values for reduction types 1 to 20 and cluster sizes 2 to 19 on the DS1 training data set. The color gradient from green to red indicates increasing ACC, and cells marked “None” indicate that no reduction descriptor exists for the corresponding combination of reduction type and cluster size.
FIGURE 3. Feature selection process for reduction type 5 and cluster size 19 in the first stage on the DS1 training set. The horizontal axis represents the number of features, and the vertical axis represents ACC. The number of selected features and the corresponding ACC are marked on the curve.
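The ANOVA-plus-IFS procedure behind this curve can be sketched as follows: features are ranked by their one-way ANOVA F-value, then added one at a time while an evaluator scores each growing subset. This is only an illustration under stated assumptions. The `evaluate` callback is hypothetical; in the framework described here it would be the 10-fold cross-validated ACC of an SVM, whereas the stand-in below uses a leave-one-out nearest-centroid rule so that the example needs only the standard library.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single feature across the classes."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def incremental_feature_selection(X, y, evaluate):
    """Rank features by F-value, then grow the subset one feature at a time.

    Returns the indices of the best-scoring subset and its score.
    """
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    ranking = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    best_acc, best_k = -1.0, 0
    for k in range(1, n_features + 1):
        subset = ranking[:k]
        acc = evaluate([[row[j] for j in subset] for row in X], y)
        if acc > best_acc:  # keep the smallest subset that reaches the peak
            best_acc, best_k = acc, k
    return ranking[:best_k], best_acc


def loo_centroid_accuracy(X, y):
    """Stand-in evaluator (assumption): leave-one-out nearest-centroid ACC."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            centroids[label] = [mean(col) for col in zip(*rows)]
        pred = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(X[i], centroids[c])),
        )
        correct += pred == y[i]
    return correct / len(X)


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.10, 0.5], [0.20, 0.1], [0.15, 0.9],
     [0.90, 0.4], [0.80, 0.8], [0.85, 0.2]]
y = [0, 0, 0, 1, 1, 1]
selected, acc = incremental_feature_selection(X, y, loo_centroid_accuracy)
print(selected, acc)
```

Plotting the score of every subset size against the number of features would give a curve of the kind described in the caption, with the peak marking the selected feature count.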
Performance comparison of iAMP-RAAC and AMPfun on the DS1 independent test set.
| Method | SN (%) | SP (%) | ACC (%) | MCC (%) | AUC (%) | Number of Features |
| iAMP-RAAC | 88.44 | 97.91 | 97.11 | 82.24 | 98.47 | 361 |
| AMPfun | – | – | – | – | 98.94 | 2,452 |
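The SN, SP, ACC, and MCC columns in these tables all derive from the confusion matrix of a binary classifier. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study's data.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute SN, SP, ACC and MCC (as fractions) from confusion counts."""
    sn = tp / (tp + fn)                    # sensitivity: recall on positives
    sp = tn / (tn + fp)                    # specificity: recall on negatives
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts for an imbalanced test set: 100 positives, 1,000 negatives.
metrics = binary_metrics(tp=80, tn=900, fp=100, fn=20)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

On imbalanced data, ACC is dominated by the majority class and can remain high while SN is low, which is one reason MCC is usually reported alongside it.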
Performance comparison of iAMP-RAAC and RF (Chung et al., 2019) on the DS2 training sets for the seven AMP functional activities, based on 10-fold cross-validation.
| Activity | Method | SN (%) | SP (%) | ACC (%) | MCC (%) |
| Anti-parasitic | iAMP-RAAC | 50.00 | 96.43 | 88.69 | 54.65 |
| Anti-parasitic | RF | 75.26 | 83.66 | 82.02 | 49.55 |
| Anti-viral | iAMP-RAAC | 88.21 | 94.70 | 92.34 | 83.41 |
| Anti-viral | RF | 91.09 | 93.24 | 92.47 | 83.82 |
| Anti-cancer | iAMP-RAAC | 52.12 | 97.99 | 90.34 | 61.19 |
| Anti-cancer | RF | 76.73 | 78.88 | 78.55 | 45.07 |
| Targeting mammals | iAMP-RAAC | 69.72 | 96.93 | 92.40 | 71.20 |
| Targeting mammals | RF | 86.77 | 88.93 | 88.53 | 66.20 |
| Anti-fungal | iAMP-RAAC | 91.27 | 78.58 | 86.23 | 71.04 |
| Anti-fungal | RF | 85.73 | 85.53 | 85.65 | 70.50 |
| TGPB | iAMP-RAAC | 89.90 | 88.61 | 89.31 | 78.51 |
| TGPB | RF | 88.52 | 88.48 | 88.51 | 76.87 |
| TGNB | iAMP-RAAC | 90.58 | 87.83 | 89.32 | 78.50 |
| TGNB | RF | 88.05 | 88.15 | 88.09 | 76.06 |
FIGURE 4. Results of feature selection for the seven AMP functional activities. The horizontal axis represents the number of features, and the vertical axis represents ACC. The number of selected features and the corresponding ACC are marked on each curve.
Performance comparison of iAMP-RAAC and other methods on the DS2 independent test sets for the seven AMP functional activities.
| Activity | Method | SN (%) | SP (%) | ACC (%) | MCC (%) |
| Anti-parasitic | iAMP-RAAC | 14.10 | 97.91 | 91.29 | 18.88 |
| Anti-parasitic | AMPfun | 61.67 | 77.32 | 76.85 | 15.70 |
| Anti-viral | iAMP-RAAC | 76.64 | 95.05 | 88.51 | 74.58 |
| Anti-viral | AMPfun | 90.85 | 84.06 | 86.13 | 70.75 |
| Anti-viral | iAMPpred | 31.28 | 39.59 | 37.06 | -26.82 |
| Anti-viral | AVPpred | 24.09 | 88.57 | 69.01 | 16.43 |
| Anti-cancer | iAMP-RAAC | 30.48 | 97.93 | 91.54 | 39.07 |
| Anti-cancer | AMPfun | 77.66 | 70.60 | 70.94 | 22.08 |
| Anti-cancer | MLACP | 72.34 | 75.12 | 74.99 | 22.72 |
| Targeting mammals | iAMP-RAAC | 25.66 | 98.00 | 89.72 | 35.56 |
| Targeting mammals | AMPfun | 78.49 | 80.45 | 80.35 | 29.98 |
| Anti-fungal | iAMP-RAAC | 63.61 | 91.21 | 74.73 | 54.57 |
| Anti-fungal | AMPfun | 85.61 | 66.75 | 74.58 | 51.86 |
| Anti-fungal | iAMPpred | 66.10 | 72.12 | 69.62 | 37.96 |
| TGPB | iAMP-RAAC | 67.03 | 90.09 | 77.16 | 57.45 |
| TGPB | AMPfun | 88.77 | 63.73 | 74.23 | 52.54 |
| TGNB | iAMP-RAAC | 68.28 | 89.37 | 77.92 | 58.21 |
| TGNB | AMPfun | 85.75 | 65.74 | 74.13 | 51.16 |