Onkar Singh, Wen-Lian Hsu, Emily Chia-Yu Su.
Abstract
BACKGROUND: Antimicrobial peptides (AMPs) are oligopeptides that act as crucial components of innate immunity, naturally occur in all multicellular organisms, and are involved in the first line of defense function. Recent studies showed that AMPs perpetuate great potential that is not limited to antimicrobial activity. They are also crucial regulators of host immune responses that can modulate a wide range of activities, such as immune regulation, wound healing, and apoptosis. However, a microorganism's ability to adapt and to resist existing antibiotics triggered the scientific community to develop alternatives to conventional antibiotics. Therefore, to address this issue, we proposed Co-AMPpred, an in silico-aided AMP prediction method based on compositional features of amino acid residues to classify AMPs and non-AMPs.Entities:
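The method rests on composition-based features of amino acid residues. As a minimal sketch (not the authors' code), the simplest such descriptor, amino acid composition (AAC, the 20-dimensional feature listed later in this record), can be computed like this:

```python
# Sketch of the amino acid composition (AAC) descriptor:
# the fraction of each of the 20 standard residues in a peptide.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac_features(sequence: str) -> list[float]:
    """Return a 20-dimensional vector of residue fractions."""
    seq = sequence.upper()
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

# Hypothetical short cationic peptide fragment for illustration
features = aac_features("KWKLFKKI")
print(len(features))  # 20 features, one per residue type
```

The same counting idea extends to the richer descriptors used in the paper (dipeptide composition counts ordered residue pairs, giving 400 features).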
Keywords: Amino acid composition; Antimicrobial peptide; Composition-based feature; Machine learning
Year: 2021 PMID: 34330209 PMCID: PMC8325260 DOI: 10.1186/s12859-021-04305-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1Two-sample logo shows the preference of positively charged and hydrophobic residues in antimicrobial peptides (AMPs) and non-AMPs at different positions. The first 15 positions represent the N-terminus of peptides, and the last 15 positions represent the C-terminus of peptides
Fig. 2Average percentages of amino acid compositions (AACs) in antimicrobial peptides (AMPs) and non-AMPs
Performances of machine-learning models on the benchmark training and independent test datasets. Values shown are mean ± SD for the training dataset
| Algorithm | Dataset | Accuracy | AUROC | Recall | Precision | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| GBC | Training | 75.0% ± 0.038 | 0.816 ± 0.035 | 77.4% ± 0.082 | 73.9% ± 0.033 | 0.500 ± 0.0755 | 0 ± 0.075 |
| | Test | | | | | | |
| CatBoost | Training | 74.4% ± 0.055 | 0.815 ± 0.045 | 75.3% ± 0.107 | 73.9% ± 0.045 | 0.488 ± 0.110 | 0.492 ± 0.109 |
| | Test | 78.7% | 0.879 | 78.7% | 78.7% | 0.574 | 0.574 |
| LGBM | Training | 73.8% ± 0.060 | 0.810 ± 0.052 | 73.3% ± 0.124 | 73.8% ± 0.039 | 0.476 ± 0.102 | 0.479 ± 0.099 |
| | Test | 77.6% | 0.868 | 78.7% | 77.0% | 0.553 | 0.553 |
| ETC | Training | 74.3% ± 0.055 | 0.794 ± 0.066 | 75.0% ± 0.097 | 73.9% ± 0.049 | 0.487 ± 0.109 | 0.491 ± 0.108 |
| | Test | 77.6% | 0.776 | 77.6% | 77.6% | 0.553 | 0.553 |
| RF | Training | 74.1% ± 0.044 | 0.798 ± 0.052 | 75.5% ± 0.101 | 73.1% ± 0.039 | 0.482 ± 0.088 | 0.487 ± 0.086 |
| | Test | 78.1% | 0.811 | 78.7% | 77.8% | 0.563 | 0.563 |
Values in bold indicate the top performance of the model on the test dataset
GBC, gradient boosting classifier; LGBM, light gradient boosting machine; ETC, extra trees classifier; RF, random forest; AUROC, area under the receiver operating characteristic curve; MCC, Matthews correlation coefficient; SD, standard deviation
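The metrics in the table all derive from a 2x2 confusion matrix. A minimal sketch (the authors' exact evaluation code is not given here): with hypothetical balanced counts tp = tn = 74 and fp = fn = 20 on a 188-peptide test set, the formulas reproduce values such as accuracy ≈ 78.7% and MCC ≈ 0.574, matching the CatBoost test row.

```python
import math

def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, recall (sensitivity), precision, and Matthews
    correlation coefficient from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": acc, "recall": recall,
            "precision": precision, "mcc": mcc}

# Hypothetical counts: 94 AMPs and 94 non-AMPs in the test set
m = classification_metrics(tp=74, fp=20, tn=74, fn=20)
print(m)  # accuracy ~0.787, MCC ~0.574
```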
Fig. 3 The area under the receiver operating characteristic (AUROC) curve shows the performance of models developed using selected features on the independent test dataset
Fig. 4 Sequence length distributions of the training and test datasets. a) DEEP-AMP30 training vs. independent test dataset. b) DEEP-AMP30 training vs. iAMP-2L independent test dataset
List of all descriptors along with their abbreviations and numbers of features
| Feature type | Descriptor | Abbreviation | No. of features |
|---|---|---|---|
| Simple composition | Amino acid composition | AAC | 20 |
| | Dipeptide composition | DPC | 400 |
| | Atom-type composition | ATC | 5 |
| | Bond-type composition | BTC | 4 |
| Physicochemical properties | Amino acid index | AAI | 553 |
| | Physicochemical property | PCP | 30 |
| Distribution & repeats | Distance distribution of repeats | DDR | 20 |
| | Residue repeat information | RRI | 20 |
| | Property repeat index | PRI | 24 |
| Shannon entropy | Shannon entropy of a residue | SER | 20 |
| | Shannon entropy of properties | SEP | 25 |
| | Shannon entropy of a protein | SE | 1 |
| Miscellaneous | Amphiphilic pseudo amino acid composition | APAAC | 23 |
| | Pseudo amino acid composition | PAAC | 21 |
| | Composition enhanced transition and distribution | CeTD | 189 |
| | Quasi-sequence order | QSO | 42 |
| | Sequence order coupling number | SOC | 2 |
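As an illustration of the entropy-based descriptors, the single-value "Shannon entropy of a protein" (SE) feature can be sketched as the entropy of the residue frequency distribution of the whole sequence (a minimal interpretation; the paper's exact formulation may differ in detail):

```python
import math
from collections import Counter

def shannon_entropy(sequence: str) -> float:
    """Shannon entropy (in bits) of the residue distribution
    of a sequence -- low for repetitive, high for diverse peptides."""
    counts = Counter(sequence.upper())
    n = len(sequence)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy("AAAA"))      # 0.0 bits: no residue diversity
print(shannon_entropy("ACDEFGHI"))  # 3.0 bits: 8 equally frequent residues
```

The per-residue (SER) and per-property (SEP) variants in the table apply the same formula to position-specific or property-grouped counts.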
Fig. 5Top 10 feature importance plot for benchmark DEEP-AMP30 training and reduced training datasets at 90%, 80%, and 70% sequence identity thresholds
Performance comparison with existing methods on the benchmark test dataset
| Method | Acc | AUROC | AUCPR | Kappa | Sen | Spe | MCC | References |
|---|---|---|---|---|---|---|---|---|
| iAMP-2L | 65.4% | – | – | 0.318 | 82.9% | 47.9% | 0.329 | Xiao et al. |
| iAMPpred | 70.7% | – | – | 0.415 | 80.8% | 60.6% | 0.424 | Meher et al. |
| AmPEP | 68.0% | 0.751 | 0.686 | 0.362 | 93.6% | 42.5% | 0.421 | Bhadra et al. |
| AMP Scanner DNN | 73.4% | 0.806 | 0.777 | 0.468 | 80.8% | 65.9% | 0.473 | Veltri et al. |
| RF-AmPEP30 | 77.1% | 0.854 | 0.868 | 0.543 | 77.6% | 76.6% | 0.542 | Yan et al. |
| Deep-AmPEP30 | 77.1% | 0.853 | 0.853 | 0.543 | 76.6% | 77.7% | 0.543 | Yan et al. |
| Co-AMPpred | 79.7% | | | | | | | This study |
| Co-AMPpred70 | 78.6% | 0.861 | 0.860 | 0.553 | 80.9% | 74.5% | 0.554 | This study |
| Co-AMPpred80 | 76.6% | 0.851 | 0.840 | 0.532 | 78.7% | 74.5% | 0.532 | This study |
| Co-AMPpred90 | 70.2% | 0.843 | 0.860 | 0.404 | | 51.1% | 0.438 | This study |
Values in bold indicate the top performance of the model on the test dataset
Acc., accuracy; AUROC, area under the receiver operating characteristic curve; AUCPR, area under the precision-recall curve; Sen., sensitivity; Spe., specificity; MCC, Matthews correlation coefficient; SD, standard deviation
Benchmark datasets used for the antimicrobial peptide (AMP) prediction
| Dataset | Training AMPs | Training non-AMPs | Test AMPs | Test non-AMPs |
|---|---|---|---|---|
| Benchmark datasets | 1529 | 1529 | 94 | 94 |
| CD-HIT (90%) | 1076 | 1060 | 94 | 94 |
| CD-HIT (80%) | 946 | 1011 | 94 | 94 |
| CD-HIT (70%) | 787 | 910 | 94 | 94 |
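The shrinking training counts above come from removing redundant sequences at decreasing identity thresholds. The paper uses the CD-HIT tool for this; as a much-simplified illustration of the greedy idea (not a substitute for CD-HIT, whose identity measure is alignment-based), a sequence is kept only if it falls below the threshold against every sequence already kept:

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching aligned positions over the shorter
    sequence -- a crude stand-in for alignment-based identity."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n

def greedy_nonredundant(sequences: list[str], threshold: float = 0.9) -> list[str]:
    """CD-HIT-style greedy filtering: process longest-first and keep
    a sequence only if no kept representative is too similar."""
    kept: list[str] = []
    for seq in sorted(sequences, key=len, reverse=True):
        if all(identity(seq, rep) < threshold for rep in kept):
            kept.append(seq)
    return kept

# Hypothetical peptides: the second differs from the first at one position
peps = ["KWKLFKKIGAVLKVL", "KWKLFKKIGAVLKVA", "GIGKFLHSAKKFGKA"]
print(greedy_nonredundant(peps, threshold=0.9))
```

At a 90% threshold the near-duplicate second peptide (14/15 identical positions) is dropped, mirroring how the AMP counts fall from 1529 to 1076 in the table.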
Fig. 6 The systematic architecture of the proposed method, Co-AMPpred, includes collecting the dataset, removing redundant sequences at 90%, 80%, and 70% sequence identity thresholds, feature generation and selection, machine-learning algorithms, and the evaluation process