| Literature DB >> 28526820 |
Pu Wang1,2,3, Ruiquan Ge1,2,4, Liming Liu1,5, Xuan Xiao3, Ye Li6, Yunpeng Cai7.
Abstract
Antimicrobial peptides (AMPs) are peptide antibiotics with a broad spectrum of antimicrobial activities. Predicting the activities of AMPs from their amino acid sequences is of great therapeutic importance, but the interactions among activity labels make the prediction challenging. In this paper we propose a novel multi-label learning model to address this problem. A weighted K-nearest neighbor classifier is adopted for efficient representation learning of the sequence data. A multiple linear regression model is then employed to learn a mapping from the classifier score vectors to the target labels, with label correlations taken into account. Several popular multi-label learning algorithms and feature extraction methods were tested on a comprehensive, up-to-date AMP dataset covering twelve biological activities, and on its filtered version covering five activities. The experimental results show that the proposed method performs competitively with previous work and can serve as a powerful engine for activity prediction of AMPs.
Year: 2017 PMID: 28526820 PMCID: PMC5438384 DOI: 10.1038/s41598-017-01986-9
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
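The abstract describes a two-stage pipeline: a weighted K-nearest neighbor classifier first turns each peptide's feature vector into a vector of per-label scores, and a multiple linear regression model then maps those score vectors to the final labels. The first stage can be sketched as below; the Euclidean distance, the inverse-distance weighting, and all names are illustrative assumptions, not the paper's exact formulation.

```python
import math

def knn_score_vector(x, train_X, train_Y, k=3):
    """Stage 1: represent a query feature vector x as a per-label
    score vector via weighted k-NN over the training set.
    Inverse-distance weighting is an illustrative assumption."""
    neighbors = sorted(
        ((math.dist(x, xi), yi) for xi, yi in zip(train_X, train_Y)),
        key=lambda t: t[0],
    )[:k]
    n_labels = len(train_Y[0])
    scores = [0.0] * n_labels
    total = 0.0
    for d, y in neighbors:
        w = 1.0 / (d + 1e-9)  # closer neighbors vote more strongly
        total += w
        for j in range(n_labels):
            scores[j] += w * y[j]
    return [s / total for s in scores]
```

The second stage would fit a multiple linear regression from these score vectors to the binary label vectors, which is where the label correlations enter; it is omitted here.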
Number of sequences for different activities.
| No. | Activity | Count |
|---|---|---|
| 1 | Antibacterial peptides (antibiofilms) | 2255 |
| 2 | Antiviral peptides (anti-HIV) | 177 |
| 3 | Antifungal peptides | 988 |
| 4 | Antiparasitic peptides (antimalarial) | 84 |
| 5 | Anticancer peptides | 195 |
| 6 | Anti-protist peptides | 4 |
| 7 | Insecticidal peptides | 28 |
| 8 | Spermicidal peptides | 12 |
| 9 | Chemotactic peptides | 56 |
| 10 | Wound healing peptides | 15 |
| 11 | Antioxidant peptides | 19 |
| 12 | Protease inhibitors | 22 |
Number and percentage of AMPs with different number of activities.
| Number of activities | Number of AMPs | Percentage (%) |
|---|---|---|
| 1 | 1449 | 57.94 |
| 2 | 829 | 33.15 |
| 3 | 172 | 6.88 |
| 4 | 34 | 1.36 |
| 5 | 12 | 0.48 |
| 6 | 2 | 0.08 |
| 7 | 1 | 0.04 |
| 8 | 1 | 0.04 |
| 9 | 1 | 0.04 |
| 10 | 0 | 0 |
| 11 | 0 | 0 |
| 12 | 0 | 0 |
| In total | 2501 | 100 |
Figure 1. Sequence length distribution in the APD.
Number of sequences for different activities in the filtered dataset.
| No. | Activity | Count |
|---|---|---|
| 1 | Antibacterial peptides (antibiofilms) | 2006 |
| 2 | Antiviral peptides (anti-HIV) | 155 |
| 3 | Antifungal peptides | 903 |
| 4 | Antiparasitic peptides (antimalarial) | 70 |
| 5 | Anticancer peptides | 178 |
Figure 2. Averaged amino acid composition of AMPs with different activities in the filtered dataset.
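The composition analysis behind Figure 2 rests on the amino acid composition (AAC) representation: the fraction of each of the 20 standard residues in a sequence. A minimal sketch of computing an AAC feature vector (function name illustrative):

```python
# The 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(seq):
    """Return the 20-dimensional AAC vector: the fraction of each
    standard residue in the peptide sequence."""
    seq = seq.upper()
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]
```

Averaging these vectors over all AMPs that carry a given activity label yields the per-activity profiles shown in the figure.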
Figure 3. Structure diagram of the proposed multi-label learning method.
Figure 4. Metric values for different combinations of hyperparameters in a hold-out test on the original dataset. The two horizontal axes represent the values of the two hyperparameters, and the vertical axis represents the value of the metric. For clarity, large metric values are colored red and small values blue.
Metric values of different multi-label learning methods from 5-fold cross-validation (5-CV) on the original dataset.
| Metric \ Method | Proposed | MLkNN | BPMLL | IBLR | RAkEL | CC | ECC |
|---|---|---|---|---|---|---|---|
| Hamming Loss ↓ | 0.0454 ± 0.0004 | 0.0528 ± 0.0006 | 0.2977 ± 0.0227 | 0.0523 ± 0.0007 | 0.0540 ± 0.0007 | 0.0600 ± 0.0015 | 0.0502 ± 0.0008 |
| Subset Accuracy ↑ | 0.5988 ± 0.0049 | 0.5450 ± 0.0040 | 0.0014 ± 0.0010 | 0.5494 ± 0.0048 | 0.5258 ± 0.0069 | 0.4992 ± 0.0082 | 0.5662 ± 0.0082 |
| Average Precision ↑ | 0.9439 ± 0.0011 | 0.9326 ± 0.0016 | 0.6691 ± 0.0680 | 0.9326 ± 0.0013 | 0.8853 ± 0.0023 | 0.8474 ± 0.0044 | 0.9210 ± 0.0020 |
| Coverage ↓ | 0.9337 ± 0.0104 | 0.9859 ± 0.0105 | 2.0595 ± 0.1971 | 0.9980 ± 0.0084 | 1.8996 ± 0.0332 | 2.0774 ± 0.0698 | 1.3728 ± 0.0207 |
| One Error ↓ | 0.0607 ± 0.0018 | 0.0768 ± 0.0026 | 0.4820 ± 0.1523 | 0.0752 ± 0.0025 | 0.1028 ± 0.0029 | 0.1711 ± 0.0070 | 0.0756 ± 0.0030 |
| Ranking Loss ↓ | 0.0234 ± 0.0005 | 0.0269 ± 0.0006 | 0.1120 ± 0.0197 | 0.0275 ± 0.0005 | 0.0809 ± 0.0026 | 0.0947 ± 0.0045 | 0.0473 ± 0.0014 |
| Fmicro ↑ | 0.8082 ± 0.0015 | 0.7679 ± 0.0022 | 0.4437 ± 0.0190 | 0.7758 ± 0.0025 | 0.7828 ± 0.0030 | 0.7574 ± 0.0048 | 0.7896 ± 0.0035 |
↓ means lower is better; ↑ means higher is better.
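Two of the example-based metrics in the table, Hamming loss and subset accuracy, can be computed directly from binary label matrices. A minimal sketch, assuming labels are encoded as 0/1 lists (function names illustrative):

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of individual label assignments that disagree."""
    n, q = len(Y_true), len(Y_true[0])
    wrong = sum(
        yt != yp
        for row_t, row_p in zip(Y_true, Y_pred)
        for yt, yp in zip(row_t, row_p)
    )
    return wrong / (n * q)

def subset_accuracy(Y_true, Y_pred):
    """Fraction of instances whose full label set is predicted exactly."""
    return sum(rt == rp for rt, rp in zip(Y_true, Y_pred)) / len(Y_true)
```

Hamming loss credits partially correct label sets, while subset accuracy is all-or-nothing, which is why the two columns can rank methods differently.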
Comparison triplets CT(A, B) = (win/tie/loss) by paired t-test between each pair of methods on the original dataset.
| A \ B | Proposed | MLkNN | BPMLL | IBLR | RAkEL | CC | ECC | In total |
|---|---|---|---|---|---|---|---|---|
| Proposed | — | 7/0/0 | 7/0/0 | 7/0/0 | 7/0/0 | 7/0/0 | 7/0/0 | 42/0/0 |
| MLkNN | 0/0/7 | — | 7/0/0 | 2/4/1 | 3/1/3 | 7/0/0 | 0/1/6 | 19/6/17 |
| BPMLL | 0/0/7 | 0/0/7 | — | 0/0/7 | 0/0/7 | 0/1/6 | 0/0/7 | 0/1/41 |
| IBLR | 0/0/7 | 1/4/2 | 7/0/0 | — | 3/1/3 | 7/0/0 | 0/0/7 | 18/5/19 |
| RAkEL | 0/0/7 | 3/1/3 | 7/0/0 | 3/1/3 | — | 7/0/0 | 0/1/6 | 20/3/19 |
| CC | 0/0/7 | 0/0/7 | 6/1/0 | 0/0/7 | 0/0/7 | — | 0/0/7 | 6/1/35 |
| ECC | 0/0/7 | 6/1/0 | 7/0/0 | 7/0/0 | 6/1/0 | 7/0/0 | — | 33/2/7 |
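Each cell above aggregates, over the seven metrics, the outcome of a paired t-test on per-fold scores of method A versus method B. A minimal sketch of one such comparison for 5-fold CV; the two-tailed 5% significance level (critical t = 2.776 at df = 4) is an assumption, since the paper's exact test settings are not reproduced here:

```python
import math

T_CRIT_DF4 = 2.776  # two-tailed 5% critical value, df = 4 (5 folds)

def compare(scores_a, scores_b, higher_better=True):
    """Paired t-test on per-fold metric values; returns 'win', 'tie',
    or 'loss' for method A against method B on this metric."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:  # constant difference across folds
        if mean == 0:
            return "tie"
        return "win" if (mean > 0) == higher_better else "loss"
    t = mean / math.sqrt(var / n)
    if abs(t) < T_CRIT_DF4:
        return "tie"
    return "win" if (t > 0) == higher_better else "loss"
```

For lower-is-better metrics such as Hamming loss, `compare(a, b, higher_better=False)` would be used; summing the seven per-metric outcomes yields one win/tie/loss triplet per cell.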
Metric values of different multi-label learning methods through 5-CV on the filtered dataset.
| Metric \ Method | Proposed | MLkNN | BPMLL | IBLR | RAkEL | CC | ECC |
|---|---|---|---|---|---|---|---|
| Hamming Loss ↓ | 0.0992 ± 0.0014 | 0.1083 ± 0.0009 | 0.6366 ± 0.0214 | 0.1073 ± 0.0007 | 0.1139 ± 0.0023 | 0.1258 ± 0.0025 | 0.1055 ± 0.0007 |
| Subset Accuracy ↑ | 0.6141 ± 0.0056 | 0.5874 ± 0.0033 | 0.0022 ± 0.0008 | 0.5901 ± 0.0040 | 0.5594 ± 0.0065 | 0.5280 ± 0.0108 | 0.5928 ± 0.0035 |
| Average Precision ↑ | 0.9553 ± 0.0010 | 0.9501 ± 0.0008 | 0.4018 ± 0.0416 | 0.9506 ± 0.0011 | 0.9289 ± 0.0022 | 0.8821 ± 0.0049 | 0.9505 ± 0.0009 |
| Coverage ↓ | 0.6899 ± 0.0054 | 0.7050 ± 0.0038 | 2.7546 ± 0.2624 | 0.7006 ± 0.0034 | 0.8208 ± 0.0108 | 1.0545 ± 0.0253 | 0.7032 ± 0.0051 |
| One Error ↓ | 0.0565 ± 0.0021 | 0.0669 ± 0.0012 | 0.9224 ± 0.0562 | 0.0670 ± 0.0020 | 0.0888 ± 0.0037 | 0.1517 ± 0.0085 | 0.0661 ± 0.0019 |
| Ranking Loss ↓ | 0.0444 ± 0.0010 | 0.0481 ± 0.0006 | 0.6203 ± 0.0829 | 0.0471 ± 0.0008 | 0.0714 ± 0.0025 | 0.1223 ± 0.0053 | 0.0473 ± 0.0010 |
| Fmicro ↑ | 0.8226 ± 0.0026 | 0.8011 ± 0.0020 | 0.4509 ± 0.0135 | 0.8050 ± 0.0014 | 0.8064 ± 0.0037 | 0.7834 ± 0.0038 | 0.8131 ± 0.0011 |
↓ means lower is better; ↑ means higher is better.
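The ranking-based metrics in these tables (one error, coverage, ranking loss) are computed from real-valued label scores rather than binary predictions. A hedged sketch under common definitions (0-based coverage; ties counted as misorderings in ranking loss; the paper's exact conventions may differ):

```python
def one_error(Y_true, scores):
    """Fraction of instances whose top-ranked label is not relevant."""
    miss = 0
    for y, s in zip(Y_true, scores):
        top = max(range(len(s)), key=lambda j: s[j])
        miss += (y[top] == 0)
    return miss / len(Y_true)

def coverage(Y_true, scores):
    """Average depth in the ranking needed to cover all relevant labels."""
    total = 0
    for y, s in zip(Y_true, scores):
        order = sorted(range(len(s)), key=lambda j: -s[j])
        ranks = {j: r for r, j in enumerate(order)}  # 0-based rank
        rel_ranks = [ranks[j] for j in range(len(y)) if y[j] == 1]
        total += max(rel_ranks) if rel_ranks else 0
    return total / len(Y_true)

def ranking_loss(Y_true, scores):
    """Average fraction of (relevant, irrelevant) label pairs ordered wrongly."""
    total = 0.0
    for y, s in zip(Y_true, scores):
        rel = [j for j, v in enumerate(y) if v == 1]
        irr = [j for j, v in enumerate(y) if v == 0]
        if not rel or not irr:
            continue
        bad = sum(s[r] <= s[i] for r in rel for i in irr)
        total += bad / (len(rel) * len(irr))
    return total / len(Y_true)
```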
Comparison triplets CT(A, B) = (win/tie/loss) by paired t-test between each pair of methods on the filtered dataset.
| A \ B | Proposed | MLkNN | BPMLL | IBLR | RAkEL | CC | ECC | In total |
|---|---|---|---|---|---|---|---|---|
| Proposed | — | 7/0/0 | 7/0/0 | 7/0/0 | 7/0/0 | 7/0/0 | 7/0/0 | 42/0/0 |
| MLkNN | 0/0/7 | — | 7/0/0 | 0/3/4 | 6/0/1 | 7/0/0 | 0/4/3 | 20/7/15 |
| BPMLL | 0/0/7 | 0/0/7 | — | 0/0/7 | 0/0/7 | 0/0/7 | 0/0/7 | 0/0/42 |
| IBLR | 0/0/7 | 4/3/0 | 7/0/0 | — | 6/1/0 | 7/0/0 | 0/5/2 | 24/9/9 |
| RAkEL | 0/0/7 | 1/0/6 | 7/0/0 | 0/1/6 | — | 7/0/0 | 0/0/7 | 15/1/26 |
| CC | 0/0/7 | 0/0/7 | 7/0/0 | 0/0/7 | 0/0/7 | — | 0/0/7 | 7/0/35 |
| ECC | 0/0/7 | 3/4/0 | 7/0/0 | 2/5/0 | 7/0/0 | 7/0/0 | — | 26/9/7 |
Means and standard deviations of 5-CV results for the proposed method and iAMP-2L on the original dataset.
| Metric \ Method | Proposed^a | Proposed^b | iAMP-2L^a | iAMP-2L^b |
|---|---|---|---|---|
| Hamming Loss ↓ | 0.0483 ± 0.0005 | 0.0580 ± 0.0003 | 0.0581 ± 0.0007 | |
| Subset Accuracy ↑ | 0.5733 ± 0.0040 | 0.4880 ± 0.0041 | 0.4848 ± 0.0043 | |
| Average Precision ↑ | 0.9383 ± 0.0012 | 0.9361 ± 0.0010 | 0.9353 ± 0.0015 | |
| Coverage ↓ | 0.9816 ± 0.0096 | 1.1006 ± 0.0116 | 1.1121 ± 0.0161 | |
| One Error ↓ | 0.0689 ± 0.0018 | 0.0658 ± 0.0013 | 0.0682 ± 0.0025 | |
| Ranking Loss ↓ | 0.0259 ± 0.0005 | 0.0385 ± 0.0006 | 0.0400 ± 0.0010 | |
| Fmicro ↑ | 0.7955 ± 0.0021 | 0.7852 ± 0.0010 | 0.7851 ± 0.0026 | |
The superscript a indicates the feature extraction method used in this work; b indicates PseAAC. ↓ means lower is better; ↑ means higher is better.
Means and standard deviations of 5-CV results for the proposed method and iAMP-2L on the filtered dataset.
| Metric \ Method | Proposed^a | Proposed^b | iAMP-2L^a | iAMP-2L^b |
|---|---|---|---|---|
| Hamming Loss ↓ | 0.1018 ± 0.0012 | 0.1221 ± 0.0020 | 0.1212 ± 0.0023 | |
| Subset Accuracy ↑ | 0.6033 ± 0.0041 | 0.5149 ± 0.0063 | 0.5228 ± 0.0078 | |
| Average Precision ↑ | 0.9534 ± 0.0010 | 0.9526 ± 0.0014 | 0.9527 ± 0.0016 | |
| Coverage ↓ | 0.6946 ± 0.0034 | 0.6911 ± 0.0055 | 0.6953 ± 0.0064 | |
| One Error ↓ | 0.0615 ± 0.0019 | 0.0652 ± 0.0022 | 0.0638 ± 0.0028 | |
| Ranking Loss ↓ | 0.0457 ± 0.0007 | 0.0494 ± 0.0014 | 0.0498 ± 0.0015 | |
| Fmicro ↑ | 0.8176 ± 0.0023 | 0.8083 ± 0.0028 | 0.8091 ± 0.0033 | |
The superscript a indicates the feature extraction method used in this work; b indicates PseAAC. ↓ means lower is better; ↑ means higher is better.