Pratiti Bhadra, Jielu Yan, Jinyan Li, Simon Fong, Shirley W I Siu.
Abstract
Antimicrobial peptides (AMPs) are promising candidates in the fight against multidrug-resistant pathogens owing to their broad range of activities and low toxicity. Nonetheless, identification of AMPs through wet-lab experiments is still expensive and time-consuming. Here, we propose an accurate computational method for AMP prediction by the random forest algorithm. The prediction model is based on the distribution patterns of amino acid properties along the sequence. Using our collection of large and diverse sets of AMP and non-AMP data (3268 and 166791 sequences, respectively), we evaluated 19 random forest classifiers with different positive:negative data ratios by 10-fold cross-validation. Our optimal model, AmPEP with the 1:3 data ratio, showed high accuracy (96%), a Matthews correlation coefficient (MCC) of 0.9, an area under the receiver operating characteristic curve (AUC-ROC) of 0.99, and a Kappa statistic of 0.9. Descriptor analysis of AMP/non-AMP distributions by means of Pearson correlation coefficients revealed that reduced feature sets (from a full-featured set of 105 to a minimal-feature set of 23) can result in comparable performance in all respects except for some reductions in precision. Furthermore, AmPEP outperformed existing methods in terms of accuracy, MCC, and AUC-ROC when tested on benchmark datasets.
Year: 2018 PMID: 29374199 PMCID: PMC5785966 DOI: 10.1038/s41598-018-19752-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
A comparison of four RF classifiers using different feature sets by 10-fold cross-validation at an AMP/non-AMP data ratio of 1:1. Values shown are the mean and standard deviation (in parentheses).
| Feature set {#} | | | | | | | |
|---|---|---|---|---|---|---|---|
| CTD {147} | 0.979 (0.002) | 0.944 (0.004) | 0.961 (0.002) | 0.924 (0.005) | 0.988 (0.001) | 0.698 (0.023) | 0.923 (0.005) |
| C {21} | 0.958 (0.002) | 0.943 (0.004) | 0.950 (0.002) | 0.901 (0.004) | 0.983 (0.001) | 0.747 (0.018) | 0.901 (0.004) |
| T {21} | 0.959 (0.002) | 0.943 (0.004) | 0.951 (0.002) | 0.901 (0.004) | 0.983 (0.001) | 0.745 (0.018) | 0.901 (0.005) |
| D {105} | 0.978 (0.002) | 0.945 (0.004) | 0.962 (0.002) | 0.924 (0.004) | 0.988 (0.001) | 0.698 (0.024) | 0.923 (0.005) |
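
The abstract reports accuracy, the Matthews correlation coefficient (MCC), and the Kappa statistic. As a rough illustration of how such measures follow from a confusion matrix, the sketch below computes them with the standard textbook formulas. The function name `binary_metrics` and the counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute common binary-classification measures from confusion-matrix counts."""
    n = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n

    # Matthews correlation coefficient
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {
        "Sn": sensitivity,
        "Sp": specificity,
        "Acc": accuracy,
        "MCC": mcc,
        "kappa": kappa,
    }


# Illustrative counts only (assumed, not from the paper):
# 1000 positives and 3000 negatives, i.e. a 1:3 class ratio.
m = binary_metrics(tp=950, fn=50, tn=2895, fp=105)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced classes, accuracy is dominated by the majority class, which is one reason MCC and kappa are usually reported alongside it.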
Figure 1. Pearson correlation coefficients (PCCs) between AMP and non-AMP distributions of the same descriptor in the Mmodel_train dataset.
Figure 2. Performance of RF classifiers during 10-fold cross-validation on datasets with different AMP/non-AMP ratios.
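
One plausible way to build a dataset at a given AMP/non-AMP ratio is to keep every positive and draw the negatives at random without replacement, then split the result into 10 folds. The sketch below shows that idea with the standard library only. The helper names `sample_at_ratio` and `k_fold_indices`, the seeds, and the toy identifiers are assumptions for illustration, not the authors' code.

```python
import random


def sample_at_ratio(positives, negatives, neg_per_pos, seed=0):
    """Keep all positives; draw neg_per_pos * len(positives) negatives without replacement."""
    rng = random.Random(seed)
    n_neg = min(len(negatives), neg_per_pos * len(positives))
    return list(positives), rng.sample(list(negatives), n_neg)


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle the indices 0..n_items-1 and deal them into k disjoint folds."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


# Toy identifiers standing in for real sequences (hypothetical data).
positives = [f"AMP_{i}" for i in range(20)]
negatives = [f"NEG_{i}" for i in range(500)]

pos, neg = sample_at_ratio(positives, negatives, neg_per_pos=3)
folds = k_fold_indices(len(pos) + len(neg), k=10)
print(len(pos), len(neg), [len(f) for f in folds])
```

Each fold serves once as the held-out set while the remaining nine are used for training.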
Performance comparison of classifiers trained with P:N ratios of 1:1 and 1:3 against classifiers trained with the SMOTE data-rebalancing technique (k = 5) applied to initial data ratios of 1:3 and 1:10. Values shown were obtained from 10-fold cross-validation.
| Method | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| AmPEP (1:1) | | 0.945 | 0.962 | | 0.988 | 0.698 | | 0.588 |
| AmPEP (1:3) | 0.950 | 0.965 | 0.962 | 0.900 | 0.989 | 0.830 | 0.899 | |
| AmPEP with SMOTE (1:3) | 0.957 | 0.966 | 0.964 | 0.905 | | 0.817 | 0.905 | 0.663 |
| AmPEP with SMOTE (1:10) | 0.858 | | | 0.835 | | | 0.835 | 0.594 |
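
SMOTE (Synthetic Minority Over-sampling Technique) rebalances a dataset by creating synthetic minority samples that lie between a minority sample and one of its k nearest minority neighbours. The sketch below is a simplified variant that picks the base sample at random for each synthetic point. The function name `smote`, the seed, and the toy points are assumptions, and this is not the implementation used in the study.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric feature vectors (at least two).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbours of x, excluding x itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        z = rng.choice(neighbours)
        # Random point on the line segment between x and the chosen neighbour
        u = rng.random()
        synthetic.append([a + u * (b - a) for a, b in zip(x, z)])
    return synthetic


# Toy 2-D minority class (hypothetical feature vectors).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
new_points = smote(minority, n_synthetic=10, k=5)
print(len(new_points))
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region already covered by the minority class.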
Performance of four RF classifiers involving different feature subsets during 10-fold cross-validation at the AMP/non-AMP data ratio of 1:3. Values shown are the mean and standard deviation (in parentheses). Each experiment uses all AMP data and a set of non-AMP data randomly drawn without replacement from the non-AMPs of the dataset. The best-performing models on a particular measure are highlighted.
| Feature set {#} | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| DF {105} | | | | | | | | |
| DF_PCC<0.7 {80} | 0.961 (0.002) | 0.819 (0.011) | | | | | | |
| DF_PCC<0.6 {43} | 0.961 (0.002) | 0.899 (0.005) | 0.793 (0.008) | 0.633 (0.008) | | | | |
| DF_PCC<0.5 {23} | 0.949 (0.004) | 0.961 (0.002) | 0.898 (0.005) | 0.988 (0.001) | 0.779 (0.011) | 0.898 (0.005) | 0.620 (0.009) | |
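
The reduced feature sets (DF_PCC<0.7, DF_PCC<0.6, DF_PCC<0.5) keep descriptors whose AMP and non-AMP distributions have a Pearson correlation coefficient below a threshold. The sketch below shows one way such a filter could work, assuming each distribution is a normalised 10-bin histogram of descriptor values on a 0 to 100 scale. The binning, the function names, and the toy data are all assumptions; the paper may define the distributions differently.

```python
import math


def histogram(values, bins=10, lo=0.0, hi=100.0):
    """Normalised histogram of values over [lo, hi]; out-of-range values are clamped."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        b = max(0, min(int((v - lo) / width), bins - 1))
        counts[b] += 1
    return [c / len(values) for c in counts]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def select_features(amp, non_amp, threshold):
    """Keep descriptors whose AMP and non-AMP histograms correlate below threshold."""
    kept = []
    for name in amp:
        r = pearson(histogram(amp[name]), histogram(non_amp[name]))
        if r < threshold:
            kept.append(name)
    return kept


# Toy descriptor values (hypothetical): "same" is distributed identically in
# both classes, "diff" is concentrated at opposite ends of the scale.
amp = {"same": [5, 15, 15, 25, 95], "diff": [5, 5, 5, 15]}
non_amp = {"same": [5, 15, 15, 25, 95], "diff": [95, 95, 95, 85]}
print(select_features(amp, non_amp, threshold=0.5))
```

A low correlation means the two classes distribute differently along that descriptor, so the descriptor is more likely to help separate them.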
A comparison of RF classifiers using different descriptors by 10-fold cross-validation at an AMP/non-AMP data ratio of 1:3. Values shown are averages and standard deviations (in parentheses) over 10 repetitions of 10-fold cross-validation.
| Feature set {#} | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| AmPEP {105} | 0.965 (0.002) | 0.830 (0.009) | 0.665 (0.006) | | | | | |
| AmPEP {23} | 0.965 (0.002) | 0.779 (0.011) | 0.620 (0.009) | | | | | |
| AAC {20} | 0.910 (0.002) | 0.956 (0.001) | 0.881 (0.002) | 0.862 (0.002) | 0.881 (0.002) | 0.662 (0.004) | | |
| PAAC {24} | 0.910 (0.002) | 0.970 (0.000) | 0.955 (0.000) | 0.881 (0.001) | 0.985 (0.000) | 0.891 (0.001) | 0.881 (0.001) | 0.681 (0.002) |
| K-mer {400} | 0.898 (0.002) | 0.953 (0.001) | 0.875 (0.002) | 0.985 (0.000) | 0.875 (0.002) | | | |
| Auto Covariance (AC) {6} | 0.613 (0.003) | 0.942 (0.001) | 0.860 (0.001) | 0.604 (0.003) | 0.874 (0.001) | 0.742 (0.003) | 0.597 (0.003) | 0.234 (0.003) |
| Cross Covariance (CC) {12} | 0.661 (0.004) | 0.949 (0.001) | 0.877 (0.001) | 0.657 (0.004) | 0.905 (0.001) | 0.769 (0.002) | 0.651 (0.004) | 0.298 (0.004) |
| Auto-Cross Covariance (ACC) {18} | 0.710 (0.003) | 0.951 (0.001) | 0.891 (0.001) | 0.698 (0.003) | 0.922 (0.001) | 0.825 (0.002) | 0.695 (0.003) | 0.369 (0.004) |
| Parallel Correlation Pseudo Amino Acid Composition (PC-PseAAC) {22} | 0.908 (0.003) | 0.955 (0.001) | 0.881 (0.002) | 0.985 (0.000) | 0.884 (0.005) | 0.881 (0.002) | 0.676 (0.006) | |
| Series Correlation Pseudo Amino Acid Composition (SC-PseAAC) {26} | 0.907 (0.002) | 0.955 (0.001) | 0.880 (0.002) | 0.985 (0.000) | 0.882 (0.004) | 0.880 (0.002) | 0.673 (0.005) | |
| General Parallel Correlation Pseudo Amino Acid Composition (PC-PseAAC-General) {22} | 0.909 (0.002) | 0.970 (0.000) | 0.955 (0.001) | 0.880 (0.002) | 0.985 (0.000) | 0.880 (0.002) | | |
| General Series Correlation Pseudo Amino Acid Composition (SC-PseAAC-General) {26} | 0.908 (0.002) | 0.970 (0.001) | 0.955 (0.001) | 0.879 (0.002) | 0.985 (0.000) | 0.894 (0.003) | 0.879 (0.002) | 0.680 (0.003) |
The best two results in each performance measure are highlighted. AAC: Amino Acid Composition; PAAC: Pseudo Amino Acid Composition. AAC and PseAAC were generated using the propy 1.0 package with its default parameters. The other descriptors (K-mer, AC, CC, ACC, PC-PseAAC, SC-PseAAC, PC-PseAAC-General, and SC-PseAAC-General) were generated by Pse-in-One-1.0.4 using default parameters.
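
To make the two simplest baseline descriptors concrete, the sketch below computes an amino acid composition vector (20 values) and a K-mer composition vector with k = 2 (400 values) for one sequence. It is a plain-Python illustration and not the propy or Pse-in-One code. The function names, the fraction-based normalisation, and the example peptide string are assumptions, and those packages may scale or order their outputs differently.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    seq = seq.upper()
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def kmer_composition(seq, k=2):
    """Fraction of each possible k-mer among all overlapping k-mers in the sequence."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    for i in range(total):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing non-standard residues
            counts[word] += 1
    if total <= 0:
        return [0.0] * len(kmers)
    return [counts[w] / total for w in kmers]


# Example peptide string used only for illustration.
peptide = "GIGKFLHSAKKFGKAFVGEIMNS"
print(len(aac(peptide)), len(kmer_composition(peptide, k=2)))
```

The K-mer vector grows as 20^k, which is why k = 2 already gives 400 features while the composition vector stays at 20.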
A comparison of our AMP prediction method with state-of-the-art methods on AUC-ROC, AUC-PR, MCC, and κ by means of datasets Ctrain and Ctest.
| Method | ML algorithm | Number of features | AUC-ROC | AUC-PR | MCC | κ |
|---|---|---|---|---|---|---|
| iAMPpred# | SVM | 66 | 0.98 | | 0.91 | — |
| iAMP-2L# | FKNN | 46 | 0.95 | — | 0.9 | — |
| AmPEP (DF) | RF | 105 | | 0.957 | | 0.962 |
| AmPEP (DF_PCC < 0.7) | RF | 80 | 0.994 | 0.950 | 0.914 | 0.913 |
| AmPEP (DF_PCC < 0.6) | RF | 43 | 0.994 | 0.934 | 0.919 | 0.918 |
| AmPEP (DF_PCC < 0.5) | RF | 23 | | 0.905 | | 0.923 |
# Results were taken from refs. [5, 6].
A summary of the positive and negative datasets. Values in braces are the numbers of sequences collected in that category.
| Dataset | Model Design: Training (Mmodel_train) | Comparative Study: Benchmark Training (Ctrain) | Comparative Study: Benchmark Testing (Ctest) |
|---|---|---|---|
| Positive | APD3, CAMPR3, LAMP {3268} | Xiao {770} | Xiao {920} |
| Negative | UniProt {166791} | Xiao {2405} | Xiao {920} |
Physicochemical properties and groupings of amino acids.
| Property | Class 1 (C1) | Class 2 (C2) | Class 3 (C3) |
|---|---|---|---|
| Hydrophobicity | Polar: R, K, E, D, Q, N | Neutral: G, A, S, T, P, H, Y | Hydrophobic: C, L, V, I, M, F, W |
| Normalized van der Waals volume | Volume range 0-2.78: G, A, S, T, P, D, C | Volume range 2.95-4.0: N, V, E, Q, I, L | Volume range 4.03-8.08: M, H, K, F, R, Y, W |
| Polarity | Polarity value 4.9-6.2: L, I, F, W, C, M, V, Y | Polarity value 8.0-9.2: P, A, T, G, S | Polarity value 10.4-13: H, Q, R, K, N, E, D |
| Polarizability | Polarizability value 0-0.108: G, A, S, D, T | Polarizability value 0.128-0.186: C, P, N, V, E, Q, I, L | Polarizability value 0.219-0.409: K, M, H, F, R, Y, W |
| Charge | Positive: K, R | Neutral: A, N, C, Q, G, H, I, L, M, F, P, S, T, W, Y, V | Negative: D, E |
| Secondary structure | Helix: E, A, L, M, Q, K, R, H | Strand: V, I, Y, C, W, F, T | Coil: G, N, P, S, D |
| Solvent accessibility | Buried: A, L, F, C, G, I, V, W | Exposed: R, K, Q, E, N, D | Intermediate: M, P, S, T, H, Y |
Figure 3. Illustration of the calculations of DF with a sample antibacterial peptide.
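
As a rough companion to this illustration, the sketch below computes distribution-style features for one sequence. For each of the seven properties and each of its three classes, it records the position (as a percentage of sequence length) of the first residue of that class and of the residues at 25%, 50%, 75%, and 100% of that class's occurrences, which gives 7 x 3 x 5 = 105 values. The groupings follow the table of physicochemical properties, assuming every property partitions the 20 standard amino acids into three disjoint classes. The function name, the feature keys, the rounding rule for picking the marker residue, and the example peptide are assumptions; other implementations may round differently.

```python
import math

# Three disjoint classes per property (each residue appears exactly once per property).
GROUPS = {
    "hydrophobicity": ("RKEDQN", "GASTPHY", "CLVIMFW"),
    "vdw_volume": ("GASTPDC", "NVEQIL", "MHKFRYW"),
    "polarity": ("LIFWCMVY", "PATGS", "HQRKNED"),
    "polarizability": ("GASDT", "CPNVEQIL", "KMHFRYW"),
    "charge": ("KR", "ANCQGHILMFPSTWYV", "DE"),
    "secondary_structure": ("EALMQKRH", "VIYCWFT", "GNPSD"),
    "solvent_accessibility": ("ALFCGIVW", "RKQEND", "MPSTHY"),
}
FRACTIONS = (0.0, 0.25, 0.5, 0.75, 1.0)


def distribution_features(seq):
    """Return a dict of 105 position percentages for one peptide sequence."""
    seq = seq.upper()
    n = len(seq)
    features = {}
    for prop, classes in GROUPS.items():
        for c, members in enumerate(classes, start=1):
            # 1-based positions of residues belonging to this class
            positions = [i + 1 for i, aa in enumerate(seq) if aa in members]
            for frac in FRACTIONS:
                key = f"{prop}.C{c}.{int(frac * 100):03d}"
                if not positions:
                    features[key] = 0.0  # class absent from the sequence
                    continue
                # Assumed rule: occurrence number max(1, floor(frac * count))
                rank = max(1, math.floor(frac * len(positions)))
                features[key] = 100.0 * positions[rank - 1] / n
    return features


# Example peptide string used only for illustration.
feats = distribution_features("GIGKFLHSAKKFGKAFVGEIMNS")
print(len(feats), round(feats["charge.C1.000"], 2), round(feats["charge.C1.100"], 2))
```

Because the values are positions relative to sequence length, peptides of different lengths map to feature vectors of the same size and scale.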