Jielu Yan, Pratiti Bhadra, Ang Li, Pooja Sethiya, Longguang Qin, Hio Kuan Tai, Koon Ho Wong, Shirley W I Siu.
Abstract
Antimicrobial peptides (AMPs) are a valuable source of antimicrobial agents and a potential solution to the multi-drug resistance problem. In particular, short-length AMPs have been shown to have enhanced antimicrobial activities, higher stability, and lower toxicity to human cells. We present a short-length (≤30 aa) AMP prediction method, Deep-AmPEP30, developed based on an optimal feature set of PseKRAAC reduced amino acid composition and a convolutional neural network. On a balanced benchmark dataset of 188 samples, Deep-AmPEP30 yields an improved performance of 77% in accuracy, 85% in the area under the receiver operating characteristic curve (AUC-ROC), and 85% in the area under the precision-recall curve (AUC-PR) over existing machine learning-based methods. To demonstrate its power, we screened the genome sequence of Candida glabrata, a gut commensal fungus expected to interact with and/or inhibit other microbes in the gut, for potential AMPs and identified a peptide of 20 aa (P3, FWELWKFLKSLWSIFPRRRP) with strong anti-bacterial activity against Bacillus subtilis and Vibrio parahaemolyticus. The potency of the peptide is remarkably comparable to that of ampicillin. Therefore, Deep-AmPEP30 is a promising prediction tool to identify short-length AMPs from genomic sequences for drug discovery. Our method is available at https://cbbio.cis.um.edu.mo/AxPEP for both individual sequence prediction and genome screening for AMPs.
Keywords: AmPEP; AxPEP; Candida glabrata; ampicillin; antimicrobial peptide; convolutional neural network; drug discovery; machine learning; reduced amino acid composition
Year: 2020 PMID: 32464552 PMCID: PMC7256447 DOI: 10.1016/j.omtn.2020.05.006
Source DB: PubMed Journal: Mol Ther Nucleic Acids ISSN: 2162-2531 Impact factor: 8.886
Five-Best Reduced Amino Acid Types and Clusters Selected from PseKRAAC for AMP Prediction
| Type | Description | No. of Clusters | Reduced Amino Acid Alphabets |
|---|---|---|---|
| 3A | based on PAM (point accepted mutation) matrix | 19 | {(FA),(P),(G),(S),(T),(D),(E),(Q),(N),(K),(R),(H),(W),(Y),(M),(L),(I),(V),(C)} |
| 7 | based on inter-residue contact energies using the Miyazawa-Jernigan matrix | 15 | {(C),(K),(R),(W),(Y),(A),(FILV),(M),(D),(E),(Q),(H),(TP),(GS),(N)} |
| 8 | based on properties of JTT (Jones-Taylor-Thornton) rate matrices | 17 | {(AT),(C),(DE),(F),(G),(H),(IV),(K),(L),(M),(N),(P),(Q),(R),(S),(V),(W)} |
| 12 | based on the substitution scores using database of aligned protein structures | 17 | {(TVLI),(M),(F),(W),(Y),(C),(A),(H),(G),(N),(Q),(P),(R),(K),(S),(T),(DE)} |
| 12 | based on the substitution scores using database of aligned protein structures | 18 | {(TVLI),(M),(F),(W),(Y),(C),(A),(H),(G),(N),(Q),(P),(R),(K),(S),(T),(D),(E)} |
Type 12 in the PseKRAAC web server corresponds to type 11 in Zuo et al.
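To illustrate how such a reduced-alphabet feature is built, the sketch below maps a peptide through the type 7 alphabet listed above and returns the 15 normalized cluster frequencies. This is a minimal reimplementation for illustration only, not PseKRAAC's own code (which offers further options such as gapped and lambda-correlated encodings); function names are ours.

```python
# Illustrative sketch (not PseKRAAC's own code): reduced amino acid
# composition using the type 7 alphabet (15 clusters) from the table.
TYPE7_CLUSTERS = ["C", "K", "R", "W", "Y", "A", "FILV", "M",
                  "D", "E", "Q", "H", "TP", "GS", "N"]

# Residue -> cluster-index lookup built from the alphabet.
RESIDUE_TO_CLUSTER = {aa: i
                      for i, cluster in enumerate(TYPE7_CLUSTERS)
                      for aa in cluster}

def raac_composition(seq):
    """Normalized 15-element cluster-frequency vector for a peptide."""
    counts = [0] * len(TYPE7_CLUSTERS)
    for aa in seq.upper():
        counts[RESIDUE_TO_CLUSTER[aa]] += 1
    return [c / len(seq) for c in counts]

# Peptide P3 from the C. glabrata genome screening below:
vec = raac_composition("FWELWKFLKSLWSIFPRRRP")
```

Collapsing F, I, L, and V into one cluster is what lets a 20-residue peptide yield a compact, hydrophobicity-aware composition vector.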
Comparison of CNN Classifiers of Different Feature Sets by 10 Times 10-Fold Cross-Validation
| Feature Set {#} | ACC | AUC-ROC | AUC-PR | Kappa | Sn | Sp | MCC |
|---|---|---|---|---|---|---|---|
| T {21} | 71.22 ± 0.51 | 77.41 ± 0.22 | 73.97 ± 0.53 | 42.43 ± 1.01 | 78.22 ± 1.1 | 64.21 ± 0.91 | 42.86 ± 1.06 |
| C {21} | 72.65 ± 0.35 | 78.33 ± 0.12 | 75.54 ± 0.32 | 45.30 ± 0.69 | 77.85 ± 1.66 | 67.45 ± 1.29 | 45.57 ± 0.78 |
| CTD {147} | 73.71 ± 0.34 | 79.96 ± 0.21 | 76.61 ± 0.48 | 47.41 ± 0.67 | 79.05 ± 1.93 | 68.36 ± 1.51 | 47.71 ± 0.79 |
| D {105} | 73.74 ± 0.23 | 79.92 ± 0.17 | 76.73 ± 0.30 | 47.48 ± 0.46 | 79.33 ± 1.2 | 68.14 ± 1.01 | 47.79 ± 0.52 |
| AAC {20} | 74.27 ± 0.26 | 80.48 ± 0.19 | 77.52 ± 0.31 | 48.55 ± 0.51 | 80.92 ± 0.85 | 67.63 ± 1.05 | 48.99 ± 0.48 |
| SC-PseAAC {32} | 75.62 ± 0.27 | 82.07 ± 0.19 | 79.04 ± 0.37 | 51.24 ± 0.54 | 82.33 ± 0.78 | 68.91 ± 1.18 | 51.72 ± 0.45 |
| Five-best PseKRAAC {86} | 76.50 ± 0.37 | 82.48 ± 0.20 | 79.55 ± 0.5 | 53.00 ± 0.74 | 83.35 ± 0.86 | 69.65 ± 0.67 | 53.51 ± 0.78 |
Values shown are mean ± SD (values were multiplied by 100).
Parameters used for SC-PseAAC (series-correlation pseudo amino acid composition, commonly known as type-2 PseAAC) are λ = 4 and w = 0.2.
Figure 1. Size Effect of the Training Dataset on Model Performance
Comparison of CNN, RF, and SVM Using Five-Best PseKRAAC Features by 10 Times 10-Fold Cross-Validation
| Algorithm | ACC | AUC-ROC | AUC-PR | Kappa | Sn | Sp | MCC |
|---|---|---|---|---|---|---|---|
| CNN | 76.50 ± 0.37 | 82.48 ± 0.20 | 79.55 ± 0.5 | 53.00 ± 0.74 | 83.35 ± 0.86 | 69.65 ± 0.67 | 53.51 ± 0.78 |
| RF | 75.42 ± 0.23 | 80.58 ± 0.09 | 74.85 ± 0.21 | 50.84 ± 0.46 | 81.55 ± 0.42 | 69.28 ± 0.22 | 51.23 ± 0.48 |
| SVMlinear | 72.37 ± 0.24 | 77.90 ± 0.07 | 76.08 ± 0.14 | 44.75 ± 0.49 | 68.36 ± 0.29 | 76.39 ± 0.42 | 44.95 ± 0.50 |
| SVMradial | 56.90 | 38.95 | 47.80 | 13.75 | 74.39 | 39.36 | 23.56 |
Parameters used were as follows: RF, mtry = 1, ntree = 1,200; SVMlinear, cost = 1; SVMradial, gamma= 0.008569952, cost = 0.25. Values shown are mean ± SD (values were multiplied by 100). In the text, the model of CNN is referred to as Deep-AmPEP30 and RF as RF-AmPEP30.
For SVMradial, only a single 10-fold cross-validation was performed, so no SD is reported.
Comparison of Our Prediction Models with Existing Methods Using the Benchmark Dataset
| Method | ACC | AUC-ROC | AUC-PR | Kappa | Sn | Sp | MCC | Reference |
|---|---|---|---|---|---|---|---|---|
| iAMP-2L | 65.43 | – | – | 31.85 | 82.98 | 47.87 | 32.95 | Xiao et al. |
| iAMPpred | 70.74 | – | – | 41.49 | 80.85 | 60.64 | 42.36 | Meher et al. |
| AmPEP | 68.09 | 75.14 | 68.63 | 36.17 | 93.62 | 42.55 | 42.07 | Bhadra et al. |
| AMP Scanner DNN | 73.40 | 80.66 | 77.78 | 46.81 | 80.85 | 65.96 | 47.34 | Veltri et al. |
| RF-AmPEP30 | 77.12 | 85.46 | 86.83 | 54.25 | 77.65 | 76.59 | 54.25 | This study |
| Deep-AmPEP30 | 77.13 | 85.31 | 85.36 | 54.26 | 76.60 | 77.66 | 54.26 | This study |
All values were multiplied by 100.
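Because the benchmark dataset is balanced (188 samples, assumed here to be 94 AMPs and 94 non-AMPs), the tabulated metrics can be cross-checked from a single confusion matrix. A minimal sketch, with counts (TP = 72, FN = 22, TN = 73, FP = 21) chosen so that Sn and Sp match the Deep-AmPEP30 row:

```python
import math

def metrics_from_counts(tp, fn, tn, fp):
    """ACC, Cohen's kappa, Sn, Sp, and MCC from a binary confusion matrix."""
    n = tp + fn + tn + fp
    acc = (tp + tn) / n
    sn = tp / (tp + fn)                      # sensitivity (recall on AMPs)
    sp = tn / (tn + fp)                      # specificity
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Cohen's kappa: observed accuracy corrected for chance agreement.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (acc - pe) / (1 - pe)
    return acc, kappa, sn, sp, mcc

# Counts consistent with the Deep-AmPEP30 row (94 AMPs, 94 non-AMPs):
acc, kappa, sn, sp, mcc = metrics_from_counts(tp=72, fn=22, tn=73, fp=21)
```

With these counts the sketch reproduces ACC = 77.13, Kappa = 54.26, and MCC = 54.26 after multiplying by 100, matching the table.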
Figure 2. Performance of AMP Classifiers
(A) Receiver operating characteristic curves of different AMP classifiers and (B) their runtime performance on the benchmark dataset.
The Three Selected C. glabrata Genome Sequences for Experimental Validation and Their Predicted Ability to Cross Lipid Bilayer by CPPpred
| ID | Sequence | Net Charge | Length | Deep-AmPEP30 | RF-AmPEP30 | ΔG | Log Pcalc1 | Log Pcalc2 |
|---|---|---|---|---|---|---|---|---|
| P3 | FWELWKFLKSLWSIFPRRRP | +4 | 20 | 0.999090 | 0.785833 | −14.3 | −19.3 | −3.9 |
| P10 | ICTTLNWMVKLTCLTHVTLTTRWC | +2 | 24 | 0.998451 | 0.627500 | −12.8 | −5.5 | 3.1 |
| P26 | RWPPTTTLCYLSRPRRCSWTSSVCRCTLT | +7 | 29 | 0.999972 | 0.709167 | −9.2 | −31.7 | −13.4 |
Log Pcalc: membrane permeability coefficient predicted using the dragging optimization method (log Pcalc1) and the global optimization method (log Pcalc2).
ΔG: water-to-membrane transfer free energy (by CPPpred).
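The net-charge column can be approximated with a simple neutral-pH count: +1 for each Lys/Arg and -1 for each Asp/Glu, with His and the peptide termini ignored. This is a toy estimate (a pKa-based calculation may differ), but it reproduces the tabulated charges for P3 and P10:

```python
POSITIVE = set("KR")   # counted as +1 at neutral pH in this simple model
NEGATIVE = set("DE")   # counted as -1 at neutral pH

def net_charge(seq):
    """Crude net charge: #(K, R) minus #(D, E); His and termini ignored."""
    return sum((aa in POSITIVE) - (aa in NEGATIVE) for aa in seq.upper())

p3 = net_charge("FWELWKFLKSLWSIFPRRRP")       # 2 Lys + 3 Arg - 1 Glu = +4
p10 = net_charge("ICTTLNWMVKLTCLTHVTLTTRWC")  # 1 Lys + 1 Arg = +2
```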
Figure 3. Anti-Bacterial Effect of Three Top-Ranked Predicted AMPs against Four Different Bacterial Species
Growth assay of Bacillus subtilis, Vibrio parahaemolyticus, Pseudomonas aeruginosa, and Escherichia coli in the absence (H2O) or presence of P3, P10, P26, and a control peptide (Pcontrol) that is known to have no anti-bacterial effect. Ampicillin was used as a positive control. Growth of bacteria was measured by absorbance at OD600 over time. The average of three independent experiments is presented. Treatment showing an inhibitory effect against the assayed bacteria is highlighted by a red box. A pink box indicates a subtle but significant (e.g., consistent in all three biological repeats) effect.
Figure 4. The Architecture of Our CNN-Based Classifier for Short AMP Prediction
The model accepts a feature vector of N elements as input. The input is first normalized by batch normalization (batch size of 64) and then transformed into convolutional features by two convolutional layers, each followed by a maximum pooling layer. Each convolutional layer applies 128 kernels of size 3 × 1 with stride 1, while each maximum pooling layer pools data with a kernel size of 2 × 1 and stride 2. A dropout rate of 20% is applied at each maximum pooling step to prevent overfitting. Finally, the convolutional features are flattened and fed into a fully connected neural network with 10 hidden nodes and 1 output node. The rectified linear unit (ReLU) is the activation function in the convolutional layers and the hidden nodes, while the output node uses the sigmoid function.
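As a sanity check on the description above, the sketch below tracks the feature-map length through the two conv/pool stages for N = 86 (the five-best PseKRAAC feature set). The convolution padding of 1 ('same' padding for kernel 3, so only the pooling layers shrink the signal) is our assumption; the paper does not state it.

```python
# Output-length bookkeeping for the CNN described above. pad=1 ('same'
# padding for a kernel of 3) is an assumption, not stated in the source.

def conv1d_out(n, kernel=3, stride=1, pad=1):
    """Feature-map length after a 1D convolution."""
    return (n + 2 * pad - kernel) // stride + 1

def pool1d_out(n, kernel=2, stride=2):
    """Feature-map length after 1D max pooling."""
    return (n - kernel) // stride + 1

def flattened_size(n_features, channels=128):
    n = n_features
    for _ in range(2):        # two conv + max-pool stages
        n = conv1d_out(n)     # conv preserves length with pad=1
        n = pool1d_out(n)     # pooling roughly halves the length
    return n * channels       # input size of the fully connected part

size = flattened_size(86)     # 86 -> 43 -> 21 positions, 128 channels each
```

Under this padding assumption, the flatten layer feeds 21 × 128 = 2,688 values into the 10-node hidden layer.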