Dilraj Kaur, Sumeet Patiyal, Chakit Arora, Ritesh Singh, Gaurav Lodhi, Gajendra P S Raghava.
Abstract
Defensins are host defense peptides present in nearly all living species, which play a crucial role in innate immunity. These peptides provide protection to the host, either by killing microbes directly or indirectly by activating the immune system. In the era of antibiotic resistance, there is a need to develop a fast and accurate method for predicting defensins. In this study, a systematic attempt has been made to develop models for predicting defensins from available information on defensins. We created a dataset of defensins and non-defensins called the main dataset that contains 1,036 defensins and 1,035 AMPs (antimicrobial peptides, or non-defensins) to understand the difference between defensins and AMPs. Our analysis indicates that certain residues like Cys, Arg, and Tyr are more abundant in defensins in comparison to AMPs. We developed machine learning technique-based models on the main dataset using a wide range of peptide features. Our SVM (support vector machine)-based model discriminates defensins and AMPs with MCC of 0.88 and AUC of 0.98 on the validation set of the main dataset. In addition, we created an alternate dataset that consists of 1,036 defensins and 1,054 non-defensins obtained from Swiss-Prot. Models were also developed on the alternate dataset to predict defensins. Our SVM-based model achieved maximum MCC of 0.96 with AUC of 0.99 on the validation set of the alternate dataset. All models were trained, tested, and validated using standard protocols. Finally, we developed a web-based service "DefPred" to predict defensins, scan defensins in proteins, and design the best defensins from their analogs. The stand-alone software and web server of DefPred are available at https://webs.iiitd.edu.in/raghava/defpred.
Keywords: AMPs; computer aided; defensins; innate immunity; machine learning
Year: 2021 PMID: 34880873 PMCID: PMC8645896 DOI: 10.3389/fimmu.2021.780610
Source DB: PubMed Journal: Front Immunol ISSN: 1664-3224 Impact factor: 7.561
Figure 1. A schematic diagram for the role of defensins in the host immune system.
Figure 2. A brief workflow of the study.
Figure 3. The average amino acid compositional analysis among defensins, AMPs, and non-defensins.
Figure 4. Two-sample logos generated from the (A) C-terminus (last 10 residues) of the main dataset, (B) N-terminus (first 10 residues) of the main dataset, (C) C-terminus (last 10 residues) of the alternate dataset, and (D) N-terminus (first 10 residues) of the alternate dataset.
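The compositional comparison in Figure 3 and the terminal windows in Figure 4 both rest on simple per-sequence calculations. The following is a minimal sketch of those calculations, not the authors' code: the function names and the toy peptide are hypothetical, and the study's actual feature pipeline is not described in this section.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the percentage of each standard residue in a peptide."""
    seq = sequence.upper()
    counts = Counter(res for res in seq if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


def terminal_windows(sequence, size=10):
    """Return the first and last `size` residues (N- and C-terminal windows)."""
    return sequence[:size], sequence[-size:]


# Hypothetical toy peptide, NOT taken from the study's datasets.
toy_peptide = "ACYCRIPACIAGEKKYGTCIYQGRLWAFCC"
composition = amino_acid_composition(toy_peptide)
n_term, c_term = terminal_windows(toy_peptide)
print(f"Cys: {composition['C']:.1f}%  N-term: {n_term}  C-term: {c_term}")
```

Averaging such per-peptide percentages over each class would give the kind of residue-level comparison (for example Cys, Arg, and Tyr) that the abstract reports.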
The performance of the machine learning models on SVC-L1 selected features for both datasets.
| Model | Hyperparameters | Training Sens (%) | Training Spec (%) | Training ACC (%) | Training AUROC | Training MCC | Validation Sens (%) | Validation Spec (%) | Validation ACC (%) | Validation AUROC | Validation MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Main dataset | |||||||||||
| SVM | C = 2, g = 1, k = rbf | 93.24 | 94.81 | 94.03 | 0.98 | 0.88 | 93.72 | 93.24 | 93.48 | 0.97 | 0.87 |
| LR | C = 1 | 92.4 | 91.67 | 92.03 | 0.97 | 0.84 | 92.75 | 91.79 | 92.27 | 0.97 | 0.85 |
| ET | ne = 30 | 93.73 | 94.08 | 93.9 | 0.98 | 0.88 | 93.24 | 93.72 | 93.48 | 0.97 | 0.87 |
| RF | ne = 90 | 91.07 | 95.41 | 93.24 | 0.98 | 0.87 | 91.3 | 95.17 | 93.24 | 0.98 | 0.87 |
| KNN | al = ball-tree, nn = 10, w = distance | 92.52 | 94.32 | 93.42 | 0.97 | 0.87 | 92.27 | 90.82 | 91.55 | 0.96 | 0.83 |
| MLP | a = identity, HL = 3, m = 100, s = adam | 92.4 | 89.73 | 91.07 | 0.95 | 0.82 | 93.72 | 87.92 | 90.82 | 0.96 | 0.82 |
| Alternate dataset | |||||||||||
| SVM | C = 2, g = 0.5, k = rbf | 95.05 | 98.46 | 96.77 | 0.99 | 0.94 | 97.1 | 99.05 | 98.09 | 0.99 | 0.96 |
| LR | C = 10 | 94.93 | 97.86 | 96.41 | 0.99 | 0.93 | 94.69 | 98.58 | 96.65 | 0.99 | 0.93 |
| ET | ne = 50 | 94.09 | 98.93 | 96.53 | 0.99 | 0.93 | 94.69 | 99.53 | 97.13 | 0.99 | 0.94 |
| KNN | al = brute, nn = 10, w = distance | 92.88 | 98.22 | 95.57 | 0.99 | 0.91 | 94.69 | 98.58 | 96.65 | 0.98 | 0.93 |
| RF | ne = 70 | 95.66 | 97.27 | 96.47 | 0.99 | 0.93 | 96.14 | 97.16 | 96.65 | 0.99 | 0.93 |
| MLP | a = tanh, HL = 10, m = 100, s = adam | 92.4 | 97.86 | 95.16 | 0.98 | 0.9 | 93.72 | 98.1 | 95.93 | 0.98 | 0.92 |
g, gamma; ne, n_estimators; k, kernel; a, activation; HL, hidden layer size; s, solver; al, algorithm; w, weight; m, max_iter; nn, n_neighbors.
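The Sens, Spec, ACC, and MCC columns follow the standard confusion-matrix definitions. The sketch below computes them from raw counts. The counts in the example are an assumption: they are back-calculated on the premise that the main validation set holds 207 defensins and 207 AMPs, a split size this section does not state. Under that premise they are consistent with the SVM validation row for the main dataset.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy (in %) and MCC from confusion counts."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Assumed counts (not given in the source): 207 positives, 207 negatives.
metrics = binary_metrics(tp=194, fn=13, tn=193, fp=14)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because MCC uses all four cells of the confusion matrix, it penalizes a model that trades sensitivity for specificity more visibly than accuracy does, which is presumably why it is reported next to ACC and AUROC.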
The performance of the machine learning models on the top 60 features for the main dataset and the top 50 features for the alternate dataset.
| Model | Hyperparameters | Training Sens (%) | Training Spec (%) | Training ACC (%) | Training AUROC | Training MCC | Validation Sens (%) | Validation Spec (%) | Validation ACC (%) | Validation AUROC | Validation MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Main top 60 | |||||||||||
| SVM | C = 2, g = 1, k = rbf | 89.26 | 96.74 | 93 | 0.98 | 0.86 | 90.82 | 97.1 | 93.96 | 0.98 | 0.88 |
| LR | C = 0.1 | 86.85 | 93.24 | 90.04 | 0.96 | 0.8 | 88.89 | 93.72 | 91.3 | 0.97 | 0.83 |
| ET | ne = 50 | 92.4 | 95.41 | 93.9 | 0.98 | 0.88 | 92.4 | 95.41 | 93.9 | 0.98 | 0.88 |
| RF | ne = 60 | 91.68 | 95.29 | 93.48 | 0.98 | 0.87 | 91.3 | 94.69 | 93 | 0.98 | 0.86 |
| MLP | a = tanh, HL = 17, m = 100, s = adam | 74.79 | 70.77 | 72.78 | 0.85 | 0.46 | 91.79 | 91.3 | 91.55 | 0.96 | 0.83 |
| KNN | al = ball-tree, nn = 10, w = distance | 91.8 | 93 | 92.4 | 0.97 | 0.85 | 91.79 | 90.34 | 91.06 | 0.96 | 0.82 |
| Alternate top 50 | |||||||||||
| SVM | C = 2, g = 1, k = rbf | 95.17 | 97.98 | 96.59 | 0.99 | 0.93 | 97.1 | 99.05 | 98.09 | 0.99 | 0.96 |
| LR | C = 1 | 95.54 | 95.02 | 95.28 | 0.99 | 0.91 | 95.65 | 95.73 | 95.69 | 0.98 | 0.91 |
| ET | ne = 40 | 95.17 | 98.22 | 96.71 | 0.99 | 0.93 | 95.65 | 98.58 | 97.13 | 0.99 | 0.94 |
| KNN | al = ball-tree, nn = 9, w = distance | 94.33 | 97.86 | 96.11 | 0.99 | 0.92 | 95.65 | 98.1 | 96.89 | 0.98 | 0.94 |
| RF | ne = 50 | 95.3 | 98.22 | 96.77 | 0.99 | 0.94 | 96.65 | 97.63 | 96.65 | 0.99 | 0.93 |
| MLP | a = tanh, HL = 15, m = 100, s = adam | 92.64 | 97.75 | 95.22 | 0.98 | 0.91 | 92.27 | 97.63 | 94.98 | 0.98 | 0.9 |
g, gamma; ne, n_estimators; k, kernel; a, activation; HL, hidden layer size; s, solver; al, algorithm; w, weight; m, max_iter; nn, n_neighbors.
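The abbreviated hyperparameter cells in the tables can be expanded into full keyword names using the footnote. The sketch below does this with the standard library only. Spelling the keywords the way scikit-learn does (`hidden_layer_sizes`, `weights`, `ball_tree`) is an assumption, since this section does not name the library used; the helper names are hypothetical.

```python
# Footnote abbreviations mapped to full names.
# The scikit-learn-style spellings are an assumption, not stated in the source.
ABBREVIATIONS = {
    "g": "gamma",
    "ne": "n_estimators",
    "k": "kernel",
    "a": "activation",
    "HL": "hidden_layer_sizes",
    "s": "solver",
    "al": "algorithm",
    "w": "weights",
    "m": "max_iter",
    "nn": "n_neighbors",
}


def coerce(raw):
    """Convert a raw value to int or float when possible, else a string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    # Assumption: "ball-tree" in the table corresponds to "ball_tree".
    return raw.replace("-", "_")


def parse_hyperparameters(cell):
    """Expand a table cell such as 'C = 2, g = 1, k = rbf' into a dict."""
    params = {}
    for item in cell.split(","):
        key, _, raw = item.partition("=")
        key = key.strip()
        params[ABBREVIATIONS.get(key, key)] = coerce(raw.strip())
    return params


svm_params = parse_hyperparameters("C = 2, g = 0.5, k = rbf")
knn_params = parse_hyperparameters("al = ball-tree, nn = 10, w = distance")
print(svm_params)
print(knn_params)
```

The resulting dictionaries could then be passed as keyword arguments to the corresponding estimator constructor, provided the library in use accepts those names.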
Figure 5. AUROC plots for the (A) main dataset, training set (top 60 selected features), (B) main dataset, validation set (top 60 selected features), (C) alternate dataset, training set (top 50 selected features), and (D) alternate dataset, validation set (top 50 selected features).
Comparison of the existing methods and DefPred by source of the dataset, size of data, major features, classifier, type, performance, and web server availability.
| Study | Source of dataset | Size of data | Major features | Classifier used | Type | Accuracy | Web server availability, status | PMID |
|---|---|---|---|---|---|---|---|---|
| | Defensin Knowledgebase | 286 P | ID_RAAA | Jackknife test | Prediction | 91.36% | No | 19591890 |
| | PubMed, iHOP, UniProt, HubMed | 238 P, 238 N | RQA descriptors | RF | Classification | 78.12% | No | Not-available |
| | NCBI, UniProt | 383 P, 383 N | AAC, DPC, PSAAC | SVM | Classification | 99% | Yes, inactive | 22670676 |
| | Defensin Knowledgebase | 333 P | iDEF-PseRAAAC | SVM | Prediction | 85.59% | Yes, inactive | 26713618 |
| | Defensin Knowledgebase | 328 P | iDEF-PseRAAC | SVM | Prediction | 91.16% | Yes, active | 31391777 |
| DefPred | CAMPR3, DRAMP2.0, Defensin Knowledgebase, Swiss-Prot | 1,036 P, 1,035 N (main); 1,036 P, 1,054 N (alternate) | Selected features | SVM | Prediction | 93.96% (main), 98.09% (alternate) | Yes, active | Not-available |
P, positive sequence; N, negative sequence.