Jesus A. Beltran, Gabriel Del Rio, Carlos A. Brizuela.
Abstract
Antimicrobial peptides (AMPs) are a promising alternative to small-molecule-based antibiotics. These peptides are part of the innate defense system of most living organisms. To computationally identify new AMPs among the peptides these organisms produce, an automatic AMP/non-AMP classifier is required. For such a classifier to be efficient, a set of robust features that capture what differentiates an AMP from a non-AMP has to be selected. However, the number of candidate descriptors is too large (on the order of thousands) to allow an exhaustive search over all possible combinations. Therefore, efficient and effective feature selection techniques are required. In this work, we propose an efficient wrapper technique to solve the feature selection problem for AMP identification. The method is based on a genetic algorithm that uses a variable-length chromosome to represent the selected features and an objective function that considers both the Matthews correlation coefficient (MCC) and the number of selected features. Computational experiments show that the proposed method produces competitive results regarding sensitivity, specificity, and MCC. Furthermore, the best classification results are achieved using only 39 of the 272 molecular descriptors.
Keywords: Antimicrobial peptide; Feature selection; Genetic algorithm; Wrapper method
Year: 2020 PMID: 32180904 PMCID: PMC7063200 DOI: 10.1016/j.csbj.2020.02.002
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
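The wrapper described in the abstract (a genetic algorithm whose variable-length chromosome is a list of selected feature indices, scored by MCC with a penalty on the number of features) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy data, the Random Forest settings, and the penalty weight `alpha` are assumptions.

```python
import random

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import cross_val_predict

# Toy stand-in for the peptide descriptor matrix (the paper uses 272 descriptors).
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           random_state=0)
N_FEATURES = X.shape[1]

def fitness(chromosome, alpha=0.01):
    """Wrapper objective: cross-validated MCC minus a penalty on the
    number of selected features (the exact weighting is an assumption)."""
    cols = sorted(set(chromosome))
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    preds = cross_val_predict(clf, X[:, cols], y, cv=5)
    return matthews_corrcoef(y, preds) - alpha * len(cols)

# A variable-length chromosome is simply a list of selected feature indices,
# so individuals in the population may select different numbers of features.
rng = random.Random(0)
population = [rng.sample(range(N_FEATURES), rng.randint(5, 20))
              for _ in range(10)]
best = max(population, key=fitness)
print(len(best), round(fitness(best), 3))
```

A full GA would add crossover and mutation over these index lists; the penalty term is what drives the search toward compact feature subsets such as the 39-descriptor solution reported here.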
Fig. 1 Schematic process of the automatic selection of the peptide representation based on molecular descriptors and the antimicrobial activity classification.
Summary of the 272 molecular descriptors considered as the universe of features for the peptide representation; they are grouped by dimensionality into 0D and 1D.
| Group | Name | No. of molecular descriptors | No. of descriptors’ values | Reference |
|---|---|---|---|---|
| 0D | Standard amino acid composition | 1 | 20 | |
| | Reduced amino acid composition | 10 | 41 | |
| | Aliphatic index | 1 | 1 | |
| | Net charge and mean net charge | 6 | 6 | |
| | Grand average of hydrophilicity | 2 | 2 | |
| | Grand average of hydropathy (GRAVY) | 1 | 1 | |
| | Grand average of hydrophobicity | 23 | 23 | |
| | Charge at different pH values (5, 7, and 9) | 3 | 3 | |
| | Boman index | 1 | 1 | |
| | Molecular weight | 1 | 1 | |
| | Number of amino acids | 1 | 1 | |
| 1D | Instability index | 1 | 1 | |
| | Reduced amino acid transition | 10 | 21 | |
| | Reduced amino acid distribution | 50 | 105 | |
| | Dipeptide | 1 | 9 | |
| | Tripeptide | 1 | 27 | |
| | Max mean hydrophobicity | 1 | 1 | |
| | Hydrophobic moment | 3 | 3 | |
| | Isoelectric point | 1 | 1 | |
| | In vitro aggregation | 1 | 1 | |
| | Turn structure propensity | 1 | 1 | |
| | | 1 | 1 | |
| | | 1 | 1 | |
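Descriptors like those in the table's first rows are simple functions of the peptide sequence. As a minimal illustration, the standard amino acid composition (the 20-value 0D descriptor) can be computed as below; the example peptide (magainin 2) is chosen for illustration and is not stated in this record.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Standard amino acid composition: the fraction of each of the
    20 standard residues in the peptide (a 20-value 0D descriptor)."""
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}

# Magainin 2, a well-known 23-residue AMP, as an illustrative input.
pep = "GIGKFLHSAKKFGKAFVGEIMNS"
comp = composition(pep)
print(len(pep), round(comp["K"], 3))  # → 23 0.174
```

The other 0D descriptors in the table (net charge, molecular weight, number of amino acids, etc.) are computed in a similarly direct way from the sequence.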
Fig. 2 Venn diagram of the benchmark datasets considered for testing SAGAFS. The overlap among datasets DAT1 [4], DAT2 [6], and DAT3 [10] involves only AMPs; i.e., there is no intersection between the non-antimicrobial peptides of any pair of datasets.
Mean performance values, with their respective standard deviations, of the best solutions obtained with the SAGAFS algorithm on the three benchmark datasets after 30 runs. The results are presented as the mean ± one standard deviation.
| Dataset | MLA | Acc (%) | Sn | Sp | F-score | MCC | ROC area |
|---|---|---|---|---|---|---|---|
| DAT1 | SVM-L | | | | | | |
| | RF | | | | | | |
| DAT2 | SVM-L | | | | | | |
| | RF | | | | | | |
| DAT3 | SVM-L | | | | | | |
| | RF | | | | | | |
MLA, Machine Learning Algorithm: RF = Random Forest; SVM-L = Support Vector Machine-Linear.
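The metrics reported throughout this record (Sn, Sp, MCC) all derive from the confusion matrix of the AMP/non-AMP classifier. A minimal sketch, using illustrative counts rather than the paper's data:

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Sensitivity (Sn), specificity (Sp), and Matthews correlation
    coefficient (MCC) from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, mcc

# Illustrative counts only -- not taken from the paper's results.
sn, sp, mcc = confusion_metrics(tp=88, fp=8, tn=92, fn=12)
print(round(sn, 2), round(sp, 2), round(mcc, 3))  # → 0.88 0.92 0.801
```

Unlike accuracy, MCC stays informative on imbalanced AMP/non-AMP datasets, which is why the SAGAFS objective function optimizes it directly.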
Fig. 3 Performance comparison among the best solutions obtained by SAGAFS + SVM-L and SAGAFS + RF after 30 runs. The triangles indicate the MCC of the baseline model (upper left and right panels). The lower panels (left and right) depict the percentage reduction in the number of descriptors with respect to the baseline (272 descriptors).
Fig. 4 Most frequently selected features by SAGAFS on each dataset. The plots in the lower part show the indices of the most frequent features for the model generated by Random Forest (RF), while the plots in the upper part show the indices for SVM-L.
Performance comparison of the SAGAFS method with ANFIS [4] on dataset DAT1.
| Method | MLA | Dataset | ACC (%) | Sn | Sp | F1-score | MCC |
|---|---|---|---|---|---|---|---|
| ANFIS | | Training | 96.23 | 1.00 | 0.93 | 0.96 | 0.93 |
| | | Testing | 100 | 1.00 | 1.00 | 1.00 | 1.00 |
| | | Validation | 94.34 | 0.96 | 0.92 | 0.95 | 0.89 |
| | | Overall | 96.73 | | | | 0.95 |
| SAGAFS | RF | Training | 100 | 1.00 | 1.00 | 1.00 | 1.00 |
| | | Testing | 84.48 | 0.88 | 0.79 | 0.84 | 0.70 |
| | | Validation | 100 | 1.00 | 1.00 | 1.00 | 1.00 |
| | | Overall | | | | | 0.97 |
Machine Learning Algorithm (MLA): RF = Random Forest; ANFIS = Adaptive Neuro-fuzzy Inference System.
Bold font indicates the best value per measure.
Performance comparison of the SAGAFS method with CAMP [7] on dataset DAT2.
| Method | MLA | MCC (Train) | MCC (Test) | Sn (%) | Sp (%) | ACC (%) | 10-fold CV ACC (%) |
|---|---|---|---|---|---|---|---|
| CAMP | RF | | 0.82 | | | | 93.7 |
| | SVM | | 0.83 | 89.7 | 93.1 | 91.6 | 92.6 |
| | ANN | 0.72 | 0.72 | 82.9 | 88.9 | 86.3 | 86.9 |
| SAGAFS | RF | | 0.87 | 88.5 | 92.4 | | 93.3 |
Machine Learning Algorithm (MLA): RF = Random Forest; SVM = Support Vector Machine with polynomial kernel (degree 4); ANN = Artificial Neural Network. Bold font indicates the best value per measure.
Performance comparison of the SAGAFS method with iAMP-2L [10] and MLAMP [53] on dataset DAT3.
| Method | MLA | Sn(%) | Sp(%) | ACC(%) | MCC |
|---|---|---|---|---|---|
| iAMP-2L | FKNN | 87.13 | 86.03 | 86.32 | 0.727 |
| MLAMP | RF | 77.00 | 94.60 | 89.90 | 0.737 |
| SAGAFS | RF | | | | |
Machine Learning Algorithm (MLA): RF = Random Forest; FKNN = Fuzzy K-Nearest Neighbor. Bold font indicates the best value per measure.