An automatic representation of peptides for effective antimicrobial activity classification.

Jesus A Beltran¹, Gabriel Del Rio², Carlos A Brizuela¹.

Abstract

Antimicrobial peptides (AMPs) are a promising alternative to small-molecule antibiotics. These peptides are part of the innate defense system of most living organisms. To computationally identify new AMPs among the peptides these organisms produce, an automatic AMP/non-AMP classifier is required. An efficient classifier, in turn, requires a robust set of features that captures what differentiates an AMP from a non-AMP. However, the number of candidate descriptors is too large (on the order of thousands) to allow an exhaustive search of all possible combinations. Therefore, efficient and effective feature selection techniques are required. In this work, we propose an efficient wrapper technique to solve the feature selection problem for AMP identification. The method is based on a Genetic Algorithm that uses a variable-length chromosome to represent the selected features and an objective function that considers the Matthews correlation coefficient (MCC) and the number of selected features. Computational experiments show that the proposed method produces competitive results in terms of sensitivity, specificity, and MCC. Furthermore, the best classification results are achieved using only 39 out of 272 molecular descriptors.

Keywords: Antimicrobial peptide; Feature selection; Genetic algorithm; Wrapper method

Year: 2020 PMID: 32180904 PMCID: PMC7063200 DOI: 10.1016/j.csbj.2020.02.002

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Introduction

Antimicrobial peptides (AMPs) are a promising alternative for combating pathogens resistant to conventional antibiotics, mainly because of their multiple direct mechanisms of action against microbes (e.g., bacteria, fungi, and viruses) and, consequently, their low susceptibility to antimicrobial resistance. They have also been effective against multi-drug-resistant microbial pathogens in in vitro tests [1], and a few are currently being used to treat microbial infections in humans [2]. Next-Generation Sequencing (NGS) technologies are generating a vast amount of data (e.g., DNA, RNA, or protein) in which peptides with antimicrobial activity might be found. Identifying these peptides will only be possible through the development of computer-assisted strategies that can automatically evaluate large amounts of data and identify candidate antimicrobial peptides before their biological evaluation in the wet lab. In this context, an important aspect is the development of machine learning models that determine whether or not an amino acid sequence is antimicrobial. Quantitative Structure-Activity Relationship (QSAR) modeling has been widely applied to AMP discovery for the development of classification models [3]. QSAR relates the quantitative physicochemical properties extracted from the peptides, termed molecular descriptors, to their corresponding biological activity through a predictive mathematical model. There are two crucial aspects of QSAR modeling: the choice of the descriptor set that defines the features of the peptides of interest, and the selection of the statistical learning technique used to create a model [4], [5]. Computational research has mainly focused on the second aspect, for which several machine learning algorithms (MLAs) have been proposed.
Examples of these MLAs include Discriminant Analysis (DA) [6], Random Forest (RF) [6], [7], Support Vector Machine (SVM) [6], [8], [7], Artificial Neural Network (ANN) [5], [8], [7], Adaptive Neuro-Fuzzy Inference System (ANFIS) [4], Binary Logic Regression (BLR) [9], and Fuzzy K-Nearest Neighbor (FKNN) [10]. Overall, the proposed algorithms yield models with a predictive accuracy of up to 96%. However, these studies used different databases to measure the performance of their approaches for AMP recognition. On the one hand, an amino acid sequence is considered to be an AMP if it is labeled as such in a database that collects only experimentally validated sequences. On the other hand, because it is difficult to guarantee that a given sequence is not an AMP, the databases label as non-AMPs those sequences that pass a strict filtering process aimed at increasing the probability that they have no antimicrobial properties. In our case, three databases used in the literature, DAT1, DAT2, and DAT3, are considered; they are described in Subsection 3.1. A more precise definition of what an AMP is, in terms of its minimum inhibitory concentration (MIC), is required to advance to the next level of granularity in the prediction of AMP activity. There are many methods for selecting descriptors for peptide representation; they are mainly based on two approaches: expert knowledge [11], [8] or filtering methods [12], [8], [7], [13], [4]. However, these methods do not consider complex interactions within a set of descriptors. One way to overcome this limitation is to use wrapper methods, which have received little attention from the AMP research community, even though descriptor selection is essential to the performance of predictive models: the descriptors define the chemical space onto which each peptide is projected and, consequently, the classification performance depends on them.
Furthermore, a large number of descriptors can currently be calculated for peptides. In earlier studies, molecular descriptors were often selected based on chemical intuition or on observed properties that give rise to the antimicrobial activity [11], [8]. More recent studies employ hand-picked features (descriptors) or filtering methods that evaluate the features independently according to a given criterion and select the top k features [8], [7], [4]. However, most of these approaches focus on pairwise relationships and interactions between descriptors, while the biological activity might depend on the relation among three or more descriptors. Therefore, a feature selection procedure is needed to improve the performance of learning models [14]. In this paper, we propose a novel method to automatically select a peptide representation, based on molecular descriptors, that efficiently classifies the peptide's antimicrobial activity. For this purpose, our method combines what we call a species adaptive genetic algorithm (SAGA) with a machine learning model to efficiently search for promising solutions and to estimate the fitness of each subset of molecular descriptors directly. We systematically evaluated the proposed method and compared it with state-of-the-art AMP classification methods on three well-known benchmarks.

Materials and methods

The aim of our approach is to choose a molecular-descriptor representation of peptides that discriminates between AMP and non-AMP sequences. The choice of descriptors can be formulated as a feature subset selection problem (FSSP). In supervised learning, the FSSP can be defined as follows: given a dataset described by a set of features, select those features that are useful for building a good classifier [15]. In general, usefulness is given by the predictive power of the classifier rather than by the relevance of individual features. Next, we introduce some notation and formally define the FSSP.

FSSP formulation

Consider a labeled dataset $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ of n peptides described by a set of m input molecular descriptors $X = \{X_1, \ldots, X_m\}$ and a label set $Y$. Here $\mathbf{x}_i$ is an m-dimensional vector with a class label $y_i$ from Y. The component $x_{ij}$ is the measurement of the j-th molecular descriptor for the i-th peptide.

Statement. Let $\mathcal{A}$ be a machine learning algorithm, $D$ a dataset, and J a performance criterion measured over all classifiers induced by $\mathcal{A}$ and $D$; then, the formal definition of the FSSP is [16]:

$$X^{*} = \underset{S \subseteq X,\; S \neq \emptyset}{\arg\max}\; J\big(\mathcal{A}, D_S\big) \qquad (1)$$

where $D_S$ is a dimensional reduction of the dataset $D$ obtained by removing from each $\mathbf{x}_i$ the values of the variables that are not in $S$. It is important to note that the optimal feature subset is not necessarily unique, i.e., it is possible to achieve the same value of the performance criterion using different subsets of features [16]. Notice also that the size of $X^{*}$ is unknown a priori; this makes the FSSP harder than the related problem in which the size of the desired feature set is given [17].
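
To make the definition concrete, the following minimal Python sketch enumerates every non-empty subset of a tiny feature set and returns one that maximizes a criterion. The names `best_subset` and `toy_j` are illustrative assumptions and are not part of the original work; in a wrapper setting, the criterion would be the cross-validated score of a trained classifier rather than a hand-written function.

```python
from itertools import combinations

def best_subset(features, criterion):
    """Exhaustively search all non-empty subsets of `features`.

    Feasible only for very small feature sets: there are 2**m - 1 candidates.
    Ties are broken in favor of the subset found first (smaller subsets first).
    """
    best, best_score = None, float("-inf")
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            score = criterion(subset)
            if score > best_score:
                best, best_score = frozenset(subset), score
    return best, best_score

def toy_j(subset):
    # Hypothetical criterion: features 1 and 3 are jointly informative,
    # each of them alone is weakly informative, the others add nothing.
    chosen = set(subset)
    if {1, 3} <= chosen:
        return 1.0
    return 0.5 if chosen & {1, 3} else 0.0

subset, score = best_subset([1, 2, 3, 4], toy_j)
```

In this toy case the subsets {1, 3} and {1, 2, 3} reach the same score, which illustrates why the optimal subset need not be unique.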

Characterization of wrapper methods

The formulation of the FSSP in this manner allows the use of well-known optimization techniques that employ, in their inner loops, machine learning algorithms to evaluate the quality of feature subsets. Methods that use the classification performance of the machine learning algorithm to guide the search towards the optimal subset of features are categorized as wrapper methods. According to [15], three considerations characterize a wrapper method: (i) a search strategy; (ii) a performance estimation method; and (iii) a machine learning algorithm.

Search strategies. These define how to search through the space of feature subsets (for m features there are 2^m − 1 non-empty candidate subsets). In general, search strategies only partially sample the search space, since for a large number of features (i.e., more than 40), exhaustive exploration becomes infeasible [18]. The problem of finding the optimal feature subset is NP-hard [19]. Search strategies can be divided into three broad categories: exponential, greedy, and randomized. In short, exponential search guarantees finding the optimal subset of a feature set. This category includes exhaustive enumeration, branch and bound [20], and beam search [21]. In contrast, greedy and randomized searches cannot guarantee finding the optimal subset; however, they are valid alternatives when the number of features is high. On the one hand, greedy search makes a locally optimal choice: it adds or removes a single feature at a time so as to maximize the current objective function. Examples of greedy search algorithms are sequential forward selection, sequential backward selection, bidirectional search, and greedy hill-climbing search [18]. On the other hand, randomized search samples the space of possible subsets in search of the optimal feature subset.
The advantages of this approach are that it can find a solution quickly and that it can avoid getting trapped in local optima [22]. Examples of randomized search are MC1 [23], random mutation hill climbing [24], ant colony optimization [25], simulated annealing [26], and genetic algorithms [27].

Performance estimation methods. To measure the quality of a feature subset, we need a performance estimation method that measures the predictive ability of the classifier induced by a particular machine learning algorithm and a dataset represented by the reduced feature set. Accordingly, the optimal subset is the one that yields the best-performing classifier. The performance estimation method employs a metric (e.g., accuracy, MCC, precision, recall) and a re-sampling method that partitions the dataset into training and test sets; the split can be repeated multiple times. The most common re-sampling methods are cross-validation and bootstrap [15].

A machine learning algorithm. Wrapper methods need a machine learning algorithm to build a classifier. Examples include support vector machine (SVM), random forest (RF), k-nearest neighbor (k-NN), multilayer perceptron (MLP), and the C4.5 algorithm, among others.
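
As an illustration of the greedy category, the sketch below implements sequential forward selection in plain Python and applies it to a made-up criterion in which two features are only useful together. The function names and the toy criterion are assumptions introduced for this example; they are not taken from the cited works.

```python
def sequential_forward_selection(features, criterion):
    """Greedy wrapper search: add one feature at a time while the score improves."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        # Score every one-feature extension of the current subset.
        scored = [(criterion(selected + [f]), f) for f in remaining]
        score, feature = max(scored)
        if score <= best_score:
            break  # no single addition improves the objective: local optimum
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

def toy_criterion(subset):
    # Hypothetical objective: features 1 and 2 only help jointly,
    # feature 3 helps moderately on its own.
    chosen = set(subset)
    if {1, 2} <= chosen:
        return 1.0
    return 0.6 if 3 in chosen else 0.0

selected, score = sequential_forward_selection([1, 2, 3], toy_criterion)
```

Here the greedy search stops at the single feature 3 with a score of 0.6, although the subset {1, 2} would score 1.0. This is the kind of feature interaction that motivates randomized strategies such as genetic algorithms.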

The feature selection approach

In this subsection, we present a wrapper method to solve the FSSP. The three components described earlier are specified next for the AMP classification problem.

Search strategy. We propose a Species Adaptive Genetic Algorithm for Feature Selection (SAGAFS). SAGAFS adapts two well-known algorithms: a Genetic Algorithm (GA) and a Variable Length Representation Evolutionary Algorithm (VLREA) [28], [29]. GAs are commonly recommended for large-scale feature selection problems (i.e., 50 or more candidate features) [27]. VLREA, in turn, is appropriate for problems where the solution length contributes to its fitness, as happens in our case. To the best of our knowledge, this is the first time a VLR evolutionary algorithm has been applied to the feature selection problem. The proposed SAGAFS algorithm includes a variable length representation and neighboring-space strategies to efficiently sample the vast search space.

Performance estimation method. We used k-fold cross-validation to estimate the average Matthews correlation coefficient (MCC) of the induced classifier. In k-fold cross-validation, the dataset $D$ is partitioned into k non-empty disjoint subsets $D_1, \ldots, D_k$ (folds) of roughly equal size. Then, the following procedure is repeated for $i = 1, \ldots, k$: the machine learning algorithm induces a classifier using the dataset $D \setminus D_i$, and the classifier is tested on the subset $D_i$. The MCC estimate is the average over the k runs. This coefficient is given by

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP, and FN are the numbers of True Positives, True Negatives, False Positives, and False Negatives, respectively.

A machine learning algorithm. To generate a binary classifier, we used two machine learning algorithms: a linear classifier, the linear Support Vector Machine (SVM-L), and a non-linear classifier, Random Forest (RF).
The methodology adopted in this study is described in the following sections and a scheme of this is shown in Fig. 1.
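
The performance estimation step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the strided fold assignment, the zero-denominator convention, and the `train_and_predict` callback are all choices made for this example.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def k_fold_indices(n, k):
    """Partition range(n) into k disjoint folds of roughly equal size."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validated_mcc(samples, labels, train_and_predict, k=10):
    """Average the MCC over k train/test splits.

    `train_and_predict(train_x, train_y, test_x)` is a hypothetical callback
    that fits any classifier and returns predicted 0/1 labels for test_x.
    """
    scores = []
    for fold in k_fold_indices(len(samples), k):
        held_out = set(fold)
        train = [i for i in range(len(samples)) if i not in held_out]
        preds = train_and_predict([samples[i] for i in train],
                                  [labels[i] for i in train],
                                  [samples[i] for i in fold])
        truth = [labels[i] for i in fold]
        tp = sum(p == 1 and t == 1 for p, t in zip(preds, truth))
        tn = sum(p == 0 and t == 0 for p, t in zip(preds, truth))
        fp = sum(p == 1 and t == 0 for p, t in zip(preds, truth))
        fn = sum(p == 0 and t == 1 for p, t in zip(preds, truth))
        scores.append(mcc(tp, tn, fp, fn))
    return sum(scores) / len(scores)
```

Any classifier can be plugged in through the callback, which keeps the estimation procedure independent of the learning algorithm, as the wrapper formulation requires.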

Fig. 1

Schematic process of the automatic selection of peptide representation based on molecular descriptors and the antimicrobial activity classification.

From peptide sequences to molecular descriptors

Several studies have found five major properties related to the antimicrobial activity of peptides; these include conformation, charge, hydrophobic character, and secondary structure [42], [43], [11], [44]. Accordingly, molecular descriptors have been widely applied to extract these properties from peptides quantitatively. In this study, a total of 122 molecular descriptors were collected, from which 272 values were derived. The number of molecular descriptors per property was: 74 for amino acid composition, 10 for charge, 31 for hydrophobic character, 5 for secondary structure, and 2 for other properties. The molecular descriptors included in this work have been used in previous antimicrobial peptide studies (see Table 1). All of them can be calculated from peptide sequences.

Table 1

Summary of the 272 molecular descriptors considered as the universe of features for the peptide representation, grouped by dimensionality into 0D and 1D.

Group	Name	No. of molecular descriptors	No. of descriptors’ values	Reference
0D	Standard amino acid composition	1	20	[30]
	Reduced amino acid composition	10	41	[31], [32]
	Aliphatic index	1	1	[30]
	Net charge and mean net charge	6	6	[33]
	Grand Average of Hydrophilicity	2	2	[33]
	Grand Average of Hydropathy (GRAVY)	1	1	[30]
	Grand Average of Hydrophobicity	23	23	[33]
	Charge at different pH values (5, 7, and 9)	3	3	[34]
	Boman index	1	1	[35]
	Molecular weight	1	1	[30]
	Number of amino acids	1	1	[30]
1D	Instability index	1	1	[30], [36]
	Reduced amino acid Transition	10	21	[31], [32]
	Reduced amino acid distribution	50	105	[31], [32]
	Dipeptide	1	9	[32]
	Tripeptide	1	27	[32]
	Max mean hydrophobicity	1	1	[37]
	Hydrophobic moment	3	3	[38]
	Isoelectric Point	1	1	[39], [30]
	In vitro aggregation	1	1	[40], [41]
	turn structure propensity	1	1	[40], [41]
	β-sheet propensity	1	1	[40], [41]
	α-helix propensity	1	1	[40], [41]
	Total	122	272

To compute the molecular descriptors, we used two software packages: Tango [40], [41], [45] and an in-house tool, MOlecular Descriptor for AntiMicrobial Peptides (MODAMP). Tango was used to calculate four descriptors related to secondary structure (AGG, TURN, BETA, HELIX), whereas MODAMP was used to compute the remaining 268 descriptors, which are listed in Supplementary File 1. In this step, we assumed that each peptide in the input dataset is a valid sequence, i.e., each symbol comes from the standard amino acid alphabet of size 20. Given the set of molecular descriptors $X = \{X_1, \ldots, X_{272}\}$, we convert each sequence $s_i$ into a 272-dimensional vector $\mathbf{x}_i$, whose component $x_{ij}$ encodes the value of the j-th molecular descriptor for sequence $s_i$.
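
The conversion from a sequence to a numeric vector can be illustrated with two of the simplest 0D descriptors in Table 1, the standard amino acid composition and the number of amino acids. The sketch below is not MODAMP; the function names are hypothetical, and expressing composition as a fraction (rather than a percentage) is an assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def is_valid_peptide(sequence):
    """True if the sequence is non-empty and uses only the standard alphabet."""
    return bool(sequence) and set(sequence) <= set(AMINO_ACIDS)

def amino_acid_composition(sequence):
    """Fraction of each standard residue, in AMINO_ACIDS order (20 values)."""
    if not is_valid_peptide(sequence):
        raise ValueError(f"not a valid peptide sequence: {sequence!r}")
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]

def describe(sequence):
    """Toy descriptor vector: 20 composition values followed by the length."""
    return amino_acid_composition(sequence) + [float(len(sequence))]

vector = describe("KKLLKK")  # made-up sequence used only for illustration
```

A full pipeline would append the remaining descriptor values in a fixed order so that every peptide maps to a vector of the same dimensionality.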

Feature subset selection algorithm

Solution representation

The design of a suitable representation for candidate solutions is an essential step in a genetic algorithm, since it defines a mapping between the space of candidate solutions, referred to as the phenotypic space, and the space in which the search operates, referred to as the genotypic space. As described earlier, the phenotype space for the FSSP (1) is the collection of all subsets of the input feature set X, excluding the empty set. Previous works on genetic algorithms for the FSSP have used a fixed-length binary string to represent the phenotype, where each bit position (set to 1 or 0) encodes whether the corresponding one of the m features of X is selected or not [46], [27]. However, given the large number of molecular descriptors that can be computed for peptides, and given that a candidate solution is only a subset of them, the binary encoding may generate large chromosomes in which only a few bits encode the features of a candidate solution. For this reason, we adopted a variable length representation (VLR) that encodes only the features belonging to the candidate solution. In SAGAFS, a chromosome g is a subset of the integers {1,…,m} that encodes the index of each selected feature. A genotype $g = \{g_1, \ldots, g_k\}$ of cardinality k thus represents the subset $\{X_{g_1}, \ldots, X_{g_k}\}$ of X. For example, the chromosome g = {3, 17, 268} represents the feature subset {X_3, X_17, X_268}.
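
The relation between the variable-length encoding and the fixed-length binary encoding can be sketched as follows. The helper names are hypothetical and the six-feature universe is a made-up toy case.

```python
def to_binary_mask(chromosome, m):
    """Fixed-length encoding: bit j-1 is 1 iff feature index j is selected."""
    return [1 if j in chromosome else 0 for j in range(1, m + 1)]

def from_binary_mask(mask):
    """Variable-length encoding: the set of 1-based selected indices."""
    return {j for j, bit in enumerate(mask, start=1) if bit}

def project(vector, chromosome):
    """Keep only the selected descriptor values, in index order."""
    return [vector[j - 1] for j in sorted(chromosome)]

g = {2, 5}
mask = to_binary_mask(g, m=6)
reduced = project([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], g)
```

With m = 272 and a solution of 39 features, the set stores 39 integers, whereas the mask always stores 272 bits regardless of how many features are selected.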

Fitness function

The quality of a subset is based on the performance of the classifiers induced by a machine learning algorithm from the training dataset restricted to that subset of features. Additionally, we include a second term that measures model complexity in terms of the number of features. Hence, a higher classification performance indicates a better candidate solution, and if two subsets have the same performance, the smaller one is the more suitable candidate. The quality of a given subset, represented by a chromosome g, is defined by a two-term function, referred to as (2). Its first term is the MCC estimated by k-fold cross-validation: for each fold, the absolute value of the MCC is obtained by testing the induced classifier on the validation fold after training it on the remaining folds. The second term in (2) is a tiebreaker criterion that encourages small subsets; it depends on the number of selected features, a weighting constant, and m, the cardinality of the universe of features.
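
One possible shape for such a two-term fitness is sketched below. The exact expression of (2) is not reproduced here: the subtraction of a size penalty, the normalization by m, and the default `weight` are assumptions chosen only to show how a small second term can act as a tiebreaker.

```python
def fitness(fold_mccs, n_selected, m, weight=0.01):
    """Mean absolute MCC over folds minus a small size penalty (assumed form)."""
    if not 0 < n_selected <= m:
        raise ValueError("n_selected must be in 1..m")
    mean_abs_mcc = sum(abs(v) for v in fold_mccs) / len(fold_mccs)
    return mean_abs_mcc - weight * n_selected / m

# Same cross-validated MCC, different subset sizes: the smaller subset wins.
small = fitness([0.9, 0.9], n_selected=39, m=272)
large = fitness([0.9, 0.9], n_selected=100, m=272)
```

Because the penalty is bounded by `weight`, a subset with a clearly better MCC still ranks above a smaller subset with a worse MCC.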

Main steps of SAGAFS

At generation t = 0, an initial population P(0) is randomly generated. For each chromosome g, k integer values out of the m available are selected at random, where k is restricted by lower and upper bounds. These bounds restrict the size of the individuals and thereby delimit the dimensionality of the feature space that SAGAFS samples at a given moment. Subsequently, the fitness value of each g in the population is computed using function (2). From the current population P(t), parent individuals are obtained by standard binary tournament selection with replacement [47]. The selected individuals are added to the parent set and shuffled. Each pair of consecutive parents in this set is recombined, with a given crossover probability, using the subset size-oriented common feature crossover operator (SSOCF) [48]. SSOCF was adapted to our representation (i.e., VLR) because the original version was designed for fixed-length representations (the details of the adapted SSOCF are given in Supplementary File 2). SSOCF preserves the features common to both parents in their offspring. As a result, this operator produces two children for each pair of parents. A k-indel mutation is then applied, with a given mutation probability, to each offspring in the offspring set. Conventionally, the mutation probability is a static, user-defined value; in SAGAFS, however, it is estimated dynamically by a self-adaptive mutation method [49]. This method increases the mutation probability if the similarity of the current population is above a threshold (i.e., low diversity); otherwise, the probability is decreased. In detail, the similarity value of a population is computed from s, the number of identical individuals in it, and the self-adaptive mutation probability is calculated from a specified similarity threshold, the initial mutation probability, and a step size. We developed a mutation operator, named k-insertions/deletions (k-indel), to vary the dimensionality of a particular offspring.
k-indel works by randomly picking k integers from [1,m]. Each integer is deleted from the offspring if it is already present and inserted otherwise. For example, if the picked integers are 2, 6, and 9 and the offspring contains feature 2 but neither 6 nor 9, then feature 2 is deleted while features 6 and 9 are added. As a next step in SAGAFS, we compute the fitness value of each individual o in the offspring population. To select the chromosomes that will form the population of the next generation, we applied standard elitist survival selection [47]: the current population and the offspring set were merged and sorted by fitness value, and the top individuals, as many as the population size, were selected as the new population P(t+1). The stop criterion combines a maximum number of generations and a maximum number of generations without improvement in the objective function (2). The best feature subset and its fitness value are the output of SAGAFS. Using this subset, both the validation set and the training set were reduced in dimensionality (i.e., the datasets were represented only by the optimal subset of features).
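
The operators described in this subsection can be sketched in Python as follows. This is an illustrative reading of the text, not the released implementation: drawing the k integers without repetition, refusing to return an empty subset, and the additive update rule for the mutation probability are all assumptions, and the function names are hypothetical.

```python
import random

def k_indel(offspring, k, m, rng=random):
    """Pick k distinct integers in [1, m]; delete those present, insert the rest."""
    picks = set(rng.sample(range(1, m + 1), k))
    mutated = set(offspring) ^ picks  # symmetric difference toggles membership
    # Assumption: an empty subset is not a valid solution, so keep the original.
    return mutated if mutated else set(offspring)

def binary_tournament(population, fitness, rng=random):
    """Draw two individuals with replacement and keep the fitter one."""
    a, b = rng.choice(population), rng.choice(population)
    return a if fitness(a) >= fitness(b) else b

def elitist_survival(population, offspring, fitness, size):
    """Merge parents and offspring, keep the `size` fittest individuals."""
    return sorted(population + offspring, key=fitness, reverse=True)[:size]

def adapt_mutation_probability(p, similarity, threshold, step):
    """Raise p when the population is too similar, lower it otherwise.

    Assumed additive rule, clipped to [0, 1].
    """
    p = p + step if similarity > threshold else p - step
    return min(1.0, max(0.0, p))
```

With the offspring {2, 4, 7} and the picks 2, 6, and 9, `k_indel` returns {4, 6, 7, 9}, which matches the behavior described above: feature 2 is deleted and features 6 and 9 are added.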

Results and discussions

To evaluate the performance of SAGAFS, we ran it 30 times on each dataset. We then selected the best solution obtained for each dataset and compared it with state-of-the-art AMP classification methods and with publicly available AMP prediction tools. The implemented algorithm and the evaluated datasets are available for download at https://github.com/gdelrioifc/AMPFeatureSelection. The evaluation of the algorithm and the main results are described next.

Peptide datasets (Benchmarks)

We considered three benchmark datasets widely used for the binary antimicrobial classification task. We used these datasets to measure, in an unbiased way, the performance of the molecular descriptors obtained by SAGAFS. They are DAT1 [4], DAT2 [6], and DAT3, which was used in [10] and taken from [50], [51]. Fig. 2 shows the overlap among these datasets: the left part shows the intersection of the complete datasets, while the right part shows the intersections of their AMP and Non-AMP partitions. The overlap occurs only among the sets of antimicrobial peptides (AMPs), even though the three datasets used a similar methodology to retrieve non-antimicrobial (Non-AMP) sequences. One way to measure how difficult it is to discriminate a set of AMPs from Non-AMPs at the sequence level is to compute their similarity. If this similarity is close to zero, the dataset is not challenging, since a simple sequence-similarity-based algorithm will be able to separate the classes. Conversely, if this measure is large, the dataset will be difficult to separate at the sequence level. Computing a similarity measure with the Dover Analyzer software [52] at a 30% threshold, we found the following similarity values: 0.88% for DAT1, 36.56% for DAT2, and 18.83% for DAT3. This means that, at the sequence level, DAT1 should be the easiest dataset to discriminate, while DAT2 should be the hardest.

Fig. 2

Venn diagram of considered benchmark datasets for SAGAFS’s test. The level of overlap among datasets DAT1 [4], DAT2 [6], and DAT3 [10] corresponds only to AMPs, i.e., there is no intersection between non-antimicrobial peptides of any pair of datasets.

AMP prediction methods

Many methods for AMP classification have been described in the literature; unfortunately, only a few of them are publicly available. We analyzed the performance of four state-of-the-art AMP classification methods and six publicly available AMP tools. We compared the performance of our approach with the following methods: ANFIS [4], CAMP [7], iAMP-2L [10], and MLAMP [53]. The datasets reported by these methods were used for the comparison: DAT1 to compare our method with ANFIS [4], DAT2 to compare with CAMP [7], and DAT3 to compare with iAMP-2L [10] and MLAMP [53].

Classification results

Table 2 shows the mean results obtained by SAGAFS after 30 runs for each dataset. The fitness function was computed using 10-fold cross-validation over 75% of the data for DAT1, 70% for DAT2, and 100% for DAT3, following the protocols of the authors of the methods that used these data. In general, SAGAFS performs uniformly across the 30 runs (i.e., its standard deviation is small). Furthermore, the machine learning algorithm with the best performance is Random Forest (RF). The best performance was observed for DAT3, with an Acc(%) of 96.28 ± 0.61 and an MCC of 0.93 ± 0.01. These results outperform those recently reported for a method that used a linear projection of feature subspaces [13] instead of a wrapper to achieve the same goal; we gained this improvement at the expense of a higher computational cost.

Table 2

Dataset	MLA⁎	Acc (%)	Sn	Sp	F-score	MCC	ROC area
DAT1	SVM-L	92.70(±1.51)	0.91(±0.05)	0.94(±0.04)	0.93(±0.02)	0.86(±0.03)	0.93(±0.02)
DAT1	RF	93.76(±1.01)	0.93(±0.02)	0.94(±0.02)	0.94(±0.01)	0.88(±0.02)	0.95(±0.01)

DAT2	SVM-L	82.01(±0.73)	0.81(±0.03)	0.83(±0.03)	0.82(±0.01)	0.63(±0.02)	0.82(±0.01)
DAT2	RF	92.50(±0.40)	0.91(±0.04)	0.93(±0.03)	0.92(±0.00)	0.85(±0.01)	0.97(±0.00)

DAT3	SVM-L	95.12(±0.42)	0.94(±0.01)	0.96(±0.00)	0.95(±0.00)	0.90(±0.01)	0.95(±0.00)
DAT3	RF	96.28(±0.61)	0.96(±0.01)	0.96(±0.01)	0.96(±0.01)	0.93(±0.01)	0.99(±0.00)

MLA, Machine Learning Algorithm: RF = Random Forest; SVM-L = Support Vector Machine-Linear.

Mean performance values, with their respective standard deviations, of the best solutions obtained with the SAGAFS algorithm for the three benchmark datasets after 30 runs. The results are presented as the mean ± one standard deviation.

To study the impact of the optimal set of descriptors obtained by SAGAFS on the classifiers' performance, we compared the performance achieved when using all candidate molecular descriptors with the results shown in Table 2. The comparison is shown in Fig. 3 and indicates that, on average, the classifier built from the SAGAFS solution is competitive with the baseline classifier, i.e., the classifier that uses all 272 molecular descriptors (SVM-control, RF-control), whose performance is indicated by triangles in Fig. 3. In all cases, the cardinality of the optimal descriptor set represents a reduction of at least 75% with respect to the 272 descriptors. For instance, the best solution obtained for DAT3 needed only 39 molecular descriptors to achieve the accuracy and MCC reported for DAT3 with RF in Table 2, with an ROC area of 0.99.
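
The reduction figures quoted above follow from simple arithmetic, shown here as a short Python check. The function name is an illustrative assumption.

```python
def reduction_percent(n_selected, n_total=272):
    """Percentage of the candidate descriptors that were discarded."""
    return 100.0 * (1 - n_selected / n_total)

# Best DAT3 solution: 39 of 272 descriptors retained.
best_dat3 = reduction_percent(39)

# A reduction of at least 75% means at most this many descriptors are kept.
max_size_for_75 = int(272 * (1 - 0.75))
```

Under this reading, the 39-descriptor solution corresponds to a reduction of roughly 85.7%, and every reported solution contains at most 68 descriptors.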

Fig. 3

Performance comparison among the best solutions obtained by SAGAFS + SVM-L and SAGAFS + RF after 30 runs. The triangles indicate the MCC of the baseline model (upper left and right panels). The lower panels (left and right) depict the percentage reduction in the number of descriptors with respect to the baseline (272 descriptors).

Previous works that used DAT1 [4] and DAT2 [6] started from a universe of features that is a subset of the starting set used here. However, the work that used DAT3 [10] built a set of features based on amino acid composition, measuring five physicochemical properties (hydrophobicity, pK1, pK2, pI, and molecular weight) for each of the 20 amino acids, for an initial set of 100 descriptors. The final set of features in the work dealing with DAT1 was composed of two features, in vitro aggregation and peptide length, the latter also selected by our wrapper method, while the work using DAT2 [6] ended up with 64 features; unfortunately, these were not provided by the authors. Hence, at this point, we can see some similarities with the features previously selected to model AMPs, but it is not possible to fully compare them with those observed in our study.

Relevance of selected features

Since the feature selection algorithm (SAGAFS) is run 30 times, each run generates a set of features corresponding to the fittest individual found in that run. Then, for every feature in the 30 sets, we compute its relative frequency. Fig. 4 shows the indices of the most frequently selected features. Each graph depicts the 10 most frequently selected features. The top row shows the results when SAGAFS uses an SVM applied to DAT1 (first column), DAT2 (second column), and DAT3 (third column). The bottom row shows the corresponding results when SAGAFS uses a Random Forest classifier. For instance, for DAT1, the most frequently selected feature for both models (SVM and RF) is the one with index 268, which corresponds to in vitro aggregation.
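
The relative-frequency computation can be sketched as follows. The three runs listed are made-up selections used only to exercise the code, and the tie-breaking rule (smaller index first) is an assumption.

```python
from collections import Counter

def relative_frequencies(runs):
    """Fraction of runs in which each feature index was selected."""
    counts = Counter(index for selected in runs for index in set(selected))
    return {index: count / len(runs) for index, count in counts.items()}

def top_features(runs, n=10):
    """The n most frequently selected indices, ties broken by smaller index."""
    freqs = relative_frequencies(runs)
    return sorted(freqs, key=lambda index: (-freqs[index], index))[:n]

runs = [{268, 257, 27}, {268, 257}, {268, 10}]  # hypothetical best subsets
frequencies = relative_frequencies(runs)
ranking = top_features(runs, n=4)
```

In the study, `runs` would hold the 30 best subsets per dataset and learning model, and the ten highest-ranked indices would be the ones plotted.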

Fig. 4

Most frequently selected features for SAGAFS on each dataset. The plots in the lower part show the indices of the most frequent features for the model generated by Random Forest (RF), while the plots in the upper part show the indices for SVM-L.

If we analyze the top 10 features, taken from the best solution in each of the 30 SAGAFS runs, that appear under both learning models (i.e., under SVM-L and RF), we find the following coincidences. For DAT1, in vitro aggregation (268), length (257), electric charge (27), and maximum of the mean hydrophobicity (263) appear in both models. For DAT2, the common features are molecular weight (258), length (257), and electric charge (22). For DAT3, they are frequency of Methionine (10), amphipathicity (28), frequency of Tryptophan (18), and solvent accessibility of certain k-mers (92). The biological significance of some of these features has already been identified in previous works; for instance, net charge, amphipathicity, and hydrophobicity were found to be relevant for antimicrobial activity [4]. Moreover, Tryptophan has already been noted to be present in a family of archetypal AMPs [54], whereas Methionine has not (see for instance [55], [56]). Our results suggest that Methionine may be enriched in AMPs with respect to non-AMPs, despite being under-represented in AMPs. We believe these results may promote further investigation into the role of this amino acid in AMP function.

Performance comparison with state-of-art classifiers

Table 3 compares the performance of SAGAFS with ANFIS [4] on DAT1: ANFIS outperforms SAGAFS on the testing dataset and the opposite occurs on the validation dataset, while the overall performance is similar for both algorithms. These results are consistent with the low similarity between AMP and Non-AMP sequences in this dataset, i.e., we expected high performance because the classes are not hard to separate even at the sequence level.

Table 3

Performance comparison of SAGAFS method with ANFIS [4] on the dataset DAT1.

Method	MLAa	Dataset	ACC(%)	Sn	Sp	F1-score	MCC
[4]	ANFIS	Training	96.23	1.00	0.93	0.96	0.93
		Testing	100	1.00	1.00	1.00	1.00
		Validation	94.34	0.96	0.92	0.95	0.89
		Overall	96.73	0.99	0.95	0.97b	0.94

SAGAFS	RF	Training	100	1.00	1.00	1.00	1.00
		Testing	84.48	0.88	0.79	0.84	0.70
		Validation	100	1.00	1.00	1.00	1.00
		Overall	96.89	0.97	0.97	0.97	0.94

Machine Learning Algorithm (MLA): RF = Random Forest; ANFIS = Adaptive Neuro-fuzzy Inference System.

Bold font indicates the best value per measure.
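For reference, the measures reported in the tables of this section (ACC, Sn, Sp, F1-score, and MCC) can all be derived from a confusion matrix using their standard definitions. The sketch below is only illustrative: the function name `classification_metrics` is hypothetical, returning 0.0 when a denominator is zero is a convention chosen here, and the example counts are invented and do not come from any dataset in this work.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion-matrix counts."""

    def ratio(num, den):
        # Convention assumed here: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    sn = ratio(tp, tp + fn)              # sensitivity (recall)
    sp = ratio(tn, tn + fp)              # specificity
    acc = ratio(tp + tn, tp + fp + tn + fn)
    f1 = ratio(2 * tp, 2 * tp + fp + fn)
    mcc = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {"ACC": acc, "Sn": sn, "Sp": sp, "F1": f1, "MCC": mcc}


# Hypothetical counts: 50 AMPs (45 detected) and 50 non-AMPs (40 rejected).
example = classification_metrics(tp=45, fp=10, tn=40, fn=5)
```

Unlike ACC, MCC uses all four cells of the confusion matrix, so a classifier that favors one class scores poorly even when the accuracy looks high.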

The comparison of SAGAFS with CAMP on DAT2 is shown in Table 4. This is the hardest dataset according to the similarity measure used. The performances of SAGAFS and CAMP are similar in terms of MCC and ACC(%).

Table 4

Performance comparison of SAGAFS method and CAMP [7] on the dataset DAT2.

Method	MLAa	MCC (Train)	MCC (Test)	Sn(%)	Sp(%)	ACC(%)	10-fold CV ACC(%)
[7]	RF	0.82	0.84	90.8	93.7	92.5	93.4
	SVM	0.91	0.83	89.7	93.1	91.6	92.6
	ANN	0.72	0.72	82.9	88.9	86.3	86.9
SAGAFS	RF	0.87	0.84	88.5	95.14	92.4	93.3

Machine Learning Algorithm (MLA): RF = Random Forest; SVM = Support Vector Machine with polynomial kernel (degree 4); ANN = Artificial Neural Network. Bold font indicates the best value per measure.

Table 5 compares the performance of SAGAFS with iAMP-2L [10] and MLAMP [53] on DAT3. The performance achieved by SAGAFS is higher than that reported for iAMP-2L and MLAMP on all performance measures. This is the second hardest dataset according to the similarity measure employed.

Table 5

Performance comparison of SAGAFS method with iAMP-2L [10] and MLAMP [53] on the dataset DAT3.

Method	MLAa	Sn(%)	Sp(%)	ACC(%)	MCC
iAMP-2L [10]	FKNN	87.13	86.03	86.32	0.727
MLAMP [53]	RF	77.00	94.60	89.90	0.737
SAGAFS	RF	96.64	97.36	97.00	0.940

Machine Learning Algorithm (MLA): RF = Random Forest; FKNN = Fuzzy K-Nearest Neighbor. Bold font indicates the best value per measure.

Conclusion

A novel and effective evolutionary algorithm to solve the feature selection problem in antimicrobial peptide classification has been proposed. The approach combines two algorithms, a Genetic Algorithm and a variable-length evolutionary algorithm, with an objective function that combines the classifier’s MCC with the chromosome length, i.e., the number of selected descriptors. Results from computational experiments show that the proposed method finds a representation for the peptides that yields models which are competitive with, and on DAT3 outperform, publicly available state-of-the-art methods for AMP classification. Our findings suggest that our approach could be used for efficient preliminary computational screening to identify novel antimicrobial peptides. Future research is aimed at extending the comparison with other available AMP predictors over a larger dataset. We are also planning to apply our SAGAFS algorithm to multi-class classification of AMPs, i.e., once a given peptide is known to be an AMP, to identify its specific function.
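To make the idea of an objective that trades off MCC against chromosome length concrete, the sketch below shows one possible form. It is an assumption for illustration only: this section does not give the exact SAGAFS objective, so the linear penalty, the `weight` parameter, and the function name `fitness` are all hypothetical. The default of 272 total descriptors is the figure stated in the abstract.

```python
def fitness(mcc, n_selected, n_total=272, weight=0.1):
    """Hypothetical objective: reward high MCC, penalize long chromosomes.

    mcc        -- Matthews correlation coefficient of the trained classifier
    n_selected -- chromosome length, i.e. number of selected descriptors
    n_total    -- size of the full descriptor pool
    weight     -- assumed trade-off factor (not taken from the paper)
    """
    if not 0 < n_selected <= n_total:
        raise ValueError("n_selected must be in the range 1..n_total")
    # Linear length penalty, scaled so that it lies between 0 and weight.
    return mcc - weight * (n_selected / n_total)


# With equal MCC, the shorter chromosome receives the higher fitness.
short_solution = fitness(mcc=0.94, n_selected=39)
long_solution = fitness(mcc=0.94, n_selected=100)
```

Under this assumed form, a larger `weight` pushes the search toward smaller descriptor sets at the cost of some classification performance.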

CRediT authorship contribution statement

Jesus A. Beltran: Conceptualization, Methodology, Writing - original draft, Software, Validation. Gabriel Del Rio: Conceptualization, Writing - review & editing, Validation. Carlos A. Brizuela: Conceptualization, Methodology, Validation, Writing - review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
