Literature DB >> 31372196

AtbPpred: A Robust Sequence-Based Prediction of Anti-Tubercular Peptides Using Extremely Randomized Trees.

Balachandran Manavalan¹, Shaherin Basith¹, Tae Hwan Shin¹, Leyi Wei², Gwang Lee¹.

Abstract

Mycobacterium tuberculosis is one of the most dangerous pathogens in humans. It acts as an etiological agent of tuberculosis (TB), infecting almost one-third of the world's population. Owing to the high incidence of multidrug-resistant TB and extensively drug-resistant TB, there is an urgent need for novel and effective alternative therapies. Peptide-based therapy has several advantages, such as diverse mechanisms of action, low immunogenicity, and selective affinity to bacterial cell envelopes. However, the identification of anti-tubercular peptides (AtbPs) via experimentation is laborious and expensive; hence, the development of an efficient computational method is necessary for the prediction of AtbPs prior to both in vitro and in vivo experiments. To this end, we developed a two-layer machine learning (ML)-based predictor called AtbPpred for the identification of AtbPs. In the first layer, we applied a two-step feature selection procedure and identified the optimal feature set individually for nine different feature encodings, whose corresponding models were developed using extremely randomized tree (ERT). In the second-layer, the predicted probability of AtbPs from the above nine models were considered as input features to ERT and developed the final predictor. AtbPpred respectively achieved average accuracies of 88.3% and 87.3% during cross-validation and an independent evaluation, which were ~8.7% and 10.0% higher than the state-of-the-art method. Furthermore, we established a user-friendly webserver which is currently available at http://thegleelab.org/AtbPpred. We anticipate that this predictor could be useful in the high-throughput prediction of AtbPs and also provide mechanistic insights into its functions.

Entities: Chemical Disease Gene Species

Keywords: Antitubercular peptide; Cross-validation; Extremely randomized tree; Tuberculosis; Two-layer framework; Two-step feature selection

Year: 2019 PMID： 31372196 PMCID： PMC6658830 DOI： 10.1016/j.csbj.2019.06.024

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN： 2001-0370 Impact factor: 7.271

Introduction

Tuberculosis (TB) is a major deadly disease caused by an infection from the bacterium Mycobacterium tuberculosis [1,2]. The World Health Organization (WHO) recently reported that around 10 million people were infected with TB and 1.6 million people died from the disease in 2017 alone (https://www.who.int/en/news-room/fact-sheets/detail/tuberculosis). Despite evolving treatment strategies, its prevalence is increasing owing to inadequate drug use, poor-quality drugs, and the premature discontinuation of treatment by patients, leading to amplified drug-resistant strains of Mycobacterium tuberculosis [3]. Multidrug-resistant tuberculosis (MDR-TB) is a form of TB caused by bacteria that do not respond to first-line anti-TB drugs, like rifampicin and isoniazid [4]. Owing to its high prevalence, MDR-TB is considered a public health crisis and a health security threat (https://www.who.int/en/news-room/fact-sheets/detail/tuberculosis). Another form of MDR-TB, extensively drug-resistant tuberculosis, does not respond to second-line anti-TB drugs and accordingly is much more serious. Furthermore, the adverse side effects of anti-tubercular drugs and prolonged treatment durations are major issues. Therefore, it is necessary to find alternative effective therapeutics with new mechanisms of anti-tubercular action or methods to potentiate existing drugs with minimized treatment periods and costs. Since the serendipitous finding of penicillin in the 1920s, peptide-based therapy has gained momentum in the area of drug discovery. One of the most influential characteristics of peptides is their pleiotropic effects towards a wide range of biological targets, thus making them better drug candidates with lower toxicity than that of small molecules. Several studies have shown the anti-tubercular properties of peptides and their unique mechanism of actions, suggesting that they are an ideal approach for TB management [5]. Peptides derived from human immune and non-immune cells, venom, fungi, bacteria, and several other sources have been shown to act as effective anti-tubercular agents [3]. Some aspects of anti-tubercular peptides (AtbPs), like their low immunogenicity, selective affinity to negatively charged bacterial cell envelopes, and targeted immune responses against bacterium, make them attractive alternatives to conventional TB drugs [3,6]. Owing to the growing interest in peptide-based therapy for the identification of effective anti-tubercular agents, their design and discovery process should be accelerated. However, the experimental identification and development of AtbPs is time-consuming and expensive. To assist and expedite the discovery of AtbPs, in silico methods are needed prior to their synthesis [5]. A trending computational method, machine learning (ML), could provide an excellent platform for the prediction and design of AtbPs. To the best of our knowledge, only one ML-based method has been developed for AtbP prediction [7]. The authors utilized five different classifiers (support vector machine (SVM), random forest (RF), SMO, J48, Naïve-Bayes) and four different feature encodings (amino acid composition (AAC), dipeptide composition (DPC), binary profiles (NC5) and terminal composition). Two prediction models, such as SVM-based ensemble model and SVM using hybrid features achieved better performances [7]. Despite encouraging results, the method has certain limitations in terms of efficiency and accuracy. Furthermore, feature selection techniques for optimization of features and exploration of other feature encodings were not implemented. Henceforth, in this study, we proposed a novel sequence-based two-layer predictor, i.e. AtbPpred, for the classification of AtbPs or non-AtbPs from given peptide sequences. To develop a prediction model, we explored nine different feature encodings that covers various aspects of sequence information, including AAC, DPC, NC5, composition-transition-distribution (CTD), quasi-sequence-order (QSO), amino acid index (AAI), conjoint triad (CTF), grouped tripeptide composition (GTPC), and grouped dipeptide composition (GDPC). Subsequently, a two-step feature selection protocol was applied on each encoding and identified their corresponding optimal feature set. In the first-layer, nine optimal feature set-based prediction models were developed using extremely randomized tree (ERT), whose predicted probability scores were further considered as input features to ERT in the second-layer and developed the final model. Comparative results for AtbPpred and the state-of-the-art method using benchmark and independent datasets showed remarkable improvements. Therefore, we foresee that our work will pave way for the development of novel computational methods and facilitate experimentalists in the discovery of novel AtbPs.

Materials and Methods

Proposed Predictor Framework

The overall procedure for the proposed peptide sequence-based predictor is shown in Fig. 1. It consists of four steps: (i) construction of benchmark and independent datasets; (ii) feature extraction and optimization using two-step feature selection (F-score algorithm was applied for feature ranking, followed by a sequential forward search (SFS)) protocol; and (iii) using ERT, the optimal feature set of nine different encodings based prediction models were developed individually in the first layer, whose predicted probability of AtbPs were integrated and further considered as input features to ERT for the development of the final prediction model in the second layer. (iv) performance assessment and webserver development. In our predictor, the predicted outcome for each sequence is 0 or 1, where 0 denotes non-AtbP and 1 denotes AtbP.

Fig. 1

Framework of the proposed algorithm. It consists of four steps: (i) dataset construction, (ii) feature extraction and their optimization using two-step feature selection protocol, (iii) construction of two-layer prediction model, and (iv) assessment of performance and development of webserver.

Construction of Benchmark and Independent Datasets

Recently, Usmani et al. [7] constructed antitubercular peptide datasets and developed prediction models. Basically, two datasets were constructed, namely AntiTb_MD and AntiTb_RD, where both datasets were comprised of identical positive samples (246 AtbPs), but different negative samples (246 non-AtbPs). AntiTb_RD and AntiTb_MD respectively contained random peptides generated from Swiss-Prot and anti-bacterial peptides. In this study, we utilized both the datasets and developed a prediction model separately. All positive samples contained true experimentally validated ATbPs derived from the public database, AntiTbpdb [8]. Table S1 shows the distribution of ATbPs from different species, where major contribution is from Mycobacterium tuberculosis (66%), moderate contribution is from Mycobacterium smegmatis (24%), and other species contribution is marginal. Moreover, the sequence lengths in two separate datasets were in the range of 5 to 61 amino acids. We utilized the same dataset for the following reasons: (i) comprised of most recently updated datasets; (ii) contains nonredundant data; and (iii) allowance of a fair comparison between our proposed and the existing methods. For each dataset, 80% of the samples (199 AtbPs and 199 non-AtbPs) were randomly selected for the development of a prediction model and the remaining 20% of samples (47 AtbPs and 47 non-AtbPs) were utilized for model evaluation.

Feature Extraction

Generally, feature extraction is one of the important steps in designing well-performed classifiers. It generates a fixed length of feature vectors from the given peptide sequences that have varied lengths. Features used in this work can be categorized into two major groups: sequence-based features and physicochemical-based features.

Sequence-Based Features

The differences between positive and negative samples can be reflected by amino acid sequences bearing various factors, including profiles, composition, permutation and combination modes of amino acids, and physicochemical properties. Here, we extracted seven types of sequence-based features: AAC, DPC, QSO, CTF, GTPC, GDPC, and NC5. Amino acid composition (AAC) AAC reflects the occurrences of standard amino acids in a given peptide normalized by the sequence length and is widely applied in bioinformatics [[9], [10], [11]]. It has a fixed length of 20 features, which can be formulated as follows:where fv= (i = 1,2,3…,20) is the normalized frequency of the ith amino acid in a given peptide. R is the quantity of type i observed in a peptide. Dipeptide composition (DPC) DPC reflects the composition of a residue pair (e.g. Ala-Ala, Ala-Cys) occurring in a given peptide, further describing the fraction of amino acids and their local order [7,12]. It has a fixed length of 400 vectors, which can be formulated as follows:where fv represents the frequency of jth amino acid pair in {AA, AC, AD, AE,…,YY}. Binary profile (NC5) In the NC5, each amino acid is encoded as a 20-dimensional 0/1 vector. For instance, the amino acids of type A (b(A)) and type C (b(C)) are encoded as (1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0) and (0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), respectively. For a given peptide P, its N- or C– terminus with a length of m amino acids was translated as follows:where the BPF(m) dimension is 20 × m, where m is assigned a value of 5 at both termini to obtain BPFN5 and BPFC5. Furthermore, these two termini (NC5) were combined to generate 200-dimensional feature vector. Grouped tripeptide composition (GTPC) In GTPC encoding, amino acids are divided into five categories according to their physicochemical properties: aliphatic group (G, A, V, L, M, and I), aromatic group (F, Y, and W), positive charge group (K, R, and H), negative charged group (G, D, and E) and uncharged group (S, T, C, P, N, and Q). The tripeptide composition of these five categories generates a fixed length of 125-dimensional feature vector. Grouped dipeptide composition (GDPC) In GDPC encoding, the dipeptide composition of five categories of amino acid physicochemical properties (listed in GTPC) generates a fixed length of 25-dimensional feature vector. Quasi-sequence-order (QSO) QSO encoding of the given peptide sequence results in a fixed length of a 100-dimensional feature vector, by measuring the physicochemical distance between the amino acids within the sequence. A detailed description of QSO feature encoding along with a set of equations has been provided in previous studies [13,14]. Conjoint triad (CTF) CTF encoding generates a 343-dimensional feature vector for a given peptide sequence by clustering amino acids into seven classes according to their dipoles and side chain volumes. A detailed description of CTF with a set of equation has been reported previously [15].

Physicochemical Properties-Based Features

Composition-Transition-Distribution (CTD) In the CTD feature, composition (C) indicates the composition of amino acids, transition (T) signifies the percentage of amino acid residues with certain characteristic that are followed by other amino acids, and distribution (D) measures the sequence length within which 1%, 25%, 50%, 75%, and 100% of the amino acids with certain characteristics are located. In CTD, composition, transition, and distribution are respectively encoded as a 21, 21, 105-dimensional feature vector. Amino acid index (AAI) Previously, eight high-quality AAIs (accessions LIFS790101, TSAJ990101, MAXF760101, BIOV880101, CEDJ970104, BLAM930101, MIYS990104, and NAKH920108) were identified from 566 total AAIs in the AAIndex database [16,17] by applying a clustering technique [18]. AAI generates a 160 (=20 amino acids × 8 properties) dimensional vector, which has been widely applied in numerous sequence-based prediction tasks [19].

Feature Optimization

Feature optimization is one of the important steps in ML [20] that has been used in the improvisation of classification performance. In this study, an F-score algorithm with a SFS protocol was used to filter out noisy and irrelevant features, after which a subset containing optimal features was selected. This two-step protocol has been successfully applied in various predictions [[21], [22], [23]]. In the first step, an F-score algorithm is used to rank the actual features, and to sort these features in a descending order, thereby generating a ranked feature list. The F-score of the ith feature is defined as:where , and represent mean values of the ith feature in the combined (both positive and negative), positive, and negative datasets, respectively. n+ and n− represent the number of positive and negative samples, respectively. and represent the ith feature of jth positive instance and ith feature of jth negative instance, respectively. In the second step, two features were chosen from the ranked features list, and added sequentially as an input feature to ERT which were further utilized for training and developing the corresponding prediction models. Ultimately, the features corresponding to the model with highest accuracy were recognized as optimal features for the respective ML classifier.

Machine Learning Algorithms

We explored the commonly used six different ML algorithms, including ERT, gradient boosting (GB), k-nearest neighbor (KNN), logistic regression (LR), random forest (RF), and SVM. Another five ML-based algorithms have been discussed previously [19,[24], [25], [26], [27], [28], [29], [30]], whose parameter search range is given in supplementary Table S2. Since ERT implemented in AtbPpred exhibited better performance than other ML methods, a brief description of this method and its utilization is detailed below.

Extremely Randomized Tree (ERT)

ERT, another powerful decision tree based method developed by Geurts et al. [31], has been widely used in various sequence-based prediction problems [19,32]. ERT incorporates a stronger randomization technique that reduces the variance of the model. The ERT algorithm is similar to that of RF, except for two main differences: (i) ERT uses all training samples to construct each tree with varying parameters, rather than the bagging procedure used in RF; and (ii) ERT randomly chooses the node split upon construction of each tree, rather than the best split used in RF. In this study, the ERT algorithm was implemented using the scikit-learn (v 0.18.1) library in Python [33]. The grid search approach is used for optimizing the number of trees (ntree), number of randomly selected features (mtry), and minimum number of samples required to split an internal node (nsplit) of the ERT algorithm. The search ranges for the three parameters were 50 ≤ntree≤ 2000 with a step size of 25, 1 ≤mtry≤ 15 with a step size of 1, and 1 ≤nsplit≤ 12 with a step size of 1, respectively.

10-Fold Cross-Validation (CV)

Three CV methods, i.e. an independent dataset test, a sub-sampling (or k-fold CV) test, and a leave-one-out CV (LOOCV) test, are frequently used to calculate the expected success rate of a developed predictor [34,35]. Among the three methods, the LOOCV test is deemed the least arbitrary and most objective, as demonstrated by Eqs. 28–32 in ref. [36]. Although it is widely used to examine the quality of various predictors [[37], [38], [39], [40], [41], [42], [43]], it is time- and resource-intensive. Thus, 10-fold CV was used to examine the proposed models. In 10-fold CV, the training dataset was randomly partitioned into 10 subsets. One subset was used as a test set and the remaining nine subsets were used as the training sets. This procedure is repeated 10 times, where each subset is treated as a test set at least once. Results were averaged to obtain the performance of the classifier.

Performance Evaluation

To evaluate the performance of the constructed models, four common measurements in binary classification tasks were used [23,29,[44], [45], [46], [47], [48]], i.e. sensitivity, specificity, accuracy, and Matthews correlation coefficients (MCC). They were calculated as follows:where TP is the number of true positives (i.e. AtbPs classified correctly as AtbPs) and TN is the number of true negatives (i.e. non-AtbPs classified correctly as non-AtbPs). FP is the number of false positives (i.e. AtbPs classified incorrectly as non-AtbPs) and FN is the number of false negatives (i.e. non-AtbPs classified incorrectly as AtbPs). Additionally, the receiver operating characteristic (ROC) curve, was generated to visually evaluate the comprehensive performance of different classifiers.

Results and Discussion

Evaluation of Various ML-Classifiers Using Nine Different Feature Encodings

Generally, exploring different classifiers using the same dataset is essential, rather than selecting a specific classifier [20,22,49]. Hence, we explored commonly used six different ML algorithms (ERT, RF, SVM, GB, LR, and KNN) [[50], [51], [52], [53], [54]] to evaluate the effectiveness of the ML method in AtbP prediction. For a fair comparison, all ML classifiers were trained and evaluated using 10-fold CV on benchmark datasets, whose corresponding ML parameters were tuned and the optimal parameters were identified using a grid search procedure. To generate a robust prediction model, 10-fold CV was repeated 10 times for each classifier by random partitioning of the benchmark datasets, which lead to ten optimized ML parameters for each classifier. However, we considered the median ML parameter estimates to develop a final prediction model for each feature encoding. Fig. 2A and B respectively represent the performance of six different classifiers on AntiTb_MD and AntiTb_RD datasets. To get an overview of each classifier performance, we computed the average MCC from nine different feature encodings (Fig. 2A), where ERT, GB, KNN, LR, RF, and SVM respectively achieved an average MCC of 0.550, 0.535, 0.379, 0.344, 0.540, and 0.453. Particularly, ERT showed 1.0–20.6% higher MCC scores than that of other five classifiers, demonstrating its superiority in AtbPs prediction. From Fig. 2B, ERT, GB, KNN, LR, RF, and SVM respectively achieved an average MCC of 0.687, 0.666, 0.558, 0.586, 0.666, and 0.638. Specifically, ERT showed 2.1–12.9% higher MCC scores when compared to other classifiers. Overall, ERT classifier attained better performance when compared to other five classifiers regardless of the datasets. Hence, we considered only ERT classifier for further analysis.

Fig. 2

Performance of various classifiers in distinguishing between AtbPs and non-AtbPs with respect to nine feature descriptors. (A) and (B) respectively represent performances based on AntiTb_MD and AntiTb_RD benchmark datasets.

Selection of Optimal Features for Each Encoding

To examine whether the feature selection protocol could improve each encoding-based prediction model performance, we applied a commonly used two-step feature selection protocol (i.e. F-score based ranking, followed by SFS) on each encoding. Fig. 3A shows the accuracy curves with gradual addition of features from the ranked feature list for the ERT classifier based on nine different encodings using AntiTb_MD dataset. Results showed that the six feature encodings (DPC, NC5, QSO, CTF, GTPC, and GDPC), whose accuracy curve gradually improved, reached its maximum point and subsequently declined upon the addition of ranked features. Conversely, three encodings (AAC, AAI, and CTD) reached its maximum point using all features, further indicating the equal significance of all features. Using AntiTb_RD dataset, the accuracy curve gradually improved for six feature encodings (CTD, DPC, NC5, QSO, CTF, and GTPC) and reached its maximum point, which then declined upon the addition of ranked features (Fig. 3B). However, a maximum point was observed for three encodings (AAC, AAI, and GDPC) with all features. To check the improvement, we computed the difference in accuracy between the optimal feature set and control (using all features). Models developed using AntiTB_MD dataset (Fig. 4A) showed a major improvement (>3%) for three encodings (GTPC, NC5, and DPC), a marginal improvement (<1%) for three encodings (QSO, CTF, and GDPC), and a tie for three encodings (AAC, AAI, and CTD). Using AntiTB_RD dataset, a major improvement was observed only for DPC encoding (Fig. 4B), a slight improvement (< 3%) for four encodings (CTF, NC5, CTF, and GTPC), and a tie for four encodings (AAC, AAI, QSO, and GDPC). Furthermore, the selected optimal features are significantly reduced for both datasets, where AntiTB_MD (Fig. 4C) and AntiTB_RD (Fig. 4D) optimal features respectively contained 59.1% and 51.7% from the total nine feature encodings. Overall, feature selection protocol improved majority of feature encoding performances and significantly reduced the feature dimension on both AntiTB_MD and AntiTB_RD datasets.

Fig. 3

Fig. 4

Comparison between original features and optimal features in terms of performance and feature dimension. The percentage of improvement in accuracy calculated between the control and the optimal feature set is shown in (A) and (B), respectively represent AntiTb_MD and AntiTb_RD. The percentage of selected features (optimal features) from the original features shown in (C) and (B), respectively represent AntiTb_MD and AntiTb_RD.

Sequential forward search (SFS) for discriminating between AtbPs and non-AtbPs. The maximum accuracy (SFS peak) obtained from 10-fold cross-validation is shown for each feature encoding. (A) and (B) respectively represent performances based on AntiTb_MD and AntiTb_RD benchmark datasets. Comparison between original features and optimal features in terms of performance and feature dimension. The percentage of improvement in accuracy calculated between the control and the optimal feature set is shown in (A) and (B), respectively represent AntiTb_MD and AntiTb_RD. The percentage of selected features (optimal features) from the original features shown in (C) and (B), respectively represent AntiTb_MD and AntiTb_RD.

Construction of AtbPpred

Two-layer prediction model has been successfully applied to address various biological problems [20,22,27,28,55,56]. Inspired by these studies, we implemented a similar approach for AtbPs prediction. The optimal models obtained for each encoding from the above step are considered as the first layer models (Table 1). Since the performance pattern is similar between two datasets (AntiTB_MD and AntiTB_RD), we focused only on AntiTB_MD. Using AntiTB_MD dataset, DPC encoding appeared to be the most powerful encoding and it achieved the MCC and accuracy of 0.664 and 0.832, respectively (Table 1). Whereas, the remaining eight encodings also achieved a reasonable performance with accuracies ranging from 75 to 80%, further indicating its practicality in AtbPs prediction due to their complementary feature representation from a different perspective. Instead of selecting the best model from Table 1, we considered all these models to generate a robust and final prediction in the second layer. Basically, we considered the predicted probability of AtbPs (values ranging from 0.0 to 1.0) from nine individual optimal models as input features to ERT and developed a final prediction model called AtbPpred.

Table 1

Performance of various classifiers on the benchmark dataset.

Dataset	Encoding	MCC	Accuracy	Sensitivity	Specificity	AUC
AntiTb_MD	AAC	0.594	0.797	0.764	0.829	0.853
	AAI	0.588	0.792	0.724	0.859	0.857
	CTD	0.559	0.769	0.633	0.905	0.809
	CTF	0.599	0.799	0.774	0.824	0.849
	DPC	0.664	0.832	0.809	0.854	0.886
	GDPC	0.506	0.751	0.694	0.809	0.798
	GTPC	0.604	0.802	0.774	0.829	0.837
	NC5	0.568	0.784	0.779	0.789	0.826
	QSO	0.548	0.774	0.769	0.779	0.845
AntiTb_RD	AAC	0.715	0.852	0.764	0.940	0.909
	AAI	0.708	0.844	0.729	0.960	0.906
	CTD	0.665	0.832	0.799	0.864	0.883
	CTF	0.765	0.882	0.859	0.905	0.908
	DPC	0.820	0.910	0.889	0.930	0.945
	GDPC	0.635	0.817	0.779	0.853	0.883
	GTPC	0.674	0.837	0.814	0.859	0.889
	NC5	0.684	0.839	0.774	0.905	0.878
	QSO	0.708	0.849	0.769	0.930	0.881

The first and the second column represent the dataset and the feature encoding employed in this study. The third, fourth, fifth, sixth, and the seventh columns, respectively represent the MCC, accuracy, sensitivity, specificity, and AUC.

Performance of various classifiers on the benchmark dataset. The first and the second column represent the dataset and the feature encoding employed in this study. The third, fourth, fifth, sixth, and the seventh columns, respectively represent the MCC, accuracy, sensitivity, specificity, and AUC. AtbPpred based on AntiTB_MD dataset achieved the best performance with the MCC, accuracy, sensitivity, specificity, and area under the curve of 0.700, 0.849, 0.819, 0.879, and 0.909, respectively. To show the effectiveness of AtbPpred, we compared its performance with nine feature encoding predictors (Fig. 5A). Specifically, the MCC and accuracy of the proposed predictor was 3.57–19.4% and 1.7–9.8% higher than the individual predictors, thus indicating the effectiveness of our approach by integrating various feature encodings, leading to an improved performance. Similarly, AtbPpred based on AntiTB_RD dataset achieved the best performance with the MCC, accuracy, sensitivity, specificity, and area under the curve of 0.834, 0.917, 0.905, 0.930, and 0.942, respectively. Specifically, the MCC and accuracy of the proposed predictor was 0.7–19.9% and 0.8–1.0% higher than the individual predictors (Fig. 5B), thus indicating the effectiveness of our approach.

Fig. 5

Performance comparison of AtbPpred and nine different feature encodings based methods. (A) and (B) respectively represent the performances based on AntiTb_MD and AntiTb_RD benchmark datasets. (C) and (D) respectively represent performances based on AntiTb_MD and AntiTb_RD independent datasets.

Performance of AtbPpred on Independent Dataset

To assess the generalization, robustness, and practical application of our proposed method, we evaluated its performance using the independent dataset and compared the results with those obtained using first-layer models. AtbPpred based on AntiTB_MD achieved the best performance with the MCC, accuracy, sensitivity, specificity and area under the curve of 0.793, 0.894, 0.830, 0.957, and 0.934, respectively (Fig. 5C). Specifically, the MCC and accuracy of the proposed predictor was 7.9–25.1% and 4.3–12.8% higher than the first layer models. Similarly, AtbPpred based on AntiTB_MD achieved the best performance with the MCC, accuracy, sensitivity, specificity, and area under the curve of 0.704, 0.851, 0.809, 0.894, and 0.899, respectively (Fig. 5D). Specifically, the MCC and accuracy of the proposed predictor was 2.2–16.8% and 1.0–8.5% higher than the first layer models, thus indicating the utility and robustness of our two-layer approach. To summarize, the experiments based on the independent dataset highlights the importance and requisite to employ more comprehensive and discriminative feature encodings and integrate them into a consolidated framework to further enhance the model design and performance.

Comparison of AtbPpred With the Existing Method

We developed two prediction models using the AntiTb_MD and AntiTb_RD datasets and compared their performances with the state-of-the art method, Antitbpred. Notably, Antitbpred also contains two prediction models using two different datasets. The rationale for considering this method in our analysis is as follows: (i) the authors trained and validated their prediction models using the same training dataset presented in this study and (ii) this method has been reported to demonstrate excellent performance in AtbP identification. First, we compared the performance of our proposed AtbPpred method on two benchmarking datasets. As shown in Table 2, AtbPpred achieved an overall better performance than Antitbpred in terms of MCC, accuracy, sensitivity, and specificity on two benchmarking datasets. Using a P-value threshold of 0.01, our proposed method significantly outperformed Antitbpred. Secondly, we compared the performance of AtbPpred on two independent datasets. AtbPpred achieved an overall better performance than Antitbpred in terms of MCC, accuracy, sensitivity, and specificity on two independent datasets (Table 3). Inclusively, AtbPpred shows the best and consistent performance on both benchmark and independent datasets, further suggesting its ability to perform well with unknown peptides when compared to the existing method.

Table 2

Performance of various classifiers on the benchmark dataset.

Dataset	Methods	MCC	Accuracy	Sensitivity	Specificity	AUC	P-value
AntiTb_MD	AtbPpred	0.700	0.849	0.819	0.879	0.909	–
AntiTb_MD	Antitbpred	0.550	0.775	0.768	0.773	0.820	0.000656
AntiTb_RD	AtbPpred	0.834	0.917	0.905	0.930	0.942	–
AntiTb_RD	Antitbpred	0.640	0.817	0.787	0.846	0.870	0.001013

The first and the second column represent the dataset and the classifier name employed in this study. The third, fourth, fifth, sixth, and the seventh columns respectively represent the MCC, accuracy, sensitivity, specificity, and AUC. For comparison, we have included Antitbpred metrics reported in the literature [7]. The last column represents the pairwise comparison of ROC area under curves (AUCs) between AtbPpred and the Antitbpred using a two-tailed t-test. P < .01 indicates a statistically meaningful difference between AtbPpred and the selected method (shown in bold).

Table 3

Performance of various classifiers on the independent dataset.

Dataset	Methods	MCC	Accuracy	Sensitivity	Specificity	AUC	P-value
AntiTb_MD	AtbPpred	0.793	0.894	0.830	0.957	0.934	–
AntiTb_MD	Antitbpred	0.520	0.759	0.750	0.767	0.830	0.020
AntiTb_RD	AtbPpred	0.705	0.851	0.809	0.894	0.899	–
AntiTb_RD	Antitbpred	0.570	0.785	0.733	0.838	0.860	0.4470

The first and the second column represent the dataset and the classifier name employed in this study. The third, fourth, fifth, sixth, and the seventh columns respectively represent the MCC, accuracy, sensitivity, specificity, and AUC. For comparison, we have included Antitbpred metrics reported in the literature [7]. The last column represents the pairwise comparison of ROC area under curves (AUCs) between AtbPpred and Antitbpred using a two-tailed t-test. P < .05 indicates a statistically meaningful difference between AtbPpred and the selected method (shown in bold).

Performance of various classifiers on the benchmark dataset. The first and the second column represent the dataset and the classifier name employed in this study. The third, fourth, fifth, sixth, and the seventh columns respectively represent the MCC, accuracy, sensitivity, specificity, and AUC. For comparison, we have included Antitbpred metrics reported in the literature [7]. The last column represents the pairwise comparison of ROC area under curves (AUCs) between AtbPpred and the Antitbpred using a two-tailed t-test. P < .01 indicates a statistically meaningful difference between AtbPpred and the selected method (shown in bold). Performance of various classifiers on the independent dataset. The first and the second column represent the dataset and the classifier name employed in this study. The third, fourth, fifth, sixth, and the seventh columns respectively represent the MCC, accuracy, sensitivity, specificity, and AUC. For comparison, we have included Antitbpred metrics reported in the literature [7]. The last column represents the pairwise comparison of ROC area under curves (AUCs) between AtbPpred and Antitbpred using a two-tailed t-test. P < .05 indicates a statistically meaningful difference between AtbPpred and the selected method (shown in bold).

Feature Selection Analysis

To explain the improved performance after feature optimization, we compared the spatial distribution between the optimal and original features. For an intuitive comparison, T-distributed stochastic neighbor embedding (t-SNE) implemented in Scikit with default parameters (n_components = 2, perplexity = 30, and learning_rate = 100) was employed for each encoding to reduce the multi-dimensional space into a two-dimensional one. Here, we compared nine probabilistic features that were obtained from the first layer with the top five feature encodings (CTD, DPC, AAI, AAC, and QSO) on AntiTB_RD dataset. Fig. 6 shows t-SNE distribution of the original and optimal features in the two-dimensional space. As shown in Fig. 6A–E, the positive (AtbPs) and negative (non-AtbPs) samples in the original feature space overlapped, indicating that the original feature space cannot effectively separate AtbPs from non-AtbPs. Conversely, probabilistic features (Fig. 6E) showed that most of the positives and negative samples in the feature space could be easily differentiated when compared to the original feature space, thus improving the performance. Furthermore, we computed t-SNE distribution for AntiTB_MD (Fig. 7) and observed similar trends with the AntiTB_RD results.

Fig. 6

Fig. 7

t-SNE visualization of the AntiTb_MD in a two-dimensional feature space. The dark cyan and salmon circles respectively represent AtbPs and non-AtbPs. (A) CTF, (B) DPC, (C) GTPC, (D) NC5, (E) QSO, and (F) the nine-dimensional probabilistic features (PF).

t-SNE visualization of the AntiTb_RD in a two-dimensional feature space. The dark cyan circles and salmon circles represent AtbPs and non-AtbPs, respectively. (A) CTF, (B) DPC, (C) AAI, (D) AAC, (E) QSO, and (F) nine-dimensional probabilistic features (PF). t-SNE visualization of the AntiTb_MD in a two-dimensional feature space. The dark cyan and salmon circles respectively represent AtbPs and non-AtbPs. (A) CTF, (B) DPC, (C) GTPC, (D) NC5, (E) QSO, and (F) the nine-dimensional probabilistic features (PF).

Implementation of a Webserver

As mentioned in [57] and suggested in a series of publications [[58], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68]] highlighting the importance in the development of webservers, we established a user-friendly webserver, AtbPpred (http:/thegleelab.org/AtbPpred), which is aimed at reaching a wide research community. To validate our findings, all data sets utilized in this study can be freely downloaded from our web server. Below, we provide a simple three-step guideline in the utility of our webserver to obtain final predicted outcomes. In the first step, users can select any one of the two prediction models. In the second step, submit the query sequences in the input query box. Note that the input sequences should be in FASTA format. Examples of FASTA-formatted sequences can be seen below the input box. In the final step, the ‘Submit’ button is clicked to provide the prediction results as the output. For each run, users can submit a maximum of 3000 peptides for a single run. Moreover, we scanned the entire APD database [69] and AMPfun dataset [70] and built a list of potential anti-tubercular peptides, which is available in our webserver (http://thegleelab.org/AtbPpred/AtbPData.html).

Conclusion

In this study, we developed a novel sequence-based two-layer predictor called AtbPpred for the identification of AtbPs from the provided peptide sequences. In this predictor, the optimal feature set was identified individually from nine different feature encodings and developed their corresponding prediction models in the first layer. Subsequently, all these models predicted scores were further considered as features and developed the final prediction model in the second layer. Unlike the previous method [7], AtbPpred integrates different aspects of sequence information through nine prediction models, thereby overcoming each model limitations and generating more stable predictions. Hence, AtbPpred showed consistent performance using both training and independent datasets, demonstrating the practicability and benefits of our proposed method. Two explanations could shed light on the robustness of our method. Firstly, we utilized ERT classifier for training. When compared to other ML algorithms, ERT showed better performance. Secondly, the probabilistic features used in the second layer can more effectively distinguish AtbPs and non-AtbPs in feature space, when compared to the feature encodings used in the first layer. Besides AtbP prediction, our proposed framework could be further extended to other peptide sequence-based predictors and applied to diverse computational biology fields [[71], [72], [73]]. Furthermore, our proposed method along with the increasing availability of experimentally verified data and novel features will greatly improve the prediction of AtbPs. To enable its wide use in the research community, we made AtbPpred available as a user-friendly public web server. AtbPpred is expected to be a valuable tool in the identification of hypothetical AtbPs in a high-throughput and cost-effective manner, further enabling the characterization of their functional mechanisms.

Funding

This work was supported by the National Natural Science Foundation of China (No. 61701340), the Basic Science Research Program through the National Research Foundation (NRF) of Korea funded by the Ministry of Education, Science, and Technology [2018R1D1A1B07049572 and 2018R1D1A1B07049494] and ICT & Future Planning [2016M3C7A1904392].

Declarations of Competing Interests

None declared.

68 in total

1. AAindex: amino acid index database.

Authors: S Kawashima; M Kanehisa
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Predicting protein-protein interactions based only on sequences information.

Authors: Juwen Shen; Jian Zhang; Xiaomin Luo; Weiliang Zhu; Kunqian Yu; Kaixian Chen; Yixue Li; Hualiang Jiang
Journal: Proc Natl Acad Sci U S A Date: 2007-03-05 Impact factor: 11.205

3. i6mA-Pred: identifying DNA N6-methyladenine sites in the rice genome.

Authors: Wei Chen; Hao Lv; Fulei Nie; Hao Lin
Journal: Bioinformatics Date: 2019-08-15 Impact factor: 6.937

4. Bastion6: a bioinformatics approach for accurate prediction of type VI secreted effectors.

Authors: Jiawei Wang; Bingjiao Yang; André Leier; Tatiana T Marquez-Lago; Morihiro Hayashida; Andrea Rocker; Yanju Zhang; Tatsuya Akutsu; Kuo-Chen Chou; Richard A Strugnell; Jiangning Song; Trevor Lithgow
Journal: Bioinformatics Date: 2018-08-01 Impact factor: 6.937

Review 5. Anti-mycobacterial peptides: from human to phage.

Authors: Tieshan Teng; Jiafa Liu; Hongping Wei
Journal: Cell Physiol Biochem Date: 2015-01-15

6. SeqTU: A Web Server for Identification of Bacterial Transcription Units.

Authors: Xin Chen; Wen-Chi Chou; Qin Ma; Ying Xu
Journal: Sci Rep Date: 2017-03-07 Impact factor: 4.379

7. Prediction of Cell-Penetrating Potential of Modified Peptides Containing Natural and Chemically Modified Residues.

Authors: Vinod Kumar; Piyush Agrawal; Rajesh Kumar; Sherry Bhalla; Salman Sadullah Usmani; Grish C Varshney; Gajendra P S Raghava
Journal: Front Microbiol Date: 2018-04-12 Impact factor: 5.640

8. A Web Resource for Designing Subunit Vaccine Against Major Pathogenic Species of Bacteria.

Authors: Gandharva Nagpal; Salman Sadullah Usmani; Gajendra P S Raghava
Journal: Front Immunol Date: 2018-10-02 Impact factor: 7.561

9. M6APred-EL: A Sequence-Based Predictor for Identifying N6-methyladenosine Sites Using Ensemble Learning.

Authors: Leyi Wei; Huangrong Chen; Ran Su
Journal: Mol Ther Nucleic Acids Date: 2018-07-09 Impact factor: 8.886

10. PIP-EL: A New Ensemble Learning Method for Improved Proinflammatory Peptide Predictions.

Authors: Balachandran Manavalan; Tae Hwan Shin; Myeong Ok Kim; Gwang Lee
Journal: Front Immunol Date: 2018-07-31 Impact factor: 7.561

21 in total

1. Computational prediction of species-specific yeast DNA replication origin via iterative feature representation.

Authors: Balachandran Manavalan; Shaherin Basith; Tae Hwan Shin; Gwang Lee
Journal: Brief Bioinform Date: 2021-07-20 Impact factor: 11.622

2. Comparative analysis of machine learning algorithms on the microbial strain-specific AMP prediction.

Authors: Boris Vishnepolsky; Maya Grigolava; Grigol Managadze; Andrei Gabrielian; Alex Rosenthal; Darrell E Hurt; Michael Tartakovsky; Malak Pirtskhalava
Journal: Brief Bioinform Date: 2022-07-18 Impact factor: 13.994

3. MPMABP: A CNN and Bi-LSTM-Based Method for Predicting Multi-Activities of Bioactive Peptides.

Authors: You Li; Xueyong Li; Yuewu Liu; Yuhua Yao; Guohua Huang
Journal: Pharmaceuticals (Basel) Date: 2022-06-03

4. iTTCA-MFF: identifying tumor T cell antigens based on multiple feature fusion.

Authors: Hongliang Zou; Fan Yang; Zhijian Yin
Journal: Immunogenetics Date: 2022-03-05 Impact factor: 3.330

5. Extremely-randomized-tree-based Prediction of N⁶-Methyladenosine Sites in Saccharomyces cerevisiae.

Authors: Rajiv G Govindaraj; Sathiyamoorthy Subramaniyam; Balachandran Manavalan
Journal: Curr Genomics Date: 2020-01 Impact factor: 2.236

6. i6mA-Fuse: improved and robust prediction of DNA 6 mA sites in the Rosaceae genome by fusing multiple feature representation.

Authors: Md Mehedi Hasan; Balachandran Manavalan; Watshara Shoombuatong; Mst Shamima Khatun; Hiroyuki Kurata
Journal: Plant Mol Biol Date: 2020-03-05 Impact factor: 4.076

7. Mounting Behaviour Recognition for Pigs Based on Deep Learning.

Authors: Dan Li; Yifei Chen; Kaifeng Zhang; Zhenbo Li
Journal: Sensors (Basel) Date: 2019-11-12 Impact factor: 3.576

8. i4mC-Mouse: Improved identification of DNA N4-methylcytosine sites in the mouse genome using multiple encoding schemes.

Authors: Md Mehedi Hasan; Balachandran Manavalan; Watshara Shoombuatong; Mst Shamima Khatun; Hiroyuki Kurata
Journal: Comput Struct Biotechnol J Date: 2020-04-08 Impact factor: 7.271

9. IRC-Fuse: improved and robust prediction of redox-sensitive cysteine by fusing of multiple feature representations.

Authors: Md Mehedi Hasan; Md Ashad Alam; Watshara Shoombuatong; Hiroyuki Kurata
Journal: J Comput Aided Mol Des Date: 2021-01-04 Impact factor: 3.686

10. Early Diagnosis of Hepatocellular Carcinoma Using Machine Learning Method.

Authors: Zi-Mei Zhang; Jiu-Xin Tan; Fang Wang; Fu-Ying Dao; Zhao-Yue Zhang; Hao Lin
Journal: Front Bioeng Biotechnol Date: 2020-03-27