
Computational prediction and interpretation of druggable proteins using a stacked ensemble-learning framework.

Phasit Charoenkwan¹, Nalini Schaduangrat², Pietro Lio'³, Mohammad Ali Moni⁴, Watshara Shoombuatong², Balachandran Manavalan⁵.

Abstract

Discovery of potential drugs requires rapid and precise identification of drug targets. Although traditional experimental methodologies can accurately identify drug targets, they are time-consuming and inappropriate for high-throughput screening. Computational approaches based on machine learning (ML) algorithms can expedite the prediction of druggable proteins; however, the performance of the existing computational methods remains unsatisfactory. This study proposes a computational tool, SPIDER, to enhance the accurate prediction of druggable proteins. SPIDER employs various feature descriptors pertaining to several aspects, including physicochemical properties, compositional information, and composition-transition-distribution information, coupled with well-known ML algorithms to facilitate the construction of the final meta-predictor. The experimental results showed that SPIDER enabled more precise and robust prediction of druggable proteins than the baseline models and current existing methods in terms of the independent test dataset. An online web server was established and made freely available online.

Entities: Chemical

Keywords: Artificial intelligence; Artificial intelligence applications; Computational chemistry; Drugs

Year: 2022 PMID: 36046193 PMCID: PMC9421381 DOI: 10.1016/j.isci.2022.104883

Source DB: PubMed Journal: iScience ISSN: 2589-0042

Introduction

A druggable protein refers to a protein that can bind to small drug-like molecules with a high affinity and produce desirable therapeutic effects (Liu and Altman, 2014). Druggable proteins are usually members of large protein families that have been successfully identified as drug targets (Owens, 2007). Failure of projects in the drug discovery field is frequently attributed to the target being undruggable, which is estimated to account for approximately 60% of all cases (Sakharkar et al., 2007). As such, the druggability of a protein is crucial for the progression of a drug discovery project, wherein the accurate identification of drug targets is necessary (Overington et al., 2006). Experimental methods require the analysis of the three-dimensional structure of a protein, which results in a long development cycle (Sakharkar et al., 2007). Traditional experimental methods can precisely identify drug targets; however, these methods are laborious and challenging for high-throughput applications. Computational methods based solely on the primary sequences of proteins can complement experimental methods to expedite the characterization and prediction of druggable proteins. Owing to the vast number of novel proteins generated via next-generation sequencing, the possibility of identifying candidate druggable proteins that have not yet been characterized is immense. Hence, the precise and quick identification of druggable proteins from a vast pool of sequenced proteins is highly desirable for the development of new drugs (Lindsay, 2005). Numerous computational tools support drug target prediction. For instance, Dezső and Ceccarelli developed a random forest (RF)-based method for selecting and prioritizing drug targets. In their study, the predictive model was trained using different feature descriptors and achieved an area under the receiver operating characteristic curve (AUC) of 0.89 on the independent test dataset (Dezső and Ceccarelli, 2020).
In addition, existing data-driven approaches can predict the drug similarity (Ma’ayan et al., 2014), drug-target interactions (Fakhraei et al., 2014; Perlman et al., 2011), and similarities between drugs and potential predicted targets (Wang et al., 2013). Detailed information on these data-driven approaches is available in the articles by Dezső and Ceccarelli (2020) and Gong et al. (2021). Several computational methods based on machine learning (ML) techniques, such as DrugMiner (Jamali et al., 2016), Sun’s method (Sun et al., 2018), GA-bagging-SVM (Lin et al., 2019), DrugHybrid_BS (Gong et al., 2021), Yu’s method (Yu et al., 2022), and XGB-DrugPred (Sikander et al., 2022), have been designed for the in silico prediction of druggable proteins based on their protein sequence information, as summarized in Table 1.

Table 1

Summary of existing methods and tools for prediction of druggable proteins

Method (Year)	Classifier (a)	Features (b)	Evaluation strategy (c)	Web server availability
DrugMiner (2016) (Jamali et al., 2016)	NN	AAC, DPC, PCP	5CV	Yes
Sun’s method (2018) (Sun et al., 2018)	NN	CTD	5CV/IND	No
GA-Bagging-SVM (2019) (Lin et al., 2019)	SVM	DPC, RS, PAAC	5CV	No
DrugHybrid_BS (2021) (Gong et al., 2021)	SVM	CC, GAAC, monoDiKGap	5CV	No
Yu’s method (2022) (Yu et al., 2022)	CNN-RNN	Dictionary, DPC, TPC, CTD	5CV/IND	No
XGB-DrugPred (2022) (Sikander et al., 2022)	XGB	GDPC, S-PseAAC, RAAA	10CV	No
SPIDER (This study)	SVM	AAC, APAAC, DPC, CTD, PAAC, RS	10CV/IND	Yes

(a) CNN-RNN: hybrid model integrating convolutional recurrent neural networks and deep neural networks, NN: neural networks, SVM: support vector machine, XGB: extreme gradient boosting.

(b) AAC: amino acid composition, APAAC: amphiphilic pseudo-amino acid composition, CC: cross covariance, CTD: composition-transition-distribution, DPC: dipeptide composition, DPS: dipeptide propensity score, GAAC: grouped amino acid composition, PCP: physicochemical properties, PAAC: pseudo amino acid composition, RS: reduced sequences, TPC: tripeptide composition.

(c) 5CV: 5-fold cross-validation test, 10CV: 10-fold cross-validation test, IND: independent test.

In 2016, Jamali et al. developed DrugMiner (Jamali et al., 2016), the first computational method in this field, based on their own dataset comprising 1,224 druggable and 1,319 non-druggable proteins. DrugMiner was created using a neural network algorithm in conjunction with various types of feature descriptors. Subsequently, Lin et al. created GA-bagging-SVM (Lin et al., 2019) by integrating various support vector machine (SVM)-based classifiers and a genetic algorithm (GA) through the bagging ensemble learning strategy. In GA-bagging-SVM, Lin et al. employed three feature encodings to represent druggable proteins: dipeptide composition (DPC), pseudo amino acid composition (PAAC), and reduced sequences (RS), which encompass the secondary structure, DHP, acidity, polarity, and charge (referred to herein as RSsecond, RSDHP, RSacid, RSpolar, and RScharge, respectively). Gong et al. (2021) then developed DrugHybrid_BS, a bagging ensemble learning model combined with monoDiKGap, cross-covariance, and grouped amino acid composition. DrugHybrid_BS provides a reasonably high predictive performance, with an accuracy (ACC) of 0.970 and an AUC of 0.992. More recently, Yu et al. (2022) created hybrid convolutional recurrent neural networks (CNN-RNNs), which utilized both dictionary and sequence encoding schemes to enhance the prediction performance. Yu et al. were the first to establish an independent test dataset, which contained 224 druggable and 237 non-druggable proteins. Their method provided an ACC of 0.898 and a Matthews correlation coefficient (MCC) of 0.799 on the independent test dataset.

All aforementioned methods have facilitated the identification of druggable proteins and promoted progress in this field. However, certain concerns still need to be addressed. First, most of the existing methods, except Yu’s method (Yu et al., 2022), were not evaluated on an independent test dataset; thus, their generalizability remains unverified. Second, there is no comprehensive analysis or evaluation of conventional feature encodings and ML algorithms for druggable proteins. Third, all existing methods are considered black-box models; as such, it is difficult to provide a straightforward interpretation of the functional mechanisms of druggable proteins. Finally, all existing methods, except DrugMiner (Jamali et al., 2016), were not deployed as web servers, which limits their use by experimental scientists.

Considering the above-mentioned limitations, we present a new computational tool, named SPIDER (Stacked PredIctor of DruggablE pRoteins), to improve the prediction accuracy of druggable proteins and to highlight the most important features contributing to druggable protein prediction (see Figure 1). The significance and major advantages of SPIDER are as follows: (i) SPIDER represents the first stacked ensemble learning approach proposed for druggable protein prediction. Specifically, SPIDER was trained and constructed by integrating m = 10 selected ML classifiers to facilitate the stable and accurate prediction of druggable proteins; (ii) we comprehensively investigated and assessed the predictive ability of various types of feature encodings coupled with popular ML algorithms in the prediction of druggable proteins. SPIDER outperformed several ML classifiers and existing methods for this prediction problem on an independent test dataset; (iii) we employed the interpretable Shapley Additive exPlanations (SHAP) method to shed light on the impact of each feature on the output of SPIDER. Finally, to facilitate community-wide efforts in the prediction of druggable proteins, an online web server based on SPIDER was created and is accessible at http://pmlabstack.pythonanywhere.com/SPIDER.

Figure 1

Schematic flowchart of the development of the SPIDER

There are four major steps, including dataset construction, feature engineering, new feature generation, and meta-predictor development.


Results and discussion

Performance of different feature-encoding schemes and ML algorithms

In this study, we comprehensively analyzed and assessed the predictive ability of various baseline models trained with ten feature encodings and six ML algorithms to distinguish druggable proteins from non-druggable ones. Comprehensive information regarding the feature encodings and ML algorithms is presented in Tables 2 and S1. Both 10-fold cross-validation and independent tests were performed, on the training and independent test datasets, respectively, to assess the performance of each baseline model, as summarized in Figure 2 and Tables S2–S5. As described in the SPIDER framework section, the baseline model with the highest MCC on the training dataset is regarded as the best-performing model.
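The 10-fold cross-validation protocol mentioned above can be sketched as follows. This is a minimal, unstratified illustration using only the Python standard library; the function name `k_fold_indices` and the fixed seed are assumptions made for this example, and the study itself relied on Scikit-learn rather than on this code.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Spread samples as evenly as possible: the first n % k folds get one extra.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Training set size from the text: 1,224 druggable + 1,319 non-druggable proteins.
N_TRAIN = 1224 + 1319
folds = list(k_fold_indices(N_TRAIN, k=10))
```

Each protein appears in exactly one test fold, so every baseline model is scored on data it was not fitted to within that fold. A stratified split that preserves the class ratio in each fold would be a natural refinement.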

Table 2

Summary of ten different sequence-based feature descriptors along with their corresponding description and dimension

Order	Descriptors	Description	Dimension	Reference
1	AAC	Frequency of 20 amino acids	20	(Charoenkwan et al., 2021b, 2022)
2	APAAC	Amphiphilic pseudo-amino acid composition	22	(Chou, 2001, 2005)
3	CTD	Composition, transition, and distribution	273	(Li et al., 2006)
4	DPC	Frequency of 400 dipeptides	400	(Chen et al., 2016; Lin and Chen, 2011)
5	PAAC	Pseudo amino acid composition	21	(Chou, 2001, 2005)
6	RSacid	Reduced amino acid sequences according to acidity	32	(Xu et al., 2017)
7	RScharge	Reduced amino acid sequences according to charge	50	(Xu et al., 2017)
8	RSDHP	Reduced amino acid sequences according to DHP	32	(Xu et al., 2017)
9	RSpolar	Reduced amino acid sequences according to polarity	32	(Xu et al., 2017)
10	RSsecond	Reduced amino acid sequences according to secondary structure	40	(Xu et al., 2017)

Figure 2

Performance evaluations of top 30 baseline models

(A and B) Cross-validation ACC and MCC of top 30 baseline models.

(C and D) Independent test ACC and MCC of top 30 baseline models.

To analyze the overall effect of each feature encoding on the prediction of druggable proteins, we computed the average cross-validation performance for each feature encoding over the six ML algorithms. Among the ten feature encodings, the five with the highest performance were RSpolar, RSDHP, RSsecond, RSacid, and RScharge, with average MCC values of 0.727, 0.721, 0.716, 0.713, and 0.712, respectively (Table S4). Similarly, the three ML algorithms with the highest performance were SVM, RF, and extremely randomized trees (ET), with corresponding average MCC values of 0.762, 0.747, and 0.743, respectively (Table S5). Interestingly, the SVM, RF, and ET models developed using RSpolar, RScharge, and RSpolar achieved the highest MCC values of 0.796, 0.778, and 0.769, respectively. This observation indicates that the RS feature group could be more beneficial for druggable protein prediction. Among the 60 baseline models, the highest MCC of 0.796 was achieved by the SVM-RSpolar model, while the second and third highest MCC values of 0.792 and 0.780 were achieved by the SVM-RSDHP and SVM-RSsecond models, respectively. This indicated that the SVM-RSpolar model could be considered the most efficient baseline model for the prediction of druggable proteins. On the independent test dataset, this model provided an MCC of 0.770, an ACC of 0.883, and an AUC of 0.936. In contrast, the highest independent test MCC of 0.808 was obtained using the PLS-AAC model. Overall, our comprehensive analysis suggests that single-feature-based models might fail in terms of generalizability and stability in this prediction problem. As such, we applied a stacked ensemble learning methodology to generate a model with the strongest stability and generalization ability in terms of both cross-validation and independent tests.

Construction of SPIDER

Next, we developed an ensemble model that integrates several ML classifiers using the stacking approach. To this end, we employed both 60-D and m-D feature vectors to develop mSVM predictors. As described in the SPIDER framework section, the GA in combination with the self-assessment-report (GA-SAR) approach was employed to optimize the 60-D feature vector by selecting m informative probabilistic features (PFs). After applying the GA-SAR approach, the experimental results indicated that the best number of informative PFs was m = 10. Specifically, m = 10 informative PFs were derived from ten baseline models, namely SVM-AAC, LR-DPC, ET-CTD, RF-PAAC, LR-APAAC, SVM-RSacid, SVM-RSpolar, LR-RSsecond, PLS-RScharge, and ET-RSDHP. Table 3 provides information pertaining to the performance evaluation of the two new feature vectors. As shown in Table 3, the 10-D feature vector (referred to herein as the optimal feature vector) was found to provide an enhancement, as judged by ACC, MCC, sensitivity (Sn), and specificity (Sp), not only in the training dataset but also in the independent dataset. Remarkably, the ACC, MCC, and Sn of the optimal feature vector in the independent test dataset were 0.907, 0.816, and 0.857, respectively, which were 1.74%, 3.37%, and 2.68% higher than the compared feature vectors, respectively. For convenience of discussion, we denote the mSVM predictor trained with the optimal feature vector as SPIDER. We also compared the performance of SPIDER with that of a popular ensemble approach (voting strategy) in the independent test dataset. As shown in Table S6, the ACC, MCC, Sn, and Sp of SPIDER outperformed the voting strategy by 1.52%, 2.97%, 2.23%, and 0.84%, respectively.
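For reference, the four threshold-based metrics used throughout this section can be computed from confusion-matrix counts as sketched below. The counts passed to the function are an assumption: they were back-calculated here from the reported Sn (0.857) and Sp (0.954) and the size of the independent test dataset (224 druggable and 237 non-druggable proteins), and are not stated in the article.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Return ACC, Sn, Sp, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn)   # sensitivity: recall on druggable proteins
    sp = tn / (tn + fp)   # specificity: recall on non-druggable proteins
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}

# ASSUMPTION: counts inferred from the reported Sn and Sp, not given in the paper.
metrics = binary_metrics(tp=192, fn=32, tn=226, fp=11)
rounded = {name: round(value, 3) for name, value in metrics.items()}
```

With these assumed counts, the rounded ACC and MCC agree with the values reported for the optimal feature vector on the independent test dataset (0.907 and 0.816), which suggests that the reported metrics are mutually consistent.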

Table 3

Cross-validation results for the control and optimal model

Evaluation strategy	Model	Number of features	ACC	Sn	Sp	MCC	AUC
Cross-validation	Control	60	0.909	0.888	0.929	0.819	0.955
Cross-validation	Optimal	10	0.919	0.895	0.942	0.839	0.950
Independent test	Control	60	0.889	0.830	0.945	0.783	0.934
Independent test	Optimal	10	0.907	0.857	0.954	0.816	0.902


Performance comparison between SPIDER and single-feature-based models

To elucidate the advantage of the stacking approach, we compared the performance of SPIDER with that of single-feature-based models. Figure 2 shows the five top-ranked baseline models, as indicated by MCC—SVM-RSpolar, SVM-RSDHP, SVM-RSsecond, SVM-AAC, and RF-RScharge—with corresponding MCC values of 0.796, 0.792, 0.780, 0.779, and 0.778, respectively. Thus, the performance of SPIDER was evaluated against the top five baseline models. The performance results are summarized in Figure 3 and Table 4. As summarized in Table 4, SPIDER clearly outperformed the top five baseline models in the training dataset and the independent dataset in terms of most of the performance metrics, with the exception of AUC. Specifically, SPIDER achieved enhanced performance in comparison to the best-performing baseline model (SVM-RSpolar) in the independent dataset in terms of ACC (0.907 vs. 0.883), Sn (0.857 vs. 0.821), Sp (0.954 vs. 0.941), and MCC (0.816 vs. 0.770). The above-mentioned results demonstrate that SPIDER achieved improved performance and stability compared with several single-feature-based models in both the training and independent test datasets.

Figure 3

Predictive performance of various models

Performance comparison of SPIDER with the top five baseline models on the training (A and B) and independent test (C and D) datasets. Prediction results of SPIDER and the top five baseline models in terms of ACC, Sn, Sp, and MCC (A and C). ROC curves and AUC values of SPIDER and the top five baseline models (B and D).

Table 4

Performance comparison of SPIDER and top five baseline models on the training and independent test datasets

Evaluation strategy	Method	ACC	Sn	Sp	MCC	AUC
Cross-validation	SPIDER	0.919	0.895	0.942	0.839	0.950
Cross-validation	SVM-RSpolar	0.898	0.885	0.911	0.796	0.960
Cross-validation	SVM-RSDHP	0.896	0.884	0.908	0.792	0.960
Cross-validation	SVM-RSsecond	0.890	0.885	0.895	0.780	0.957
Cross-validation	SVM-AAC	0.890	0.882	0.897	0.779	0.956
Cross-validation	RF-RScharge	0.889	0.862	0.914	0.778	0.943
Independent test	SPIDER	0.907	0.857	0.954	0.816	0.902
Independent test	SVM-RSpolar	0.883	0.821	0.941	0.770	0.936
Independent test	SVM-RSDHP	0.879	0.821	0.932	0.760	0.937
Independent test	SVM-RSsecond	0.889	0.844	0.932	0.781	0.944
Independent test	SVM-AAC	0.889	0.835	0.941	0.782	0.944
Independent test	RF-RScharge	0.868	0.786	0.945	0.743	0.928

To further reveal the improved performance of SPIDER, the distribution of the 2D feature space from the top three informative feature descriptors (RSpolar, RSDHP, and RSsecond), all features, the 60-D feature vector, and the optimal feature vector was visualized using the t-distributed stochastic neighbor embedding (t-SNE) (Van Der Maaten, 2014; Van der Maaten and Hinton, 2008) framework, wherein the red and green dots indicate druggable and non-druggable proteins, respectively (Figure 4). As shown in Figures 4A–4D, the red and green dots in the four t-SNE plots were mixed together, indicating that these feature descriptors have limited discriminative power for identifying druggable proteins. However, we noticed that the 60-D and optimal feature vectors showed a sharp distinction between the distribution of red and green dots (Figures 4E and 4F). Altogether, the stacking strategy used in SPIDER seems to be an effective and useful approach for improving prediction performance and model generalizability.

Figure 4

t-distributed stochastic neighbor embedding (t-SNE) distribution of positive and negative samples on the training dataset

(A) RSpolar, (B) RSDHP, (C) RSsecond, (D) All features, (E) 60-D feature vector, and (F) Optimal feature vector.


Performance comparison between SPIDER and state-of-the-art methods

Here, the performance of SPIDER was compared with that of state-of-the-art methods. Table 1 provides details of various ML-based methods designed to determine the druggability of proteins from sequence information, namely DrugMiner (Jamali et al., 2016), Sun’s method (Sun et al., 2018), GA-Bagging-SVM (Lin et al., 2019), DrugHybrid_BS (Gong et al., 2021), Yu’s method (Yu et al., 2022), and XGB-DrugPred (Sikander et al., 2022). Among these existing methods, Yu’s method (Yu et al., 2022) was the only one that was evaluated on both the training (1,224 druggable and 1,319 non-druggable proteins) and independent test (224 druggable and 237 non-druggable proteins) datasets. Therefore, we compared the performance of SPIDER with that of Yu’s method. The results of the comparison are listed in Table 5. As can be observed, SPIDER attained higher ACC, MCC, Sn, and F-score values on the training dataset, which were 1.94%, 3.92%, 0.53%, and 1.80% higher than those obtained using Yu’s method, respectively. Furthermore, SPIDER also performed better on the independent test dataset, achieving an ACC of 0.907, Sn of 0.857, MCC of 0.816, and F-score of 0.899. Taken together, these results demonstrate that SPIDER is an accurate prediction model with good generalization ability compared with the available methods.

Table 5

Performance comparison of SPIDER and the state-of-the-art method

Evaluation strategy	Method	ACC	Sn	MCC	F-score	PRE
Cross-validation	Yu’s method (a)	0.900	0.890	0.800	0.896	0.905
Cross-validation	SPIDER	0.919	0.895	0.839	0.914	0.895
Independent test	Yu’s method (a)	0.898	0.848	0.799	0.889	0.936
Independent test	SPIDER	0.907	0.857	0.816	0.899	0.857

(a) Results are taken from the report of Yu’s method (Yu et al., 2022).


Mechanistic interpretation of SPIDER

As mentioned above, we applied the GA-SAR algorithm to select m important features to generate the optimal feature vector. However, the relationship between these features remains unknown. To address this problem, we used the SHAP framework not only to assess the value of each feature but also to shed light on the output of the model, which plays a crucial role in many bioinformatic applications (Li et al., 2021a, 2021b). As previously stated, SPIDER was constructed using a combination of ten selected baseline models—SVM-AAC, LR-DPC, ET-CTD, RF-PAAC, LR-APAAC, SVM-RSacid, SVM-RSpolar, LR-RSsecond, PLS-RScharge, and ET-RSDHP. SHAP positive and negative values indicate the prediction of druggable and non-druggable proteins, respectively. As illustrated in Figure 5, the top five informative features with the highest SHAP values were SVM-RSpolar, LR-DPC, LR-RSsecond, SVM-AAC, and PLS-RScharge. Interestingly, most of the top five informative features (except PLS-RScharge) had positive SHAP values. Taking SVM-RSpolar as an example, for a given uncharacterized protein sequence P, if the value of SVM-RSpolar is very high, then P is predicted as a druggable protein; otherwise, P is predicted as a non-druggable protein.
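To make the notion of a signed, per-feature contribution concrete, the sketch below computes exact Shapley values by enumerating feature coalitions. It is not the SHAP library and not the analysis performed in this study: the linear `toy_model`, its weights, the input `x`, and the `background` reference point are all hypothetical, and exhaustive enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, background):
    """Exact Shapley values of model(x) relative to model(background).

    Features outside a coalition are replaced by their background value.
    The cost grows as 2^n, so this only suits a few features.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else background[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# HYPOTHETICAL meta-model over three probabilistic features (made-up weights).
WEIGHTS = [2.0, 1.0, -0.5]

def toy_model(pf):
    return sum(w * v for w, v in zip(WEIGHTS, pf))

x = [0.9, 0.8, 0.3]            # PFs of one hypothetical protein
background = [0.5, 0.5, 0.5]   # hypothetical reference point
phi = shapley_values(toy_model, x, background)
```

The contributions sum to the difference between the model output for `x` and for the reference point, which is the additivity property that SHAP-style explanations rely on. A positive value pushes the output toward the druggable class, and a negative value pushes it away.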

Figure 5

Ten important features of SPIDER ranked by SHAP values

SHAP values represent the directionality of the ten important features, where positive and negative SHAP values represent druggable protein and non-druggable protein predictions, respectively.

To further reveal the influence of the optimal feature vector on the functioning of SPIDER, the performance of SPIDER was compared with that of a model lacking the optimal feature vector. Detailed comparison results of the two models on the training and independent test datasets are presented in Figure 6 and Table S7. The comparison clearly indicated that SPIDER achieved an overall better performance than the compared model with regard to all performance metrics, with the exception of AUC. Specifically, the ACC, Sn, Sp, and MCC of SPIDER on the independent test dataset were 2.60%, 4.02%, 1.27%, and 5.05% higher, respectively, than those of the model lacking the optimal feature vector. These results demonstrate that the m = 10 selected informative PFs are vital in capturing the key information pertaining to druggable proteins, thus contributing to the improvement in performance.

Figure 6

Predictive performance of various models

Performance comparison of SPIDER with the model without the optimal feature vector, as assessed by 10-fold cross-validation (A) and independent test (B).


Utilization of the SPIDER webserver

For experimental researchers, a publicly accessible web server is more practical for screening samples of interest than developing an in-house prediction model. Therefore, we developed an online webserver for SPIDER, which is freely available at http://pmlabstack.pythonanywhere.com/SPIDER, to aid the broader research community in the identification of druggable protein candidates from large-scale protein datasets. In addition, we have provided stepwise instructions on the usage of the SPIDER webserver, which can be accessed using the “About” tab of the webserver.

Conclusion

This study presents SPIDER, an innovative stacked ensemble learning framework established for the precise prediction of druggable proteins. In SPIDER, we utilized ten distinctive feature descriptors based on various features, including physicochemical properties, composition-transition-distribution information, and compositional information. These feature descriptors, in conjunction with popular ML algorithms, have been used to develop numerous baseline models. Ultimately, m = 10 selected baseline models derived from the GA-SAR method were integrated to create the final meta-predictor in this study. Comparative experimental results showed that SPIDER was more efficient for druggable protein predictions compared to its baseline models in terms of cross-validation and independent tests. Moreover, SPIDER achieved better performance than the existing method, Yu’s method, with an ACC of 0.907, Sn of 0.857, and MCC of 0.816, in terms of the independent test dataset. In addition, the SHAP algorithm was applied to determine the impact of each baseline model on the output provided by SPIDER. Finally, to aid highly efficient prediction of druggable proteins, we created an accessible webserver based on SPIDER that is readily available at http://pmlabstack.pythonanywhere.com/SPIDER. We believe that SPIDER will be a useful tool for the screening and identification of potential druggable proteins and to expedite their application in the drug discovery and development process.

Limitations of the study

Overall, the computational tool proposed in this study could enable a more precise and robust prediction of druggable proteins than the existing methods. In addition, we employed the SHAP approach to elucidate the effect of each feature on the prediction of druggable proteins. However, there is still ample room for improving the prediction performance. Recently, several computational frameworks have been developed and reported, such as a flexible deep learning (DL)-based approach (Liang et al., 2022), DL-based hybrid approaches (Hasan et al., 2022; Xie et al., 2021), and multilayer ensemble learning frameworks (Shoombuatong et al., 2022). Given the effectiveness of these frameworks, we plan to integrate appropriate computational methodologies in the future to further enhance the prediction performance for druggable proteins.

STAR★Methods

Key resources table

Resource availability

Lead contact

Further information regarding the methods and the dataset should be directed to and will be fulfilled by the lead contact, Professor Balachandran Manavalan (bala2022@skku.edu).

Materials availability

This study did not generate new reagents.

Method details

Benchmark datasets

We used the same training dataset derived from a study by Jamali et al. (2016) to generate and optimize our proposed models. The dataset comprised 1,224 druggable and 1,319 non-druggable proteins, representing positive and negative samples, respectively. Specifically, compilation of the positive samples was performed using the DrugBank database (Law et al., 2014), while Swiss-Prot was employed for the negative samples using the methods described by Li et al. (Li and Lai, 2007) and Bakheet et al. (Bakheet and Doig, 2009). Yu et al. (2022) recently established the first independent test dataset for this prediction problem. This independent test dataset contained 224 druggable and 237 non-druggable proteins. Additional details regarding the construction of the training and independent test datasets can be found in the articles by Jamali et al. (2016) and Yu et al. (2022), respectively. These datasets are available at http://pmlabstack.pythonanywhere.com/dataset_SPIDER.

Feature engineering

To obtain key information on druggable proteins, we utilized ten different feature-encoding schemes based on sequence information, namely PAAC, DPC, RSsecond, RSDHP, RSacid, RSpolar, RScharge, amino acid composition (AAC), amphiphilic pseudo-amino acid composition (APAAC), and composition-transition-distribution (CTD), indicating different aspects, including physicochemical properties, composition-transition-distribution information, and compositional information. The ten sequence-based feature encodings are sorted into three key groups as follows: (i) the first group consists of CTD-based features (Charoenkwan et al., 2021b, 2022); (ii) the second group consists of composition-based features (AAC and DPC (Rao et al., 2020; Wei et al., 2018)); and (iii) the third group consists of physicochemical property-based features (APAAC, PAAC, RSsecond, RSDHP, RSacid, RSpolar, and RScharge (Lin et al., 2019; Xu et al., 2017)). In this study, the aforementioned feature encodings were retrieved using the iFeature package (Chen et al., 2018). Comprehensive information regarding the feature encoding is presented in Table 2.
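As an illustration of the two composition-based encodings, the sketch below computes AAC (20 residue frequencies) and DPC (400 dipeptide frequencies) for a single sequence. It follows the descriptions in Table 2 but is not the iFeature implementation used in the study. The function names and the toy sequence are assumptions, non-standard residues are simply skipped, and sequences with fewer than two standard residues are not handled.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(sequence):
    """Amino acid composition: frequency of each of the 20 residues."""
    seq = [r for r in sequence.upper() if r in AMINO_ACIDS]
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def dpc(sequence):
    """Dipeptide composition: frequency of each of the 400 ordered pairs."""
    seq = "".join(r for r in sequence.upper() if r in AMINO_ACIDS)
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for pair in pairs:
        counts[pair] += 1
    return [counts[key] / len(pairs) for key in counts]

# HYPOTHETICAL toy sequence, not taken from the benchmark datasets.
toy = "MKTAYIAKQR"
aac_vec, dpc_vec = aac(toy), dpc(toy)
```

Both vectors have a fixed length regardless of sequence length, which is what allows proteins of different sizes to be fed to the same classifier.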

Feature selection based on GA-SAR

To enhance the predictive capability of the proposed models, we employed the GA-SAR approach to determine the model parameters and optimize the selection of informative features. This method was initially introduced by our group for the prediction of quorum-sensing peptides (Charoenkwan et al., 2019). This feature selection method has been successfully used in several bioinformatic applications (Charoenkwan et al., 2020, 2021a, 2021c, 2022). In brief, GA-SAR creates a profile that is employed to assess the importance of a feature. Note that the most important feature shows the highest correlation between the feature and output variable (Charoenkwan et al., 2019; Ho et al., 2004). In the GA-SAR algorithm, the chromosome contains binary and parametric genes, which are created for two main purposes: feature selection and ML parameter optimization. For ease of discussion, Gene and Chrom were used to represent the genes and chromosomes, respectively. Features with increased frequency are deemed significant for the prediction of druggable proteins. The implementation of GA-SAR algorithm to identify important features involves the following steps: (i) Randomly create 50 Chroms containing assigned values of binary Genes as means to get the number of features (m) equal to the selected number; (ii) Evaluate the performance for each Chrom in terms of the 10-fold cross-validation test; (iii) Construct a mating pool by performing a tournament selection; (iv) Perform a 20-point crossover on the selected parents; (v) Identify the optimal feature set by employing the SAR mutation operator; and (vi) Stop if the termination condition is reached; otherwise go to Step (ii). Additional information regarding the GA-SAR approach is available in the article by Charoenkwan et al. (2019).
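The overall loop of steps (i) to (vi) can be sketched with a generic genetic algorithm, shown below. This is not GA-SAR itself. The SAR mutation operator and the multi-point crossover are replaced here by a simple swap mutation and a union-based crossover, both chosen so that every chromosome keeps exactly m features. The fitness function is a hypothetical stand-in for the 10-fold cross-validation performance, and the mutation rate, tournament size, and number of generations are arbitrary.

```python
import random

N_FEATURES = 60   # one probabilistic feature per baseline model
M_SELECTED = 10   # number of features each chromosome keeps
POP_SIZE = 50     # population size stated in the text

def to_bits(selected):
    """Binary-gene view of a chromosome: 1 = feature kept, 0 = dropped."""
    return [1 if i in selected else 0 for i in range(N_FEATURES)]

def tournament(pop, fitness, rng, k=2):
    """Return the fittest of k randomly drawn chromosomes."""
    return max(rng.sample(pop, k), key=fitness)

def crossover(a, b, rng):
    """Child draws its m features from the union of both parents."""
    return frozenset(rng.sample(sorted(a | b), M_SELECTED))

def mutate(chrom, rng, rate=0.2):
    """Occasionally swap one kept feature for one dropped feature."""
    if rng.random() >= rate:
        return chrom
    removed = rng.choice(sorted(chrom))
    added = rng.choice([i for i in range(N_FEATURES) if i not in chrom])
    return frozenset((chrom - {removed}) | {added})

def run_ga(fitness, generations=40, seed=0):
    rng = random.Random(seed)
    pop = [frozenset(rng.sample(range(N_FEATURES), M_SELECTED))
           for _ in range(POP_SIZE)]
    best = max(pop, key=fitness)
    for _ in range(generations):  # fixed budget as the termination condition
        pop = [mutate(crossover(tournament(pop, fitness, rng),
                                tournament(pop, fitness, rng), rng), rng)
               for _ in range(POP_SIZE)]
        best = max(pop + [best], key=fitness)  # keep the best seen so far
    return best

# HYPOTHETICAL fitness: overlap with a made-up "informative" subset.
INFORMATIVE = frozenset(range(0, 60, 6))  # 10 arbitrary indices
best = run_ga(lambda chrom: len(chrom & INFORMATIVE))
bits = to_bits(best)
```

Representing a chromosome as a set of selected indices, rather than as a raw bit string, is a design choice made only for this sketch: it guarantees that the number of selected features stays equal to m without a separate repair step.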

SPIDER framework

In this study, we employed the stacking approach to establish SPIDER and improve the prediction of druggable proteins. Stacking is an ensemble-learning technique that combines the individual strengths of various ML classifiers into a single stable model (Cao et al., 2018; Fu et al., 2020; Mishra et al., 2019; Wolpert, 1992). To date, various stacking-based computational approaches have achieved improved performance compared with their baseline models (Charoenkwan et al., 2021b, 2021c; Li et al., 2021a, 2021b; Qiang et al., 2020; Xie et al., 2021). The construction of SPIDER comprises three major steps, as summarized in Figure 1. Briefly, several baseline models were first created and then used to generate PFs; finally, informative PFs were selected and employed in meta-predictor construction. Further details of the SPIDER framework are provided in the following paragraphs.

First, we created 60 baseline models using six different ML algorithms (SVM, RF, logistic regression (LR), k-nearest neighbor (KNN), ET, and partial least squares (PLS)) in conjunction with ten widely used feature encodings (CTD, AAC, DPC, APAAC, PAAC, RSsecond, RSDHP, RSacid, RSpolar, and RScharge). We then systematically assessed the performance of these six ML algorithms and ten feature encodings in the prediction of druggable proteins using the training and independent test datasets. Notably, the baseline model yielding the highest MCC on the training dataset was deemed the best-performing model. The Scikit-learn v0.24.1 package (Pedregosa et al., 2011) was used for the development and optimization of all baseline models, and the parameter search ranges are documented in Table S1.

Second, we used the probabilistic output of each baseline model, that is, the predicted confidence that a given protein is druggable. This predicted confidence was treated as a PF, with values ranging from 0 to 1. As a result, for a given protein sequence P, we obtained 60 PFs generated by all 60 baseline models, which can be defined as follows:

$$\mathrm{PF}(P) = \left[\mathrm{PF}_{1,1}, \mathrm{PF}_{1,2}, \ldots, \mathrm{PF}_{i,j}, \ldots, \mathrm{PF}_{6,10}\right]$$

where $\mathrm{PF}_{i,j}$ represents the PF generated by the baseline model trained using the $i$-th ML algorithm ($i = 1, \ldots, 6$) coupled with the $j$-th feature descriptor ($j = 1, \ldots, 10$). P is thereby converted into a 60-dimensional (60-D) feature vector.

Third, we used the 60-D feature vector to construct the meta-predictor based on the SVM algorithm (mSVM) using the Scikit-learn v0.24.1 package. To improve the performance of the mSVM predictor, we employed the GA-SAR approach (Charoenkwan et al., 2019). This allowed us to determine m informative PFs from the 60 PFs, where m ranges from 5 to 20. Herein, each Chrom consisted of n = 60 binary genes ($bg_i$) to select m informative PFs (m < n) and 3-bit genes to optimize the parameters of the mSVM predictor (Table S1). If $bg_i = 1$, the $i$-th PF (here $i = 1, \ldots, 60$) is used for the construction of the mSVM predictor; otherwise, it is omitted from the optimal feature vector. Finally, the feature vector that achieves the highest cross-validation MCC is deemed optimal and is used for the final meta-predictor construction.
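The chromosome layout described above (60 binary genes followed by 3-bit parameter genes) can be illustrated with a short decoding sketch. The parameter names `C` and `gamma` and their candidate values are hypothetical placeholders, since the actual search ranges are given only in Table S1; the helper names and the example chromosome are likewise assumptions made for illustration.

```python
ALGORITHMS = ["SVM", "RF", "LR", "KNN", "ET", "PLS"]
ENCODINGS = ["CTD", "AAC", "DPC", "APAAC", "PAAC",
             "RSsecond", "RSDHP", "RSacid", "RSpolar", "RScharge"]

# One PF per (algorithm, encoding) pair gives the 60-D vector layout.
PF_NAMES = [f"{alg}-{enc}" for alg in ALGORITHMS for enc in ENCODINGS]

# Hypothetical grids: a 3-bit gene can index 2**3 = 8 candidate values.
PARAM_GRIDS = {
    "C": [2 ** k for k in range(-2, 6)],
    "gamma": [2 ** k for k in range(-6, 2)],
}


def decode_chrom(chrom, n_pfs=60, bits_per_param=3, grids=PARAM_GRIDS):
    """Split a chromosome into selected PF indices and decoded parameters."""
    selected = [i for i, bg in enumerate(chrom[:n_pfs]) if bg == 1]
    params, pos = {}, n_pfs
    for name, grid in grids.items():
        bits = chrom[pos:pos + bits_per_param]
        index = int("".join(str(b) for b in bits), 2)  # binary to integer
        params[name] = grid[index]
        pos += bits_per_param
    return selected, params


def select_pfs(pf_vector, selected):
    """Keep only the PFs whose binary gene equals 1."""
    return [pf_vector[i] for i in selected]


# Example chromosome: five PFs switched on, then two 3-bit parameter genes.
mask = [0] * 60
for i in (0, 5, 12, 30, 59):
    mask[i] = 1
chrom = mask + [0, 1, 1] + [1, 0, 0]

selected, params = decode_chrom(chrom)
selected_names = [PF_NAMES[i] for i in selected]
```

Here the reduced vector returned by `select_pfs` would be the input to the meta-predictor, and `params` would configure it.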

Performance evaluation

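Seven widely used performance metrics, MCC, ACC, AUC, F-value, precision (PRE), Sn, and Sp, were applied to the two-class prediction problem (Azadpour et al., 2014; Charoenkwan et al., 2021b). The metrics derived from the confusion matrix are defined as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Sn} = \frac{TP}{TP + FN}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP}$$

$$\mathrm{PRE} = \frac{TP}{TP + FP}$$

$$\text{F-value} = \frac{2 \times \mathrm{PRE} \times \mathrm{Sn}}{\mathrm{PRE} + \mathrm{Sn}}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

Specifically, TP and TN represent the numbers of correctly predicted druggable and non-druggable proteins, respectively. FP is the number of non-druggable proteins predicted to be druggable, and FN is the number of druggable proteins predicted to be non-druggable (Dao et al., 2021a, 2021b; Lv et al., 2021a, 2021b; Wang et al., 2021).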
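As an illustration, the confusion-matrix-based metrics named above (ACC, Sn, Sp, PRE, F-value, and MCC) can be computed from the four counts using their standard definitions. This is a minimal sketch; the function name `binary_metrics` and the example counts are hypothetical and do not come from the study. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, tn, fp, fn):
    """Standard two-class metrics computed from confusion-matrix counts."""
    sn = safe_div(tp, tp + fn)    # sensitivity (recall)
    sp = safe_div(tn, tn + fp)    # specificity
    pre = safe_div(tp, tp + fp)   # precision
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": safe_div(tp + tn, tp + tn + fp + fn),
        "Sn": sn,
        "Sp": sp,
        "PRE": pre,
        "F-value": safe_div(2 * pre * sn, pre + sn),
        "MCC": safe_div(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts for a 100-protein test set.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

Returning 0.0 for an undefined ratio is a design choice of this sketch; some libraries instead raise a warning or return NaN in that case.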

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Software and algorithms

iFeature	(Chen et al., 2018)	https://github.com/Superzchen/iFeature/
Reduced Sequences	(Lin et al., 2019)	https://github.com/QUST-AIBBDRC/GA-Bagging-SVM
Python package Scikit-learn v0.24.1	(Pedregosa et al., 2011)	https://scikit-learn.org/stable/
SPIDER	This paper	https://github.com/plenoi/SPIDER

48 in total

1. Prediction of thermophilic proteins using feature selection technique.

Authors: Hao Lin; Wei Chen
Journal: J Microbiol Methods Date: 2010-10-31 Impact factor: 2.363

Review 2. How many drug targets are there?

Authors: John P Overington; Bissan Al-Lazikani; Andrew L Hopkins
Journal: Nat Rev Drug Discov Date: 2006-12 Impact factor: 84.694

3. Properties and identification of human protein drug targets.

Authors: Tala M Bakheet; Andrew J Doig
Journal: Bioinformatics Date: 2009-01-21 Impact factor: 6.937

4. Combining drug and gene similarity measures for drug-target elucidation.

Authors: Liat Perlman; Assaf Gottlieb; Nir Atias; Eytan Ruppin; Roded Sharan
Journal: J Comput Biol Date: 2011-02 Impact factor: 1.479

5. iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences.

Authors: Zhen Chen; Pei Zhao; Fuyi Li; André Leier; Tatiana T Marquez-Lago; Yanan Wang; Geoffrey I Webb; A Ian Smith; Roger J Daly; Kuo-Chen Chou; Jiangning Song
Journal: Bioinformatics Date: 2018-07-15 Impact factor: 6.937

6. StackCPPred: a stacking and pairwise energy content-based prediction of cell-penetrating peptides and their uptake efficiency.

Authors: Xiangzheng Fu; Lijun Cai; Xiangxiang Zeng; Quan Zou
Journal: Bioinformatics Date: 2020-05-01 Impact factor: 6.937

7. Lean Big Data integration in systems biology and systems pharmacology.

Authors: Avi Ma'ayan; Andrew D Rouillard; Neil R Clark; Zichen Wang; Qiaonan Duan; Yan Kou
Journal: Trends Pharmacol Sci Date: 2014-08-07 Impact factor: 14.819

8. PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence.

Authors: Z R Li; H H Lin; L Y Han; L Jiang; X Chen; Y Z Chen
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971

9. DrugBank 4.0: shedding new light on drug metabolism.

Authors: Vivian Law; Craig Knox; Yannick Djoumbou; Tim Jewison; An Chi Guo; Yifeng Liu; Adam Maciejewski; David Arndt; Michael Wilson; Vanessa Neveu; Alexandra Tang; Geraldine Gabriel; Carol Ly; Sakina Adamjee; Zerihun T Dame; Beomsoo Han; You Zhou; David S Wishart
Journal: Nucleic Acids Res Date: 2013-11-06 Impact factor: 16.971

10. Porpoise: a new approach for accurate prediction of RNA pseudouridine sites.

Authors: Fuyi Li; Xudong Guo; Peipei Jin; Jinxiang Chen; Dongxu Xiang; Jiangning Song; Lachlan J M Coin
Journal: Brief Bioinform Date: 2021-11-05 Impact factor: 13.994