Literature DB >> 25189131

Enhancing protein-vitamin binding residues prediction by multiple heterogeneous subspace SVMs ensemble.

Dong-Jun Yu1, Jun Hu, Hui Yan, Xi-Bei Yang, Jing-Yu Yang, Hong-Bin Shen.   

Abstract

BACKGROUND: Vitamins are typical ligands that play critical roles in various metabolic processes. The accurate identification of the vitamin-binding residues solely based on a protein sequence is of significant importance for the functional annotation of proteins, especially in the post-genomic era, when large volumes of protein sequences are accumulating quickly without being functionally annotated.
RESULTS: In this paper, a new predictor called TargetVita is designed and implemented for predicting protein-vitamin binding residues using protein sequences. In TargetVita, features derived from the position-specific scoring matrix (PSSM), predicted protein secondary structure, and vitamin binding propensity are combined to form the original feature space; then, several feature subspaces are selected by performing different feature selection methods. Finally, based on the selected feature subspaces, heterogeneous SVMs are trained and then ensembled for performing prediction.
CONCLUSIONS: The experimental results obtained with four separate vitamin-binding benchmark datasets demonstrate that the proposed TargetVita is superior to the state-of-the-art vitamin-specific predictor, and an average improvement of 10% in terms of the Matthews correlation coefficient (MCC) was achieved over independent validation tests. The TargetVita web server and the datasets used are freely available for academic use at http://csbio.njust.edu.cn/bioinf/TargetVita or http://www.csbio.sjtu.edu.cn/bioinf/TargetVita.

Entities:  

Mesh:

Substances:

Year:  2014        PMID: 25189131      PMCID: PMC4261549          DOI: 10.1186/1471-2105-15-297

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

Functional positions in a protein are residues that play more critical roles than other residues and enable the protein to perform specific biological functions, such as capturing drugs [1], binding ligands [2], and interacting with other proteins [3]. However, functionally annotated proteins still account for only a small portion of sequenced proteins, and the gap between annotated and sequenced proteins is ever-increasing with the rapid development of advanced sequencing technology and concerted genome projects [4]. Automated computational methods for the prediction of protein functional positions are urgently needed and have become a hotspot in bioinformatics research. During the past decades, machine-learning-based computational methods have been extensively applied to various protein functional position prediction problems [5-7]. Protein-ligand interaction is one of the most important protein functions and plays vital roles in virtually all biological processes [2, 8, 9]. Considerable effort has been made to design effective methods for protein-ligand binding residue (site) prediction, and much progress has been made in this area [10]. In the early stage, general-purpose protein-ligand predictors, which predict ligand binding sites (pockets) regardless of ligand types, dominate in the fields of protein-ligand binding site prediction. Such predictors include LIGSITE [11], CASTp [12], SURFNET [13], POCKET [14], fpocket [15], Q-SiteFinder [16], and SITEHOUND [17]. Later, researchers observed that protein-ligand binding sites (pockets) vary significantly in their roles, sizes, and distributions for different types of protein-ligand interactions, and different ligands tend to bind diverse types of residues with prominent specificities [18, 19]. These observations motivated the emergence of ligand-specific predictors, which are specifically designed to predict binding residues or sites for certain ligand types, such as NsitePred [20] and TargetS [21] for protein-nucleotide binding prediction, FINDSITE-metal [22] and CHED [23] for protein-metal binding prediction, MetaDBSite [24] and DNABR [25] for protein-DNA binding prediction, protein-drug binding prediction [26], and others. These studies have shown that ligand-specific predictors are often superior to general-purpose predictors and are a promising route for improving the performance of protein-ligand prediction [20, 21]. Vitamins are typical ligands and play critical roles in various metabolic processes [27-29]. However, to the best of our knowledge, minimal work has been performed to design a specific predictor for predicting protein-vitamin binding residues. Recently, Panwar et al. [30] published their pioneering work on protein-vitamin binding prediction and a predictor, called VitaPred, was implemented. VitaPred [30] is a sequence-based ligand-specific predictor specifically designed for predicting protein-vitamin binding residues, and it consists of four independent prediction modules for predicting vitamin, vitamin-A, vitamin-B, and pyridoxal-5-phosphate (vitamin-B6) binding residues. VitaPred encodes each residue into a 340-D feature vector by applying a sliding window to the position-specific scoring matrix (PSSM) of a protein sequence; then, a support vector machine (SVM) is trained on the set of feature vectors of all the training residues. In the prediction stage, the feature vector of each residue in a query sequence is fed into the trained SVM, and the binding propensity of each residue is obtained; finally, a threshold is used to determine whether a residue is vitamin-interacting. Although VitaPred achieved great success in predicting protein-vitamin binding residues, there is still room for further improving the prediction performance: first, only the PSSM-derived feature was used in VitaPred, and other valuable features (e.g., protein secondary structure) were not well considered; second, the PSSM feature constructed in VitaPred may contain redundant information, which is useless or even harmful for performing prediction. This paper follows the pioneering work of Panwar et al. [30] and aims to further improve the performance of protein-vitamin binding residue prediction. A new predictor, called TargetVita, which utilises multiple sequence-derived features and heterogeneous SVMs ensemble based on feature selection, is developed. In TargetVita, three different types of features (i.e., a position-specific scoring matrix feature, a predicted secondary structure feature, and a vitamin binding propensity feature) are combined to form the original feature space; then, three feature selection methods are performed on the original feature space to extract three different feature subspaces, and heterogeneous SVMs are trained on the reduced feature subspaces. Finally, when performing prediction, the vitamin-binding propensity of each residue in a query sequence is predicted by averaging the outputs of the three trained heterogeneous SVMs. Experimental results obtained with four separate vitamin-binding benchmark datasets demonstrate that the proposed TargetVita is superior to VitaPred [30] and an average improvement of 10% in terms of the Matthews correlation coefficient (MCC) was achieved over independent validation tests.

Methods

Benchmark datasets

In this study, four benchmark datasets created by Panwar et al. [30] were utilised to evaluate the efficacy of the proposed method. For convenience, the four benchmark datasets are denoted as DVI (dataset of vitamin-interacting proteins), DVAI (dataset of vitamin-A-interacting proteins), DVBI (dataset of vitamin-B-interacting proteins), and DPLPI (dataset of pyridoxal-5-phosphate interacting proteins), respectively. Each benchmark dataset was constructed with a stringent procedure as follows [30]: taking DVI as an example, 1061 PDB IDs of proteins that make contact with vitamins were first collected from SuperSite documentation [31]; then, the sequences of all chains of these 1061 PDB IDs were downloaded from the Protein Data Bank [32]. Among the obtained sequences, 2720 sequences were finally chosen according to the results returned from the Ligand Protein Contact (LPC) web server [33] by taking 1061 PDB IDs as inputs. Then, a threshold of 5.0 Å was used to determine the vitamin-interacting residues: a residue was considered to be vitamin-interacting if the closest distance between atoms of the protein and the partner vitamin was within the threshold (5.0 Å) [30]; finally, the maximal pairwise sequence identity of the vitamin-binding sequences obtained in the above steps was further reduced to 25% by using BLASTCLUST [34], and the obtained 187 vitamin-interacting chains with 30156 vitamin-binding residues constituted the DVI. Similarly, DVAI, DVBI, and DPLPI, which consist of 31, 141, and 71 non-redundant sequences, respectively, were also constructed by repeating the above-mentioned steps. Four different independent validation datasets for DVI, DVAI, DVBI, and DPLPI, which consist of 46, 15, 27, and 16 non-redundant sequences, respectively, were also constructed [30]. In addition, to guarantee the independence of the independent validation subset, the maximal pairwise sequence identity of each independent validation dataset is less than 25%, and any sequence in an independent validation dataset shares <25% identity to the sequences in the corresponding training dataset. Table 1 summarises the detailed compositions of the four benchmark datasets. Details for constructing these datasets can be found in [30].
Table 1

Compositions of the training datasets and the corresponding independent validation datasets for the 4 types of vitamin-interacting benchmark datasets

DatasetTraining DatasetIndependent Validation DatasetTotal No. of Sequences
No. of Sequences(numP, numN) * No. of Sequences(numP, numN) *
DVI187(3016, 62122)46(654, 11676)233
DVAI31(538, 7376)15(181, 1441)46
DVBI141(2219, 50179)27(419, 8947)168
DPLPI71(1092, 26638)16(246, 5935)87

*numP and numN represent the numbers of positive (binding) and negative (non-binding) samples, respectively.

Compositions of the training datasets and the corresponding independent validation datasets for the 4 types of vitamin-interacting benchmark datasets *numP and numN represent the numbers of positive (binding) and negative (non-binding) samples, respectively. To evaluate the performance of TargetVita with non-vitamin binding proteins, we also constructed a non-vitamin binding dataset, denoted as NVD, from BioLip [35], which is the most recently released semi-manually curated database for biologically relevant ligand–protein interactions. We constructed the NVD as follows: First, all the sequences that do not interact with vitamins are extracted from BioLip; then, the maximal pairwise sequence identity of the extracted protein sequences is culled to 30% by using CD-Hit [36] program, and the reduced dataset is obtained. Moreover, if a given sequence in the reduced dataset shares >30% identity with a sequence in the training dataset DVI, we remove the sequence from the reduced dataset. Finally, the remaining 6676 sequences (with 1852390 residues) constitute NVD. All the datasets used in this study are included in Additional file 1.

Feature representation

Feature representation is a critical step in designing a machine-learning-based predictor. In this study, multiple sequence-derived features, which potentially have a positive impact on the performance improvement of protein-vitamin binding residue prediction, are extracted and combined to form an informative feature space.

1) Position-specific scoring matrix (PSSM)

Previous studies have demonstrated that the evolutionary information reflected by a position specific scoring matrix (PSSM) is a powerful feature source in many bioinformatics problems including protein-ligand binding predictions [20, 21, 37–40]. In view of this, PSSM was also taken as a feature source in this study. First, we obtained the original PSSM of a sequence by executing PSI-BLAST [41] to search the Swiss-Prot database through three iterations with 0.001 as the E-value cut-off against the sequence; then, each element x contained in PSSM was normalised by the logistic function f(x) = 1/(1 + e− ), and the normalised PSSM was obtained. Finally, the PSSM-based feature vector for each residue in the query sequence can be extracted with a sliding window as follows: for a residue at position i of the query sequence, its feature vector consists of the normalised PSSM elements of the query sequence corresponding to a sequence segment of length W centred on i. In this study, W, i.e., the size of sliding window, was set to 17, which has been demonstrated to be a better choice in VitaPred [30] and several other protein-ligand binding site prediction studies [37, 38]. Consequently, the dimensionality of the PSSM feature vector of a residue is 17 × 20 = 340.

2) Predicted secondary structures (PSSs)

A fundamental hypothesis for most of the sequence-based protein attribute predictions is that sequences with similar structures will have similar functions. Previous studies have also shown that a close relationship exists between protein structure and function. Many structural characteristics, such as secondary structure information, have been extensively investigated for the identification of protein functional residues (e.g., protein-ligand binding residues [40, 42]). The appropriate utilisation of protein structural information may potentially help to improve the performance of protein-ligand binding prediction, as has been empirically demonstrated in our recent work [21, 38]. Therefore, protein secondary structure information, predicted from the protein sequence by performing PSIPRED [43], was used as another feature source for protein-vitamin binding residue prediction. The predicted secondary structure information of a protein sequence is obtained by applying PSIPRED [43] software, which predicts the likelihood that a given residue in a protein sequence belongs to one of three secondary structure classes: coil (C), helix (H), and strand (E). More specifically, for a protein sequence with L residues, PSIPRED outputs an L × 3 probability matrix, which represents the predicted secondary structure information of the protein. Again, a sliding window of size 17 was used to extract the predicted secondary structure feature of each residue, and the dimensionality of the extracted PSSs feature vector was 17 × 3 = 51.

3) Vitamin binding propensities (VBPs)

Previous studies have demonstrated that different ligands tend to bind different residues [18, 19, 21]. Panwar et al. [30] also analysed different protein-interacting residues of different vitamin classes and observed that different vitamins tend to bind different residues; this phenomenon can also be observed within vitamin sub-classes. Motivated by this observation, we can thus calculate the binding propensities of the 20 native amino acids for each type of vitamin and then extract a vitamin-specific 17-D binding propensity feature vector, denoted as VBP, for each residue in a protein sequence by concatenating the binding propensities of its neighbouring residues within a window of size 17 centred at the residue. Finally, the feature representation of a residue is formed by serially combining its three corresponding feature vectors, i.e., PSSM, PSSs, and VBPs, and the dimensionality of the obtained feature vector is 340 + 51 + 17 = 408-D.

Ensemble multiple heterogeneous subspace SVMs based on feature selection

After determining the feature representation, prediction models can be trained on a dataset with machine-learning algorithms such as SVM, as used in this study. However, directly training prediction models on the original feature space is often not the best solution. One important reason is that redundant information that has no positive, and sometimes even negative, impact on the prediction performance could potentially exist. Selection of the most discriminative feature subspace from the original feature space may help to improve prediction performance. Accordingly, feature selection has been a hotspot and is widely used in many bioinformatics and related fields [44], such as sequence analysis [45, 46] and microarray analysis [47, 48]. For example, our recent work [49] has demonstrated that the PSSM feature contains redundant information, which is useless for disulphide connectivity prediction, and the prediction performance can be further improved when the original PSSM feature is reduced to a lower but more compact feature subspace via feature selection. However, many existing traditional feature selection methods such as data variance [49], the Fisher score [50], and the Laplacian score [51] are faced with two deficiencies: first, the importance of features is calculated individually; thus, the correlation and dependency of different feature components are neglected. Second, the dimensionality of the reduced feature subspace needs to be prescribed in advance of the feature selection process, which is often difficult or even impossible in practice. Recently, we developed a generalised Joint Laplacian Feature Weights Learning algorithm [52], denoted as JLFWL, which can effectively address the above-mentioned two deficiencies. Rather than computing feature weights one by one, JLFWL automatically determines the optimal size of the feature subspace and selects the best feature components from the original feature space by iteratively learning the feature weights jointly and simultaneously. Here, we briefly restate JLFWL; details can be found in [52]. Let matrix X = [x1, x2, ⋯, x] ∈ R be a training dataset, where M is the number of samples, N is the dimensionality of features, and x is the feature vector of the i-th training sample. Then, the Joint Laplacian Feature Weights Learning can be summarised in Algorithm 1. Note that in Algorithm 1, ε ≥ 0 is a parameter to control the -norm of w. In this study, ε is set to be 0.5. Different feature subspaces can be selected from the original feature space with different feature selection methods, and the discriminative characteristics of the obtained feature subspaces may also differ. Prediction models trained on these different feature subspaces potentially complement each other, which motivates us to propose an ensemble learning scheme, i.e., a multiple heterogeneous subspace SVMs ensemble based on feature selection, as follows: First, multiple feature subspaces are selected by applying different feature selection methods; then, based on these selected feature subspaces, multiple prediction models, which are termed as heterogeneous models, can be trained on the same dataset; for a query input, the final prediction output is obtained by ensembling the outputs of the trained heterogeneous prediction models. To demonstrate the efficacy of the proposed ensemble learning scheme, three feature selection methods (i.e., our JLFWL together with two traditional feature selection methods, such as the Fisher score [50] and Laplacian score [51]) are taken to perform feature selections, and SVM [53][54] is used as the base machine-learning algorithm to train multiple heterogeneous prediction models. We locally performed JLFWL on benchmark datasets and found that the optimal dimensionality of the feature subspace, which is automatically determined by JLFWL, is 386. For consistency, the dimensionalities of the feature subspaces obtained by Fisher score [50] and Laplacian score [51] feature selection methods are also set to be 386. In this study, C-SVM is used and there are no specific weights for the three heterogamous SVM models. Radial basis function is chosen as the kernel function. The other two parameters, i.e., the regularisation parameter γ and the kernel width parameter σ, are set according to the optimisation results from a grid search strategy in the LIBSVM software.

Workflow of the proposed TargetVita

Figure 1 illustrates the workflow of the proposed TargetVita. In the training stage, all the feature vectors of the residues in the training sequences constitute the training feature vector set; then, L feature subspaces can be selected by performing L different feature selection methods on the training feature vector set; based on the selected feature subspace, L heterogamous SVM models can be trained.
Figure 1

Workflow of the proposed TargetVita. Predicted binding residues, modelled 3D structure, and the vitamin are highlighted in red, green, and yellow colours, respectively. Arrows highlighted in blue colour denote the workflows in the training stage, while arrows in black colour denote the workflows in the prediction stage.

Workflow of the proposed TargetVita. Predicted binding residues, modelled 3D structure, and the vitamin are highlighted in red, green, and yellow colours, respectively. Arrows highlighted in blue colour denote the workflows in the training stage, while arrows in black colour denote the workflows in the prediction stage. In the prediction stage, for each residue in a query protein sequence, its PSSM and PSSs feature vectors are first extracted by calling PSI-BLAST and PSIPRED and applying a sliding window technique; then, its PSSM, PSSs, and VBP feature vectors are combined and filtered by the L selected feature subspaces, and the L filtered feature vectors (i.e., feature vectors after feature selection) are further fed to the L corresponding trained heterogamous SVMs that predict, for that residue, the scores related to vitamin interaction; the final score for the residue is obtained by averaging the predicted scores of those heterogeneous SVM models. Finally, a threshold T is used to determine whether the residue is vitamin-interacting: residues with scores above threshold are marked as vitamin-interacting. To help with the visualisation of the prediction results, MODELLER software [55] is taken to model the protein 3D structure from the sequence, and the predicted vitamin-interacting residues are highlighted in red on the modelled 3D structure.

Evaluation indexes

To evaluate the performance of the proposed method, four routinely used evaluation indexes in this field, i.e., Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), and the Matthews correlation coefficient (MCC) were taken: where TP, FP, TN, and FN are the abbreviations of True Positive, False Positive, True Negative, and False Negative, respectively. However, these four evaluation indexes are threshold-dependent, i.e., the values of these indexes vary with the threshold chosen. Clearly, it is impossible and unnecessary to report the values of these indexes under all the possible thresholds [20, 21, 37]. In light of this, two threshold selection strategies, which have been widely applied in the related fields, were taken to report the threshold-dependent evaluation indexes, i.e., Sn, Sp, Acc, and MCC. Strategy I: Threshold that balances the values of Sn and Sp The threshold-dependent evaluation indexes are reported with a threshold, denoted as T, at which the value of Sn is equal or roughly equal to that of Sp. For convenience, the evaluation results obtained under T will be termed as Balanced Evaluation in the subsequent descriptions. Strategy II: Threshold that maximises the value of MCC Because the MCC provides the overall measurement of the quality of the binary predictions, we thus reported the threshold-dependent evaluation indexes by choosing the threshold, denoted as T, which maximises the value of MCC of predictions. Similarly, for convenience, the evaluation results obtained under T will be termed as MaxMCC Evaluation. In addition, to evaluate the overall prediction quality of a prediction model, the AUC, which is the area under the Receiver Operating Characteristic (ROC) curve and is threshold-independent, was also taken.

Results and discussion

Inappropriate cross-validation will over-estimate prediction performance

Cross-validation methods are often used to evaluate the performance of a predictor [56, 57]. Previous studies have shown that the leave-one-out cross-validation (jack-knife test) is most stringent [58-61]. However, leave-one-out cross-validation is time-consuming, especially when a dataset is huge and a complicated prediction algorithm (such as SVM in this study) is used. Additionally, because we need make a fair comparison of the proposed method with VitaPred [30], the five-fold cross-validation, which was used by Panwar et al. [30] to evaluate VitaPred, was also adopted in this study. Another critical aspect that should be addressed here is the method of performing five-fold cross-validation. In fact, five-fold cross-validation can be performed at two different levels for the considered protein-vitamin binding residue prediction problem: (i) residue-level cross-validation and (ii) sequence-level cross-validation. Residue-level five-fold cross-validation is performed as follows: residues in all the training protein sequences are randomly partitioned into five equally sized, disjoint subsets; then, one subset is used for testing, and the remaining four subsets are used for training; this process is continued until all the five subsets of the training dataset are traversed. For sequence-level five-fold cross-validation, training protein sequences, rather than training residues, are randomly partitioned into five equally sized, disjoint subsets; then, residues in one subset are used for testing, and residues in the remaining four subsets are used for training; this practice is continued until all the five subsets of the training dataset are traversed. In reference [30], Panwar et al. used residue-level cross-validation to evaluate the performance of their VitaPred. However, we believe that residue-level cross-validation tends to over-estimate the performance of a prediction model and is therefore inappropriate. Next, we will empirically demonstrate this argument as follows: Note that to objectively and fairly compare our results with those obtained by VitaPred [30], the same machine-learning model (i.e., SVM) and the same feature representation (i.e., 340-D PSSM feature) were also used in our experiments. For each of the four benchmark datasets (i.e., DVI, DVAI, DVBI, and DPLPI), we performed residue- and sequence-level five-fold cross-validations. Tables 2 and 3 summarise the performance comparisons between residue- and sequence-level five-fold cross-validations on the four benchmark datasets under Balanced Evaluation and MaxMCC Evaluation, respectively. Note that in Tables 2 and 3, SVM-R and SVM-S denote the results obtained by residue- and sequence-level cross-validations, respectively.
Table 2

Performance comparisons between residue- and sequence-level five-fold cross-validations on DVI, DVAI, DVBI, and DPLPI under

DatasetMethod Sn (%) Sp (%) Acc (%) MCC AUC TP TN FP FN
DVIVitaPred* 78.5278.6178.600.370.87----
SVM-R 77.8881.3481.180.300.8723495053011592667
SVM-S 77.6580.1680.040.290.8723424979712325674
DVAIVitaPred* 72.7076.8976.510.320.83----
SVM-R 73.9877.9477.670.300.8539857491627140
SVM-S 72.1276.3476.060.280.8238856311745150
DVBIVitaPred* 83.3380.5180.770.420.90----
SVM-R 80.4483.8383.680.330.901785420638116434
SVM-S 79.8682.9082.770.320.891772415988581447
DPLPIVitaPred* 90.2092.6192.400.670.97----
SVM-R 91.4893.3893.300.550.9799924874176493
SVM-S 90.3892.6292.530.520.96987246721966105

*Data obtained from [30].

◇SVM-R: The re-implementation of VitaPred over residue-level cross-validation.

△SVM-S: The re-implementation of VitaPred over sequence-level cross-validation.

Table 3

Performance comparisons between residue- and sequence-level five-fold cross-validations on DVI, DVAI, DVBI, and DPLPI under

DatasetMethod Sn (%) Sp (%) Acc (%) MCC AUC TP TN FP FN
DVIVitaPred* 52.1996.7992.730.530.87----
SVM-R 52.6298.2996.180.540.8715866106310591430
SVM-S 52.2998.3296.190.540.8715776107610461439
DVAIVitaPred* 42.7597.5192.540.480.83----
SVM-R 43.4996.3992.800.410.852347110266304
SVM-S 40.1596.3992.570.390.822167109267322
DVBIVitaPred* 55.5798.0494.180.610.90----
SVM-R 58.7798.4596.770.590.90130449401778915
SVM-S 58.1898.4096.690.580.89129149373806928
DPLPIVitaPred* 79.7698.6296.910.810.97----
SVM-R 79.6799.1998.420.790.9787026422216222
SVM-S 80.8699.0798.360.790.9688326391247209

*Data excerpted from [30].

◇SVM-R: The re-implementation of VitaPred over residue-level cross-validation.

△SVM-S: The re-implementation of VitaPred over sequence-level cross-validation.

Performance comparisons between residue- and sequence-level five-fold cross-validations on DVI, DVAI, DVBI, and DPLPI under *Data obtained from [30]. ◇SVM-R: The re-implementation of VitaPred over residue-level cross-validation. △SVM-S: The re-implementation of VitaPred over sequence-level cross-validation. Performance comparisons between residue- and sequence-level five-fold cross-validations on DVI, DVAI, DVBI, and DPLPI under *Data excerpted from [30]. ◇SVM-R: The re-implementation of VitaPred over residue-level cross-validation. △SVM-S: The re-implementation of VitaPred over sequence-level cross-validation. In Table 2, it can be observed that the values of Sn, Sp, Acc, MCC, and AUC for SVM-R are consistently superior to those for SVM-S throughout the four benchmark datasets. Taking MCC as an example, SVM-R clearly outperforms SVM-S, and an average improvement of 2% was observed on the four benchmark datasets. Similar results can also be observed in Table 3. From the comparison results between SVM-R and SVM-S on the four considered datasets listed in Tables 2 and 3, we empirically demonstrated that the residue-level cross-validation does over-estimate the performance of a prediction model. We speculate that the main reason for this over-estimation is that during the residue-level cross-validation, some testing residues and training residues may originate from the same protein sequence and thus have much higher homology, which will lead to a better prediction performance. On the other hand, SVM-R is in fact a re-implementation of VitaPred [30] because the prediction model and feature representation used in SVM-R are exactly the same as those used in VitaPred. By revisiting Tables 2 and 3, we can find that the AUC values for SVM-R on the four benchmark datasets are 0.87, 0.85, 0.90, and 0.97, which are equal to that for VitaPred with only one minor exception (i.e., 0.85 and 0.83 for SVM-R and VitaPred, respectively, in the DVAI dataset), showing that SVM-R and VitaPred achieved almost equal overall prediction quality. However, several abnormal phenomena are observed when comparing the other four indexes, (i.e., Sn, Sp, Acc, and MCC) between SVM-R and VitaPred. Using the results in the DPLPI dataset under Balanced Evaluation as an example (refer to Table 2), the values of Sn, Sp, and Acc for VitaPred are 90.20%, 92.61%, and 92.40%, respectively, which are obviously lower than those for SVM-R (i.e., 91.48%, 93.38%, and 93.30%, respectively); however, the value of MCC for VitaPred is, unexpectedly, approximately 12% higher than that for SVM-R. A similar phenomenon can also be observed in the DVAI dataset. According to the definitions of Sn, Sp, Acc, and MCC and the relationships between them, this phenomenon should not appear. Based on the comparison between SVM-R and VitaPred, we speculate that the MCC reported in VitaPred [30] has been over-estimated or over-optimised, which will be further demonstrated in the subsequent independent validation test. Considering that the residue-level cross-validation will over-estimate the performance of a prediction model, together with the fact that VitaPred may possibly have over-estimated the MCCs on benchmark datasets, we will take SVM-S (i.e., a re-implementation of VitaPred) over sequence-level cross-validation evaluation, rather than VitaPred itself, as the baseline predictor to demonstrate the improvements in our proposed methods in the subsequent experiments.

Improving prediction performance by combining the features of PSSM, PSSs, and VBPs

In this section, we will demonstrate that the performance of protein-vitamin interaction prediction can be further improved by combining the features of PSSM, PSSs, and VBPs. The features of 340-D PSSM, 51-D PSSs, and 17-D VBPs are serially combined to form a 408-D discriminative feature, denoted as PSSM + PSSs + VBPs. We then evaluated the SVM-S with the PSSM + PSSs + VBPs feature as a model input on each of the four benchmark datasets over five-fold sequence-level cross-validation. Tables 4 and 5 summarise the performance comparisons between the PSSM + PSSs + VBPs and PSSM features under Balanced Evaluation and MaxMCC Evaluation, respectively. From Tables 4 and 5, we can see that the prediction performances are indeed improved on all the four benchmark datasets after incorporating PSSs and VBPs features into the PSSM feature under both Balanced Evaluation and MaxMCC Evaluation. Taking MCC, which is the overall measurement of the quality of the binary predictions, as an example, an average improvement of 1.5% was observed under both Balanced Evaluation and MaxMCC Evaluation. In terms of the AUC, which measures the overall prediction quality of a prediction mode, an average improvement of 1% was also observed. Figure 2 illustrates the ROC curves of the predictions with PSSM and PSSM + PSSs + VBPs features, respectively, on the DVI dataset over sequence-level five-fold cross-validation.
Table 4

Performance comparisons between  +  and features on DVI, DVAI, DVBI, and DPLPI datasets over five-fold sequence-level cross-validation under

DatasetFeature Sn (%) Sp (%) Acc (%) MCC AUC TP TN FP FN
DVI PSSM 77.6580.1680.040.290.8723424979712325674
PSSM + PSSs + VBPs 78.5582.0281.860.310.8823695095111171647
DVAI PSSM 72.1276.3476.060.280.8238856311745150
PSSM + PSSs + VBPs 72.1278.2877.860.290.8438857741602150
DVBI PSSM 79.8682.9082.770.320.891772415988581447
PSSM + PSSs + VBPs 80.7185.1484.960.350.901791427247455428
DPLPI PSSM 90.3892.6292.530.520.96987246721966105
PSSM + PSSs + VBPs 91.4893.0993.030.540.9799924798184093
Table 5

Performance comparisons between  +  and on DVI, DVAI, DVBI, and DPLPI datasets over five-fold sequence-level cross-validation under

DatasetFeature Sn (%) Sp (%) Acc (%) MCC AUC TP TN FP FN
DVI PSSM 52.2998.3296.190.540.8715776107610461439
PSSM + PSSs + VBPs 53.2298.3296.230.550.8816056108010421411
DVAI PSSM 40.1596.3992.570.390.822167109267322
PSSM + PSSs + VBPs 44.2496.6193.050.430.842387126250300
DVBI PSSM 58.1898.4096.690.580.89129149373806928
PSSM + PSSs + VBPs 59.5898.3396.690.590.90132249342837897
DPLPI PSSM 80.8699.0798.360.790.9688326391247209
PSSM + PSSs + VBPs 81.3299.1298.420.790.9788826403235204
Figure 2

ROC curves of the predictions with and  +  +  features, respectively, on the DVI dataset over sequence-level five-fold cross-validation.

Performance comparisons between  +  and features on DVI, DVAI, DVBI, and DPLPI datasets over five-fold sequence-level cross-validation under Performance comparisons between  +  and on DVI, DVAI, DVBI, and DPLPI datasets over five-fold sequence-level cross-validation under ROC curves of the predictions with and  +  +  features, respectively, on the DVI dataset over sequence-level five-fold cross-validation. We have also provided performance comparisons of different feature combinations in Additional file 2: Table S1.

Ensembling heterogeneous SVMs helps to improve the prediction performance

In this section, we will empirically demonstrate that the prediction performance of protein-vitamin interactions can be further improved by ensembling multiple heterogeneous SVMs. More specifically, we adopted three feature selection methods (i.e., data variance [49], Fisher score [50], and Laplacian score [51]) to select three different feature subsets from the original PSSM + PSSs + VBPs feature space; then, we trained three heterogeneous SVMs on the selected feature subsets. The final prediction was performed by averaging the outputs of the three trained SVMs. For comparison, we also directly trained an SVM with the original PSSM + PSSs + VBPs feature, denoted as no ensemble. Table 6 summarises the performance comparisons between ensemble and no ensemble on all the four considered datasets over sequence-level five-fold cross-validation under Balanced Evaluation. Figure 3 illustrates the ROC curves of the predictions with ensemble and no ensemble, respectively, on the DVI dataset over sequence-level five-fold cross-validation.
Table 6

Performance comparisons between with- and without-ensemble on DVI, DVAI, DVBI, and DPLPI datasets over five-fold sequence-level cross-validation under

DatasetEnsemble Sn (%) Sp (%) Acc (%) MCC AUC TP TN FP FN
DVINo78.5582.0281.860.310.8823695095111171647
Yes78.4584.1783.900.340.892366522859837650
DVAINo72.1278.2877.860.290.8438857741602150
Yes72.6879.8979.400.310.8539158931483147
DVBINo80.7185.1484.960.350.901791427247455428
Yes81.3485.4985.310.360.911805428987281414
DPLPINo91.4893.0993.030.540.9799924798184093
Yes91.3093.6593.560.560.9799724947169195
Figure 3

ROC curves of the predictions with ensemble and no ensemble, respectively, on the DVI dataset over sequence-level five-fold cross-validation.

Performance comparisons between with- and without-ensemble on DVI, DVAI, DVBI, and DPLPI datasets over five-fold sequence-level cross-validation under ROC curves of the predictions with ensemble and no ensemble, respectively, on the DVI dataset over sequence-level five-fold cross-validation. In Table 6, it can be observed that the values of the five evaluation indexes (i.e., Sn, Sp, Acc, MCC, and AUC) of the prediction under ensemble are consistently superior to that of the prediction under no ensemble, with only two exceptions: Sn in the DVI and DPLPI datasets. The prediction under ensemble achieved average improvements of approximately 2% and 1% for MCC and AUC, respectively. The results demonstrate that the heterogeneous SVMs trained with different feature subsets can complement with each other, which accounts for the improvements in the prediction performance.

Comparison with existing protein-vitamin interaction predictors

In this section, we will compare the proposed method, called TargetVita, with existing predictors for protein-vitamin prediction. Note that in TargetVita, the PSSM + PSSs + VBPs feature was used as the input feature, and the SVM ensemble based on feature selection was applied. To the best of our knowledge, VitaPred [30] is the only predictor that was specifically designed for protein-vitamin prediction; thus, we compare the proposed TargetVita with VitaPred under both a cross-validation test and an independent validation test. However, the cross-validation performance of VitaPred may have been overestimated because it was evaluated by performing residue-level cross-validation. In view of this, SVM-S, which is a re-implementation of VitaPred that was evaluated over sequence-level cross-validation, was taken to compare with the proposed TargetVita when performing the cross-validation test. As for the independent validation test, TargetVita will be compared with both VitaPred and the SVM-S.

A. Cross-validation test

Tables 7 and 8 illustrate the performance comparisons between TargetVita and SVM-S on the four datasets over five-fold sequence-level cross-validation under Balanced Evaluation and MaxMCC Evaluation, respectively. From Table 7, we can see that the values of the five evaluation indexes for TargetVita are consistently superior to that of SVM-S throughout the four benchmark datasets. Taking MCC and AUC as examples, which are the two indexes measuring the overall prediction performance of a predictor, TargetVita clearly outperforms SVM-S, and average improvements of 3.5% and 2% were observed, respectively. Under MaxMCC Evaluation (refer to Table 8), a similar phenomenon can also be observed, with only minor exceptions on Sn.
Table 7

Performance comparisons with SVM-S on DVI, DVAI, DVBI, and DPLPI datasets over five-fold sequence-level cross-validation under

DatasetMethod Sn (%) Sp (%) Acc (%) MCC AUC TP TN FP FN
DVISVM-S 77.6580.1680.040.290.8723424979712325674
TargetVita78.4584.1783.900.340.892366522859837650
DVAISVM-S 72.1276.3476.060.280.8238856311745150
TargetVita72.6879.8979.400.310.8539158931483147
DVBISVM-S 79.8682.9082.770.320.891772415988581447
TargetVita81.3485.4985.310.360.911805428987281414
DPLPISVM-S 90.3892.6292.530.520.96987246721966105
TargetVita91.3093.6593.560.560.9799724947169195

△SVM-S: The re-implementation of VitaPred over sequence-level cross-validation.

Table 8

Performance comparisons with existing predictors on DVI, DVAI, DVBI, and DPLPI datasets over five-fold sequence-level cross-validation under

DatasetMethod Sn (%) Sp (%) Acc (%) MCC AUC TP TN FP FN
DVISVM-S 52.2998.3296.190.540.8715776107610461439
TargetVita51.0698.5996.390.550.891540612448781476
DVAISVM-S 40.1596.3992.570.390.822167109267322
TargetVita44.4396.8193.250.440.852397141235299
DVBISVM-S 58.1898.4096.690.580.89129149373806928
TargetVita56.2198.8197.020.600.91124849582597971
DPLPISVM-S 80.8699.0798.360.790.9688326391247209
TargetVita74.0599.6198.600.800.9781226534104280

△SVM-S: The re-implementation of VitaPred over sequence-level cross-validation.

Performance comparisons with SVM-S on DVI, DVAI, DVBI, and DPLPI datasets over five-fold sequence-level cross-validation under △SVM-S: The re-implementation of VitaPred over sequence-level cross-validation. Performance comparisons with existing predictors on DVI, DVAI, DVBI, and DPLPI datasets over five-fold sequence-level cross-validation under △SVM-S: The re-implementation of VitaPred over sequence-level cross-validation.

B. Independent validation test

Performing only cross-validation comparisons to demonstrate the effectiveness of a newly developed method over an existing method is often not convincing, the reason being that the characteristics of the new method may be over-fitted and/or over-optimised to the underlying dataset for the purpose of pursuing positive comparison results [21, 62, 63]. Validation on fresh independent data has been considered as an important and necessary procedure when comparing different methods, and it has been widely applied in related research. With this view, we also performed independent validation tests to further demonstrate the superiority of the proposed TargetVita over existing protein-vitamin predictors. Tables 9 and 10 summarise the performance comparisons between TargetVita and existing predictors on independent validation tests under Balanced Evaluation and MaxMCC Evaluation, respectively. From Tables 9 and 10, two observations can be made as follows:
Table 9

Performance comparisons with existing predictors on the independent validation datasets under

DatasetMethod Sn (%) Sp (%) Acc (%) MCC AUC TP TN FP FN
DVIVitaPred* 73.7071.9872.070.22- - - - -
SVM-S 75.3878.5178.350.280.8549391672509161
TargetVita80.7381.0581.030.330.8952894632213126
DVAIVitaPred* 73.4872.8772.930.31-----
SVM-S 73.4879.2578.610.380.83133114229948
TargetVita79.0179.1879.160.410.86143114130038
DVBIVitaPred* 83.0568.7669.400.23-----
SVM-S 78.2881.4981.350.300.883287291165691
TargetVita81.3881.6981.680.320.903417309163878
DPLPIVitaPred* 84.1583.2283.260.33-----
SVM-S 85.7790.1890.000.440.95211535258335
TargetVita89.0289.3089.290.440.96219530063527

*Data excepted from [30].

△SVM-S: The re-implementation of VitaPred.

Table 10

Performance comparisons with existing predictors on the independent validation datasets under

DatasetMethod Sn (%) Sp (%) Acc (%) MCC AUC TP TN FP FN
DVIVitaPred* 41.7496.6393.720.38-----
SVM-S 47.0998.4095.680.520.8530811489187346
TargetVita47.0198.4295.690.520.8930811491185346
DVAIVitaPred* 30.3997.2289.770.37-----
SVM-S 32.0497.0989.830.380.8358139942123
TargetVita38.1296.8190.260.430.8669139546112
DVBIVitaPred* 49.4094.4992.470.35-----
SVM-S 52.0398.2596.180.530.882188790157201
TargetVita51.0698.6996.560.550.902148830117205
DPLPIVitaPred* 65.8598.4097.100.63-----
SVM-S 72.7699.1198.060.740.9517958825367
TargetVita74.3999.0798.090.750.9618358805563

*Data excepted from [30].

△SVM-S: The re-implementation of VitaPred.

Performance comparisons with existing predictors on the independent validation datasets under *Data excepted from [30]. △SVM-S: The re-implementation of VitaPred. Performance comparisons with existing predictors on the independent validation datasets under *Data excepted from [30]. △SVM-S: The re-implementation of VitaPred. First, the proposed TargetVita significantly outperforms VitaPred under both Balanced Evaluation and MaxMCC Evaluation. Taking MCC as an example, TargetVita achieved approximately 9% ~ 11% and 6% ~ 20% improvements on the four independent validation datasets under Balanced Evaluation and MaxMCC Evaluation, respectively. In addition, TargetVita also outperformed SVM-S, which is a re-implementation of VitaPred, and acted as the best performer with an average improvement of approximately 2.5% for MCC if compared with the second best performer, SVM-S. Second, the values of MCC for VitaPred over the cross-validation test and independent validation test differ significantly under both the Balanced Evaluation and MaxMCC Evaluation if compared with TargetVita and SVM-S. In other words, the performance evaluated over the cross-validation test is significantly better than that evaluated over the independent validation test for VitaPred, while similar performances were obtained over the cross-validation test and the independent validation test for TargetVita and SVM-S. Taking the results under MaxMCC Evaluation as an example, the values of MCC over the cross-validation test for VitaPred are 0.53, 0.48, 0.61, and 0.81, respectively (refer to Table 3), while the values of MCC over the independent validation test for VitaPred are 0.38, 0.37, 0.35, and 0.63, respectively (refer to Table 10), for the four considered vitamins. Then, we can calculate that the MCC differences between the cross-validation test and the independent validation test of VitaPred for the four considered vitamins are 0.15, 0.11, 0.26, and 0.18, respectively. By revisiting Table 8, together with Table 10, we calculate that the MCC differences between the cross-validation test and the independent validation test of the proposed TargetVita for the four considered vitamins are only 0.03, 0.01, 0.05, and 0.05, respectively. Similarly, the MCC differences of SVM-S can also be calculated from Table 3 and Table 10. Figure 4 illustrates the differences between the MCC values over the cross-validation test and the independent validation test for VitaPred, SVM-S, and TargetVita on the four considered vitamins under MaxMCC Evaluation. Note that in Figure 4, d, d, and d denote the MCC differences for VitaPred, SVM-S, and TargetVita, respectively, and only the MCC differences on the DPLPI dataset are explicitly labelled.
Figure 4

Differences between the values over the cross-validation test and over the independent validation test for VitaPred, SVM-S, and TargetVita on the four considered vitamins under .

Differences between the values over the cross-validation test and over the independent validation test for VitaPred, SVM-S, and TargetVita on the four considered vitamins under . From Figure 4, we can intuitively find that the proposed TargetVita and SVM-S achieve similar performances (in terms of MCC) over both the cross-validation test and independent validation test, as their MCC differences are small, while the performance of VitaPred over the independent validation test is significantly lower than that that over the cross-validation test with all four considered vitamins, indicating that the performance of VitaPred over the cross-validation test has potentially been over-estimated or over-optimised, thus leading to a lower generalisation capability (i.e., poor performance with independent fresh data). This observation further supports the speculation (i.e., the MCC of VitaPred has potentially been over-estimated) we made in a previous section.

C. Performance on a non-vitamin binding dataset

We then performed performance comparisons between the proposed TargetVita and VitaPred on the non-vitamin binding dataset NVD, and the results of the comparison are listed in Additional file 2: Table S2. Note that the results of VitaPred and TargetVita were obtained by feeding the 6676 sequences to their corresponding web servers with default threshold settings. From Table S2, we can clearly see that the proposed TargetVita achieved much better prediction performance than VitaPred. Among the 1852390 residues in the 6676 non-vitamin binding sequences, 46319 residues were mistakenly predicted as binding residues by VitaPred, while only 36361 false positives were obtained by TargetVita.

Conclusions

In this study, we have designed and implemented a new sequence-based predictor, called TargetVita, for protein-vitamin binding residue prediction. TargetVita performs prediction by utilising multiple features derived from protein sequences and effectively ensembling heterogeneous SVMs trained on different feature subspaces. Experimental results on benchmark datasets demonstrated that the proposed TargetVita can achieve good performance and is superior to existing protein-vitamin binding residue predictors. Our future work will focus on further improving the prediction performance of TargetVita by uncovering new effective feature sources and applying more powerful machine-learning algorithms.

Availability of supporting data

The data sets supporting the results of this article are included within the Additional file 1. Additional file 1: Datasets used in this study. (PDF 191 KB) Additional file 2: Table S1. Performance comparisons of different feature combinations over 5-fold sequence-level cross-validation under MaxMCC Evaluation. Table S2. Performance comparisons between the proposed TargetVita and VitaPred on the non-vitamin binding dataset NVD. (PDF 165 KB)
  53 in total

1.  Protein secondary structure prediction based on position-specific scoring matrices.

Authors:  D T Jones
Journal:  J Mol Biol       Date:  1999-09-17       Impact factor: 5.469

2.  The Protein Data Bank.

Authors:  H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

3.  Identification of DNA regulatory motifs using Bayesian variable selection.

Authors:  Mahlet G Tadesse; Marina Vannucci; Pietro Liò
Journal:  Bioinformatics       Date:  2004-04-29       Impact factor: 6.937

4.  POCKET: a computer graphics method for identifying and displaying protein cavities and their surrounding amino acids.

Authors:  D G Levitt; L J Banaszak
Journal:  J Mol Graph       Date:  1992-12

5.  Understanding and predicting druggability. A high-throughput method for detection of drug binding sites.

Authors:  Peter Schmidtke; Xavier Barril
Journal:  J Med Chem       Date:  2010-08-12       Impact factor: 7.446

6.  Machine learning in biomedicine and bioinformatics.

Authors:  Leif E Peterson; Xue-Wen Chen
Journal:  Int J Data Min Bioinform       Date:  2009       Impact factor: 0.667

7.  Some remarks on protein attribute prediction and pseudo amino acid composition.

Authors:  Kuo-Chen Chou
Journal:  J Theor Biol       Date:  2010-12-17       Impact factor: 2.691

8.  Fpocket: an open source platform for ligand pocket detection.

Authors:  Vincent Le Guilloux; Peter Schmidtke; Pierre Tuffery
Journal:  BMC Bioinformatics       Date:  2009-06-02       Impact factor: 3.169

9.  BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions.

Authors:  Jianyi Yang; Ambrish Roy; Yang Zhang
Journal:  Nucleic Acids Res       Date:  2012-10-18       Impact factor: 16.971

10.  SuperSite: dictionary of metabolite and drug binding sites in proteins.

Authors:  Raphael André Bauer; Stefan Günther; Dominic Jansen; Carolin Heeger; Paul Florian Thaben; Robert Preissner
Journal:  Nucleic Acids Res       Date:  2008-10-08       Impact factor: 16.971

View more
  6 in total

1.  Protein-protein interaction sites prediction by ensemble random forests with synthetic minority oversampling technique.

Authors:  Xiaoying Wang; Bin Yu; Anjun Ma; Cheng Chen; Bingqiang Liu; Qin Ma
Journal:  Bioinformatics       Date:  2019-07-15       Impact factor: 6.937

2.  Prediction of disease-associated nsSNPs by integrating multi-scale ResNet models with deep feature fusion.

Authors:  Fang Ge; Ying Zhang; Jian Xu; Arif Muhammad; Jiangning Song; Dong-Jun Yu
Journal:  Brief Bioinform       Date:  2022-01-17       Impact factor: 11.622

Review 3.  Artificial Intelligence in Nutrients Science Research: A Review.

Authors:  Jarosław Sak; Magdalena Suchodolska
Journal:  Nutrients       Date:  2021-01-22       Impact factor: 6.706

4.  Computational methods for ubiquitination site prediction using physicochemical properties of protein sequences.

Authors:  Binghuang Cai; Xia Jiang
Journal:  BMC Bioinformatics       Date:  2016-03-03       Impact factor: 3.169

5.  SAMbinder: A Web Server for Predicting S-Adenosyl-L-Methionine Binding Residues of a Protein From Its Amino Acid Sequence.

Authors:  Piyush Agrawal; Gaurav Mishra; Gajendra P S Raghava
Journal:  Front Pharmacol       Date:  2020-01-30       Impact factor: 5.810

6.  NAGbinder: An approach for identifying N-acetylglucosamine interacting residues of a protein from its primary sequence.

Authors:  Sumeet Patiyal; Piyush Agrawal; Vinod Kumar; Anjali Dhall; Rajesh Kumar; Gaurav Mishra; Gajendra P S Raghava
Journal:  Protein Sci       Date:  2019-11-07       Impact factor: 6.725

  6 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.