
Protein Remote Homology Detection Based on an Ensemble Learning Approach.

Junjie Chen¹, Bingquan Liu², Dong Huang³.

Abstract

Protein remote homology detection is one of the central problems in bioinformatics. Although some computational methods have been proposed, the problem is still far from being solved. In this paper, an ensemble classifier for protein remote homology detection, called SVM-Ensemble, is proposed with a weighted voting strategy. SVM-Ensemble combines three basic classifiers based on different feature spaces: Kmer, ACC, and SC-PseAAC. These features describe the characteristics of proteins from various perspectives, incorporating both the sequence composition and the sequence-order information along the protein sequences. Experimental results on a widely used benchmark dataset showed that SVM-Ensemble clearly improves the predictive performance of protein remote homology detection. Moreover, it achieved the best performance and outperformed other state-of-the-art methods.

Substances: Proteins

Year: 2016 PMID: 27294123 PMCID: PMC4875977 DOI: 10.1155/2016/5813645

Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411

1. Introduction

In computational biology, protein remote homology detection is the classification of proteins into structural and functional classes from their amino acid sequences, especially when sequence identity is low. It is a critical step in both basic research and practical applications such as protein 3D structure and function prediction [1, 2]. Although remotely homologous proteins have similar structures and functions, they lack easily detectable sequence similarity, because protein structures are more conserved than protein sequences. When the protein sequence similarity is below 35% at the amino acid level, the alignment score usually falls into a twilight zone [3, 4]. Computational approaches based only on protein sequence features therefore often fail to detect remote homology. To improve the specificity and sensitivity of detection, we propose an ensemble learning method that combines basic classifiers built on different feature spaces.

Many methods for protein remote homology detection have been proposed, and they can be categorized into three groups [5]: pairwise alignment algorithms, generative models, and discriminative classifiers. The earliest computational approaches are pairwise alignment methods, which detect sequence similarity between two given protein sequences using the Needleman-Wunsch global alignment algorithm [6, 7] or the Smith-Waterman local alignment algorithm [8]. Later, methods such as BLAST [9] and FASTA [10] traded some accuracy for improved efficiency. PSI-BLAST [11] iteratively builds a probabilistic profile of a query sequence, so that a more sensitive sequence comparison score can be calculated [12]. Generative algorithms then significantly improved the predictive accuracy over pairwise alignment methods. Generative models are trained iteratively on positive samples of a protein family or superfamily; for example, HHblits [13] generates a profile hidden Markov model (profile-HMM) [14, 15] from the query sequence and iteratively searches through a large database.

Currently, discriminative methods achieve state-of-the-art performance [16-19]. Unlike pairwise alignment algorithms and generative methods, discriminative methods can easily embed various characteristics of protein sequences and learn from both positive and negative samples in a given benchmark dataset. A key requirement of discriminative methods is that their input consists of fixed-length feature vectors, so researchers have proposed various feature vectors for protein representation. Some methods are based on sequence information, physical and chemical properties of proteins [20-22], or secondary structure information [23, 24], such as SVM-DR [25]. Others are based on kernel methods, such as SVM-Pairwise [5], SVM-LA [26], motif kernel [27], mismatch [28], SW-PSSM [29], and profile kernel [30]. The performance of discriminative approaches was later improved further by Top-n-gram, which transforms protein profiles into pseudo protein sequences that contain evolutionary information [31-33].

Although many discriminative methods for protein remote homology detection have been proposed based on various feature extraction techniques, there has been no attempt to combine these methods with ensemble learning to improve predictive performance. An ensemble classifier [34, 35] is built by combining a set of basic classifiers with a weighted voting strategy to reach a final decision when classifying a query sample.
Ensemble classifiers have achieved great success in many fields, including protein-protein interaction site prediction [36], protein fold pattern recognition [22, 37], tRNA detection [38, 39], microRNA identification [40-44], DNA binding protein identification [45], and eukaryotic protein subcellular location prediction [46], because they can learn a more expressive concept than a single classifier and reduce the variance caused by relying on a single classifier. In this study, inspired by the success of ensemble classifiers in other fields, we propose an ensemble classifier for protein remote homology detection, called SVM-Ensemble, which combines three state-of-the-art discriminative methods with a weighted voting strategy. The three basic classifiers SVM-Kmer, SVM-ACC, and SVM-SC-PseAAC were constructed with Kmer, auto-cross covariance (ACC), and series correlation pseudo amino acid composition (SC-PseAAC), respectively. Experimental results on a widely used benchmark dataset [5] showed that SVM-Ensemble clearly improves the predictive performance by combining various features. Moreover, SVM-Ensemble achieved an average ROC score of 0.943 (Table 4), outperforming the other state-of-the-art methods, which indicates that it would be a useful computational tool for protein remote homology detection.

2. Materials and Methods

2.1. Benchmark Dataset

A widely used superfamily benchmark [5] was used to evaluate the performance of our method for protein remote homology detection. The classification problem definition and benchmark dataset are available at http://noble.gs.washington.edu/proj/svm-pairwise/. The same dataset has been used in a number of earlier studies [26, 47–50], which allows direct performance comparisons. The benchmark contains 54 families and 4352 proteins derived from SCOP database version 1.53, and the similarity between any two sequences is below an E-value of $10^{-25}$. Remote homology detection can be treated as a superfamily classification problem. For each family, the proteins within the family were regarded as positive test samples, and the proteins outside the family but within the same superfamily were taken as positive training samples. Negative samples were selected from outside the fold and split into training and testing sets. This process was repeated until each family had been tested. This yielded 54 families with at least 10 positive training examples and 5 positive test examples.

2.2. Profile-Based Protein Representation

Although some methods have achieved a certain degree of success using amino acid sequence information alone, their performance is not satisfactory. Recent studies demonstrated that methods operating on profile-based protein sequences perform better, because a profile carries richer evolutionary information than an individual sequence [50, 53]. The frequency profile 𝕄 for protein P with L amino acids can be represented as

$$\mathbb{M} = \begin{bmatrix} m_{1,1} & m_{1,2} & \cdots & m_{1,L} \\ m_{2,1} & m_{2,2} & \cdots & m_{2,L} \\ \vdots & \vdots & \ddots & \vdots \\ m_{20,1} & m_{20,2} & \cdots & m_{20,L} \end{bmatrix} \tag{1}$$

where $m_{i,j}$ ($0 \le m_{i,j} \le 1$) is the target frequency, which reflects the probability of amino acid i (i = 1, 2, …, 20) occurring at sequence position j (j = 1, 2, …, L) in protein P during evolutionary processes. For each column in 𝕄, the elements add up to 1, so each column can be regarded as an independent multinomial distribution. The target frequencies were calculated from the multiple sequence alignments generated by running PSI-BLAST [11] against the NCBI NR database with default parameters, except that the number of iterations was set to 10. The details of how to build a frequency profile can be found in [50].

Given the frequency profile 𝕄 for protein P, we find the amino acid with the maximum frequency in each column of 𝕄. These amino acids are combined to produce the profile-based protein representation. In a frequency profile 𝕄, the target frequencies reflect the probabilities of the corresponding amino acids appearing at specific sequence positions: the higher the frequency, the more likely the corresponding amino acid is to occur. The resulting profile-based protein sequence therefore retains evolutionary information from the frequency profile. We convert the frequency profiles into a series of profile-based proteins, so that existing sequence-based methods can be applied directly to these representations.
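The column-wise maximum step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the row order of the profile (`AMINO_ACIDS`), the helper `toy_profile`, and the toy frequencies are hypothetical and are not taken from the paper or from PSI-BLAST output.

```python
# Assumed row order of the 20 x L frequency profile (hypothetical convention).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def profile_to_sequence(profile):
    """Return the profile-based sequence: the most frequent residue per column.

    profile is a list of 20 rows, each holding L target frequencies.
    """
    if len(profile) != len(AMINO_ACIDS):
        raise ValueError("profile must have one row per amino acid")
    length = len(profile[0])
    residues = []
    for j in range(length):
        # Row index with the largest target frequency in column j
        # (ties resolve to the first row, an arbitrary choice).
        best_row = max(range(len(AMINO_ACIDS)), key=lambda i: profile[i][j])
        residues.append(AMINO_ACIDS[best_row])
    return "".join(residues)


def toy_profile(peaks):
    """Build a toy profile: each column gives one residue a peak frequency
    and spreads the remaining mass uniformly, so every column sums to 1."""
    columns = []
    for residue, peak in peaks:
        rest = (1.0 - peak) / (len(AMINO_ACIDS) - 1)
        columns.append([peak if aa == residue else rest for aa in AMINO_ACIDS])
    # Transpose the column list into 20 rows of length L.
    return [[col[i] for col in columns] for i in range(len(AMINO_ACIDS))]


example = toy_profile([("M", 0.5), ("K", 0.3), ("W", 0.9)])
print(profile_to_sequence(example))
```

A real profile would come from the PSI-BLAST alignments described in the text; only the argmax step is shown here.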

2.3. Feature Vector Representations for Protein Sequences

In this study, three kinds of features were employed to construct the SVM-Ensemble predictor: Kmer, auto-cross covariance (ACC), and series correlation pseudo amino acid composition (SC-PseAAC). Suppose a protein sequence P with L amino acid residues is represented as

$$P = R_1 R_2 R_3 \cdots R_L \tag{2}$$

where $R_i$ represents the amino acid residue at sequence position i, such that $R_1$ is the residue at position 1, $R_2$ is the residue at position 2, and so on. The three representation methods are described as follows.

2.3.1. Kmer

Kmer [56] is the simplest approach to represent the proteins, in which the protein sequences are represented as the occurrence frequencies of k neighboring amino acids.
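As an illustration of the Kmer representation, the sketch below counts overlapping windows of k residues and normalizes by the number of windows, giving a vector of length 20^k (400 for k = 2). The normalization choice, the residue ordering, and the function name `kmer_features` are assumptions made for this example and are not confirmed by the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def kmer_features(sequence, k=2):
    """Occurrence frequencies of all 20**k k-mers in a protein sequence."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {kmer: n for n, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = 0
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        if window in index:  # windows with non-standard residues are skipped
            counts[index[window]] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [count / total for count in counts]


features = kmer_features("MHMHA", k=2)
print(len(features))
```

For the toy sequence MHMHA the four overlapping windows are MH, HM, MH, and HA, so the MH entry carries half of the total frequency.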

2.3.2. Auto-Cross Covariance (ACC)

The ACC transformation [60-62] builds two signal sequences and then calculates the correlation between them. ACC results in two kinds of variables: the autocovariance (AC) transformation and the cross covariance (CC) transformation. The AC variable measures the correlation of the same property between two residues separated by a distance of lag along the sequence. The CC variable measures the correlation of two different properties between two residues separated by lag along the sequence.

Autocovariance (AC) Transformation. Given a protein sequence P in (2), the AC variable can be calculated by

$$AC(u, \mathrm{lag}) = \frac{1}{L - \mathrm{lag}} \sum_{i=1}^{L-\mathrm{lag}} \left(P_u(R_i) - \bar{P}_u\right)\left(P_u(R_{i+\mathrm{lag}}) - \bar{P}_u\right) \tag{3}$$

where u is a physicochemical index, L is the length of the protein sequence, $P_u(R_i)$ is the numerical value of the physicochemical index u for the amino acid $R_i$, and $\bar{P}_u$ is the average value of the physicochemical index u along the whole sequence:

$$\bar{P}_u = \frac{1}{L} \sum_{j=1}^{L} P_u(R_j) \tag{4}$$

In this way, the length of the AC feature vector is N∗LAG, where N is the number of physicochemical indices and LAG is the maximum value of lag (lag = 1, 2, …, LAG).

Cross Covariance (CC) Transformation. Given a protein sequence P in (2), the CC variable can be calculated by

$$CC(u_1, u_2, \mathrm{lag}) = \frac{1}{L - \mathrm{lag}} \sum_{i=1}^{L-\mathrm{lag}} \left(P_{u_1}(R_i) - \bar{P}_{u_1}\right)\left(P_{u_2}(R_{i+\mathrm{lag}}) - \bar{P}_{u_2}\right) \tag{5}$$

where $u_1$ and $u_2$ are two different physicochemical indices, L is the length of the protein sequence, and $P_{u_1}(R_i)$ and $P_{u_2}(R_{i+\mathrm{lag}})$ are the numerical values of the physicochemical indices $u_1$ and $u_2$ for the amino acids $R_i$ and $R_{i+\mathrm{lag}}$. $\bar{P}_{u_1}$ and $\bar{P}_{u_2}$ are the average values of the physicochemical indices $u_1$ and $u_2$ along the whole sequence, and they can be calculated by (4). In this way, the length of the CC feature vector is N∗(N − 1)∗LAG, where N is the number of physicochemical indices and LAG is the maximum value of lag (lag = 1, 2, …, LAG).

Therefore, the length of the ACC feature vector is N∗N∗LAG. In the current implementation, three physicochemical properties were employed: hydrophobicity, hydrophilicity, and mass (see Table S1 in Supplementary file, available online at http://dx.doi.org/10.1155/2016/5813645), extracted from AAindex [57, 63].
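The AC and CC variables can be illustrated with a short Python sketch. It assumes the usual form of the covariance (a sum over positions 1 to L − lag of the product of mean-centred property values, divided by L − lag). The property tables in `TOY_PROPERTIES` are placeholders invented for this example; they are not the Table S1 or AAindex values, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder property tables (NOT the real hydrophobicity, hydrophilicity,
# or mass values); they only make the example runnable.
TOY_PROPERTIES = {
    "h1": {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)},
    "h2": {aa: float((7 * i) % 20) for i, aa in enumerate(AMINO_ACIDS)},
    "m": {aa: float((3 * i) % 20) for i, aa in enumerate(AMINO_ACIDS)},
}


def covariance(values_a, values_b, lag):
    """Covariance between signal a at position i and signal b at i + lag.

    Passing the same signal twice gives the AC variable; passing two
    different signals gives the CC variable.
    """
    length = len(values_a)
    mean_a = sum(values_a) / length
    mean_b = sum(values_b) / length
    total = sum(
        (values_a[i] - mean_a) * (values_b[i + lag] - mean_b)
        for i in range(length - lag)
    )
    return total / (length - lag)


def acc_features(sequence, properties, max_lag):
    """Concatenate AC (N*LAG values) and CC (N*(N-1)*LAG values) features."""
    if max_lag >= len(sequence):
        raise ValueError("max_lag must be smaller than the sequence length")
    signals = {
        name: [table[residue] for residue in sequence]
        for name, table in properties.items()
    }
    names = list(signals)
    features = []
    for u in names:  # autocovariance block
        for lag in range(1, max_lag + 1):
            features.append(covariance(signals[u], signals[u], lag))
    for u1 in names:  # cross covariance block, ordered pairs with u1 != u2
        for u2 in names:
            if u1 != u2:
                for lag in range(1, max_lag + 1):
                    features.append(covariance(signals[u1], signals[u2], lag))
    return features


vector = acc_features("MKVLAAGIWYHMKRST", TOY_PROPERTIES, max_lag=14)
print(len(vector))
```

With N = 3 properties and LAG = 14, the vector has 3∗3∗14 = 126 entries, which matches the N∗N∗LAG length stated in the text.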

2.3.3. Series Correlation Pseudo Amino Acid Composition (SC-PseAAC)

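SC-PseAAC [64] is an approach that incorporates both the contiguous local sequence-order information and the global sequence-order information into the feature vector of the protein sequence. Given a protein sequence P in (2), the SC-PseAAC [64] feature vector of P is defined as

$$P = \left[x_1, x_2, \ldots, x_{20}, x_{20+1}, \ldots, x_{20+3\lambda}\right]^{T}$$

where

$$x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{3\lambda} \tau_j}, & 1 \le u \le 20 \\[2ex] \dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{3\lambda} \tau_j}, & 20 < u \le 20 + 3\lambda \end{cases}$$

where $f_i$ (i = 1, 2, …, 20) is the normalized occurrence frequency of the 20 native amino acids in the protein P; the parameter λ is an integer representing the highest counted rank (or tier) of the correlation along a protein sequence; w is the weight factor ranging from 0 to 1; and $\tau_j$ are the sequence-correlation factors, of which those of tier k reflect the sequence-order correlation between all of the k-th most contiguous residues along a protein sequence. They are defined as

$$\tau_{3k-2} = \frac{1}{L-k}\sum_{i=1}^{L-k} H^{1}_{i,i+k}, \qquad \tau_{3k-1} = \frac{1}{L-k}\sum_{i=1}^{L-k} H^{2}_{i,i+k}, \qquad \tau_{3k} = \frac{1}{L-k}\sum_{i=1}^{L-k} M_{i,i+k}, \qquad k = 1, 2, \ldots, \lambda$$

where $H^{1}$, $H^{2}$, and $M$ are the hydrophobicity, hydrophilicity, and mass correlation functions given by

$$H^{1}_{i,j} = h^{1}(R_i)\,h^{1}(R_j), \qquad H^{2}_{i,j} = h^{2}(R_i)\,h^{2}(R_j), \qquad M_{i,j} = m(R_i)\,m(R_j)$$

where $h^{1}(R_i)$, $h^{2}(R_i)$, and $m(R_i)$ are the substituting values of hydrophobicity, hydrophilicity, and mass for amino acid $R_i$. They are all subjected to a standard conversion, shown here for hydrophobicity and applied in the same way to $h^{2}$ and $m$:

$$h^{1}(R_i) = \frac{h^{1}_{0}(R_i) - \frac{1}{20}\sum_{k=1}^{20} h^{1}_{0}(\mathbb{R}_k)}{\sqrt{\frac{1}{20}\sum_{y=1}^{20}\left[h^{1}_{0}(\mathbb{R}_y) - \frac{1}{20}\sum_{k=1}^{20} h^{1}_{0}(\mathbb{R}_k)\right]^{2}}}$$

where $\mathbb{R}_i$ (i = 1, 2, …, 20) represents the 20 native amino acids. The symbols $h^{1}_{0}(R_i)$, $h^{2}_{0}(R_i)$, and $m_{0}(R_i)$ represent the original hydrophobicity, hydrophilicity, and mass values (see Table S1 in Supplementary file) of the amino acid $R_i$.

These features can be generated by a web server called Pse-in-One [56], which generates the desired feature vectors for protein/peptide and DNA/RNA sequences according to users' needs. It covers a total of 28 different modes, of which 14 are for DNA sequences, 6 are for RNA sequences, and 8 are for protein sequences.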
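A rough Python sketch of an SC-PseAAC-style feature vector follows. It rests on several assumptions that the extracted text does not confirm: each of the three properties contributes one correlation factor per tier (giving 20 + 3λ features), the correlation function is the product of standardized property values, and the property tables in `TOY_RAW` are placeholders rather than the Table S1 values. The function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder "original" property tables (NOT the real Table S1 values).
TOY_RAW = [
    {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)},
    {aa: float((7 * i) % 20) for i, aa in enumerate(AMINO_ACIDS)},
    {aa: float((3 * i) % 20) for i, aa in enumerate(AMINO_ACIDS)},
]


def standardize(raw):
    """Standard conversion: zero mean and unit variance over the 20 residues."""
    count = len(raw)
    mean = sum(raw.values()) / count
    deviation = math.sqrt(sum((v - mean) ** 2 for v in raw.values()) / count)
    return {aa: (v - mean) / deviation for aa, v in raw.items()}


def sc_pseaac(sequence, raw_properties, lam, w):
    """Return 20 composition terms followed by 3*lam correlation terms."""
    length = len(sequence)
    if lam >= length:
        raise ValueError("lam must be smaller than the sequence length")
    tables = [standardize(raw) for raw in raw_properties]
    # Normalized occurrence frequencies f_1..f_20.
    freqs = [sequence.count(aa) / length for aa in AMINO_ACIDS]
    # Correlation factors: one per property for each tier k = 1..lam.
    taus = []
    for k in range(1, lam + 1):
        for table in tables:
            total = sum(
                table[sequence[i]] * table[sequence[i + k]]
                for i in range(length - k)
            )
            taus.append(total / (length - k))
    denominator = sum(freqs) + w * sum(taus)
    return [f / denominator for f in freqs] + [w * t / denominator for t in taus]


if __name__ == "__main__":
    vector = sc_pseaac("MKVLAAGIWYHMKRST", TOY_RAW, lam=5, w=0.2)
    print(len(vector))
```

Under these assumptions, λ = 5 gives 20 + 3∗5 = 35 features, and the entries sum to 1 by construction because the numerators add up to the shared denominator.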

2.4. Support Vector Machine

Support vector machine (SVM) is a supervised machine learning technique for classification based on statistical learning theory [65, 66]. Given a set of fixed-length training vectors with labels (positive and negative input samples), an SVM learns a linear decision boundary that discriminates between the two classes. The result is a linear classification rule that can be used to classify new test samples. When the samples are not linearly separable, a kernel function can be used to map them into a higher-dimensional feature space in which an optimal separating hyperplane can be found. SVM has exhibited excellent performance in practice [54, 58, 67–73] and has a strong theoretical foundation in statistical learning. In this study, the publicly available Gist SVM package (http://www.chibi.ubc.ca/gist/) is employed. The default parameters of the Gist package are used, except that the kernel function is set to the radial basis function.

2.5. Ensemble Classifier

The ensemble classifier is able to learn a more expressive concept in classification compared to a single classifier and reduces the variance caused by a single classifier. Therefore, it was employed in many fields and achieved great success [36, 37]. In this paper, we proposed a weighted voting strategy for protein remote homology detection, as shown in Figure 1. The ensemble framework of SVM-Ensemble was constructed by combining SVM-Kmer, SVM-ACC, and SVM-SC-PseAAC with weighted factors. The processing can be formulated as below.

Figure 1

Flowchart showing how the ensemble classifier is formed by combining three basic classifiers at the superfamily level. The ensemble strategy is first applied at the superfamily level, and the query protein P is then assigned to the superfamily for which its score is the highest.

Suppose the ensemble classifier is expressed by

$$\mathbb{C}^{j} = \mathbb{C}_{1}^{j} \oplus \mathbb{C}_{2}^{j} \oplus \mathbb{C}_{3}^{j} \tag{12}$$

where $\mathbb{C}_{i}^{j}$ represents the ith basic SVM classifier on superfamily $S_j$ (1 ≤ j ≤ 54). That is, $\mathbb{C}_{1}^{1}$ represents the classifier SVM-Kmer that operates on superfamily $S_1$, $\mathbb{C}_{2}^{1}$ represents the classifier SVM-ACC that operates on superfamily $S_1$, and $\mathbb{C}_{3}^{1}$ represents the classifier SVM-SC-PseAAC that operates on superfamily $S_1$. $\mathbb{C}^{j}$ is the average performance of the three basic classifiers on superfamily $S_j$ under the weighted voting strategy. In (12), the symbol ⊕ denotes the weighted voting operator. The three basic classifiers are combined using the following equation:

$$\mathbb{C}^{j}(P, S_j) = \frac{1}{3}\sum_{i=1}^{3} w_{i}^{j}\,\mathbb{C}_{i}^{j}(P, S_j) \tag{13}$$

where $\mathbb{C}_{i}^{j}(P, S_j)$ is the belief function or supporting degree for P belonging to $S_j$ predicted by the ith basic classifier, and $w_{i}^{j}$ is the weight factor, set to the average ROC score of the ith basic classifier on superfamily $S_j$.
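The weighted voting step can be sketched as follows. The sketch assumes that the combined score is the mean of the classifier outputs, each weighted by that classifier's average ROC score on the superfamily, and that the query is assigned to the superfamily with the highest combined score. The belief values, weights, superfamily labels, and function names are invented for illustration.

```python
def weighted_vote(beliefs, weights):
    """Average of classifier beliefs, each scaled by its ROC-based weight."""
    if len(beliefs) != len(weights):
        raise ValueError("one weight is needed per basic classifier")
    return sum(w * b for w, b in zip(weights, beliefs)) / len(beliefs)


def predict_superfamily(beliefs_by_superfamily, weights_by_superfamily):
    """Combine the basic classifiers per superfamily and pick the best one."""
    combined = {
        name: weighted_vote(beliefs, weights_by_superfamily[name])
        for name, beliefs in beliefs_by_superfamily.items()
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical outputs of SVM-Kmer, SVM-ACC, and SVM-SC-PseAAC for one query.
beliefs = {"S1": [0.9, 0.2, 0.6], "S2": [0.4, 0.8, 0.5]}
# Hypothetical per-superfamily average ROC scores used as weights.
weights = {"S1": [0.9, 0.8, 0.9], "S2": [0.9, 0.8, 0.9]}

label, scores = predict_superfamily(beliefs, weights)
print(label)
```

Because the weights are fixed per superfamily, a constant normalization factor does not change the ranking of queries within one superfamily; it only matters when scores are compared across superfamilies.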

2.6. Performance Metrics for Evaluation

We evaluated the performance of different methods using receiver operating characteristic (ROC) scores [55, 74–78]. Because the test sets contain more negative than positive samples, simply measuring error rates does not give a good evaluation of performance. In this case, the ROC score is a better way to evaluate the trade-off between specificity and sensitivity. The ROC score is the normalized area under the curve that plots true positives as a function of false positives for varying classification thresholds. A ROC score of 1 indicates perfect separation of positive samples from negative samples, whereas a ROC score of 0.5 indicates random separation. The ROC50 score is the area under the ROC curve up to the first 50 false positives.
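One common way to compute these scores is sketched below: samples are ranked by decreasing classifier output, and the area is accumulated as the number of true positives seen before each false positive. The truncated score uses the normalization n × (number of positives), which is an assumption here, as is the handling of tied scores (they keep their input order). The function name `roc_n` is hypothetical.

```python
def roc_n(samples, n=None):
    """Normalized area under the ROC curve up to the first n false positives.

    samples is a list of (is_positive, score) pairs. With n=None the full
    ROC score is returned; n=50 gives the ROC50 score.
    """
    ranked = sorted(samples, key=lambda pair: pair[1], reverse=True)
    total_pos = sum(1 for is_positive, _ in ranked if is_positive)
    total_neg = len(ranked) - total_pos
    limit = total_neg if n is None else min(n, total_neg)
    if total_pos == 0 or limit == 0:
        raise ValueError("need at least one positive and one negative sample")
    true_pos = 0
    false_pos = 0
    area = 0
    for is_positive, _ in ranked:
        if is_positive:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # positives ranked above this false positive
            if false_pos == limit:
                break
    return area / (limit * total_pos)


toy = [(1, 0.9), (0, 0.8), (1, 0.7), (0, 0.6), (1, 0.5), (0, 0.4)]
print(roc_n(toy))        # full ROC score
print(roc_n(toy, n=1))   # area up to the first false positive
```

On real test sets with thousands of negatives, `roc_n(samples, n=50)` focuses the evaluation on the top of the ranking, which is why ROC50 is stricter than the full ROC score.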

3. Results and Discussion

3.1. The Influence of Parameters on the Predictive Performance of Basic Predictors

Each basic predictor has several parameters that should be optimized. For more information on these parameters, please refer to Materials and Methods. In this study, we optimized them by grid search. The influence of these parameters on performance is shown in Figure 2, and the optimized parameter values and the corresponding results are shown in Table 1, from which we can see that SVM-Kmer achieved the best performance, followed by SVM-SC-PseAAC.

Figure 2

The performance of the three basic predictors over all parameter combinations. A k value of 2 was used in SVM-Kmer and a LAG value of 14 in SVM-ACC. SVM-SC-PseAAC achieves its best performance with λ = 5 and w = 0.2. Parameter w is the main factor affecting performance, whereas parameter λ has only a minor impact.

Table 1

The performance of three basic predictors with optimal parameters on benchmark dataset.

Methods	Optimal parameters	ROC^[a]	ROC50^[a]
SVM-Kmer	k = 2	0.912	0.785
SVM-ACC	LAG = 14	0.787	0.483
SVM-SC-PseAAC	λ = 5, w = 0.2	0.911	0.657

[a]Average ROC and ROC50 scores.

3.2. Performance of Ensemble Classifier Based on Various Feature Combinations with Weighted Voting Strategy

As discussed above, predictors based on different feature sets showed different performance. To further improve the performance of protein remote homology detection, we employed an ensemble learning approach to combine various predictors. The performance of the ensemble classifier for various feature combinations is shown in Table 2. The best ROC score (0.943, with ROC50 = 0.744) was achieved by combining all three basic predictors, and it exceeds the ROC score of every basic predictor in Table 1. In terms of ROC50, the full ensemble improves on SVM-ACC and SVM-SC-PseAAC, although the values reported in Tables 1 and 2 place it below SVM-Kmer alone (0.785) and SVM-Kmer ⊕ SVM-ACC (0.767). The gain in ROC score is not surprising: the three basic predictors are based on different features, and their predictive results are complementary, so the performance can be improved by combining them with an ensemble learning method.

Table 2

Performance of ensemble classifier combining various predictors with weighted voting. The best performance was achieved by combining SVM-Kmer, SVM-ACC, and SVM-SC-PseAAC. The symbol ⊕ denotes the weighted voting operator.

Ensemble methods with superfamily-level strategy	ROC^[a]	ROC50^[a]
SVM-Kmer ⊕ SVM-ACC	0.929	0.767
SVM-Kmer ⊕ SVM-SC-PseAAC	0.937	0.715
SVM-ACC ⊕ SVM-SC-PseAAC	0.922	0.691
SVM-Kmer ⊕ SVM-ACC ⊕ SVM-SC-PseAAC	0.943	0.744

[a]Average ROC and ROC50 scores.

3.3. Feature Analysis for Discriminative Power

To further study the discriminative power of the features in the three basic predictors, we employed a feature extraction method called principal component analysis (PCA) [79] to calculate the discriminative weight vectors in the feature space. The PCA procedure for extracting significant features can be found in [32, 80]. For each basic predictor, the top 10 most discriminative features in the feature space are shown in Table 3. For the Kmer features, six of the ten most discriminative features contain the amino acid M, indicating the importance of this amino acid. For the ACC features, hydrophobicity (h¹) has an important impact on feature discrimination. For the SC-PseAAC features, the amino acid M has the most discriminative power, and features with small λ values are more important. Both the ACC and SC-PseAAC features with strong discriminative power incorporate sequence-order effects. Taken together, the three kinds of features consider both sequence composition and sequence-order effects, so SVM-Ensemble can further improve the performance by combining them in an ensemble learning approach.

Table 3

Top 10 most discriminative features in three feature spaces. These features describe the characteristics of proteins from various perspectives.

Rank	Kmer	ACC	SC-PseAAC
1	MH	CC_h¹h²,lag=9	M
2	WC	AC_h¹,lag=5	Y
3	IM	CC_h¹h²,lag=8	τ_h²,λ=1
4	MC	AC_h¹,lag=4	τ_h²,λ=4
5	MY	CC_h¹h²,lag=7	H
6	VM	AC_h¹,lag=14	τ_h¹,λ=4
7	YW	AC_m,lag=13	G
8	YR	CC_h¹m,lag=13	τ_h¹,λ=1
9	HW	CC_h¹h²,lag=10	τ_m,λ=1
10	MQ	AC_h¹,lag=8	τ_m,λ=3

Note: the subscript indexes in the ACC and SC-PseAAC features denote hydrophobicity (h¹), hydrophilicity (h²), and mass (m).

3.4. Comparison with Other Related Predictors

Several state-of-the-art methods for protein remote homology detection were selected for comparison with the proposed SVM-Ensemble. SVM-Pairwise [5] represents each protein as a vector of pairwise similarities to all proteins in the training set. The kernel of SVM-LA [26] measures the similarity between a pair of proteins by taking into account all the optimal local alignment scores with gaps between all possible subsequences. The mismatch kernel [28] is calculated from occurrences of (k, m)-patterns in the data. Monomer-dist [47] constructs the feature vectors from the occurrences of short oligomers. SVM-DR is based on distance-pairs, and PseAACIndex is based on the pseudo amino acid composition (PseAAC). disPseAAC constructs the feature vectors by combining the occurrences of amino acid pairs within Chou's pseudo amino acid composition. Experimental results of the various methods on the SCOP 1.53 benchmark dataset are shown in Table 4. SVM-Ensemble achieved the highest ROC score (0.943), and its ROC50 score (0.744) is second only to that of SVM-Top-n-gram-combine-LSA (0.767), which supports combining different predictors via an ensemble learning approach.

Table 4

Performance comparison of different methods on the benchmark dataset.

Methods	ROC^[a]	ROC50^[a]	Source
SVM-Ensemble	0.943	0.744	This study
SVM-Pairwise	0.896	0.464	Liao and Noble, 2003 [5]
SVM-LA (β = 0.5)	0.925	0.649	Saigo et al., 2004 [26]
Mismatch	0.925	0.649	Leslie et al., 2004 [28]
Monomer-dist	0.919	0.508	Lingner and Meinicke, 2006 [47]
SVM-WCM	0.904	0.445	Lingner and Meinicke, 2008 [51]
SVM-Ngram-LSA	0.859	0.628	Dong et al., 2006 [48]
SVM-Pattern-LSA	0.879	0.626	Dong et al., 2006 [48]
SVM-Motif-LSA	0.859	0.628	Dong et al., 2006 [48]
SVM-Top-n-gram-combine-LSA	0.939	0.767	Liu et al., 2008 [4]
PseAACIndex (λ = 5)	0.880	0.620	Liu et al., 2013 [31, 52]
PseAACIndex-Profile (λ = 5)	0.922	0.712	Liu et al., 2013 [31, 52]
SVM-DR	0.919	0.715	Liu et al., 2014 [50, 53–55]
disPseAAC	0.922	0.721	Liu et al., 2015 [2, 32, 44, 45, 56–59]

[a]Average ROC and ROC50 scores.

4. Conclusions

In this study, we have proposed an ensemble classifier for protein remote homology detection, called SVM-Ensemble. It was constructed by combining three basic classifiers with a weighted voting strategy. Experimental results on a widely used benchmark dataset showed that our method achieved a ROC score of 0.943, which is clearly better than those of the three basic predictors SVM-Kmer, SVM-ACC, and SVM-SC-PseAAC. Compared with other state-of-the-art methods, SVM-Ensemble achieved the highest ROC score. Furthermore, analysis of the discriminative power of these features revealed some interesting patterns. In future work, more effective features and machine learning techniques will be explored. In addition, evolutionary computation [81], ensemble learning techniques, and neural-like computing models [82-87] could be applied to other bioinformatics problems, such as gene-disease relationship prediction [52, 88–92] and DNA motif identification [59, 93].

Supplementary Material

Table S1: The amino acid physicochemical indices and corresponding values for hydrophobicity, hydrophilicity, and mass.

80 in total

1. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment.

Authors: Michael Remmert; Andreas Biegert; Andreas Hauser; Johannes Söding
Journal: Nat Methods Date: 2011-12-25 Impact factor: 28.547

2. An improved profile-level domain linker propensity index for protein domain boundary prediction.

Authors: Yanfeng Zhang; Bin Liu; Qiwen Dong; Victor X Jin
Journal: Protein Pept Lett Date: 2011-01 Impact factor: 1.890

3. iMiRNA-PseDPC: microRNA precursor identification with a pseudo distance-pair composition approach.

Authors: Bin Liu; Longyun Fang; Fule Liu; Xiaolong Wang; Kuo-Chen Chou
Journal: J Biomol Struct Dyn Date: 2015-03-03

4. PseDNA-Pro: DNA-Binding Protein Identification by Combining Chou's PseAAC and Physicochemical Distance Transformation.

Authors: Bin Liu; Jinghao Xu; Shixi Fan; Ruifeng Xu; Jiyun Zhou; Xiaolong Wang
Journal: Mol Inform Date: 2014-09-26 Impact factor: 3.353

