Literature DB >> 24318998

Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection.

Bin Liu¹, Deyuan Zhang, Ruifeng Xu, Jinghao Xu, Xiaolong Wang, Qingcai Chen, Qiwen Dong, Kuo-Chen Chou.

Abstract

MOTIVATION: Owing to its importance in both basic research (such as molecular evolution and protein attribute prediction) and practical application (such as timely modeling the 3D structures of proteins targeted for drug development), protein remote homology detection has attracted a great deal of interest. It is intriguing to note that the profile-based approach is promising and holds high potential in this regard. To further improve protein remote homology detection, a key step is how to find an optimal means to extract the evolutionary information into the profiles.
RESULTS: Here, we propose a novel approach, the so-called profile-based protein representation, to extract the evolutionary information via the frequency profiles. The latter can be calculated from the multiple sequence alignments generated by PSI-BLAST. Three top performing sequence-based kernels (SVM-Ngram, SVM-pairwise and SVM-LA) were combined with the profile-based protein representation. Various tests were conducted on a SCOP benchmark dataset that contains 54 families and 23 superfamilies. The results showed that the new approach is promising, and can obviously improve the performance of the three kernels. Furthermore, our approach can also provide useful insights for studying the features of proteins in various families. It has not escaped our notice that the current approach can be easily combined with the existing sequence-based methods so as to improve their performance as well.
AVAILABILITY AND IMPLEMENTATION: For users' convenience, the source code of generating the profile-based proteins and the multiple kernel learning was also provided at http://bioinformatics.hitsz.edu.cn/main/~binliu/remote/

Entities: Chemical Disease Gene Species

Mesh：

Substances：

Year: 2013 PMID： 24318998 PMCID： PMC7537947 DOI： 10.1093/bioinformatics/btt709

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

By March 2013, 89 003 experimentally determined protein structures were deposited in the Protein Data Bank (Berman ). However, this number is only about one-sixth of 539 616, the number of protein sequences held in the UniProtKB/Swiss-Prot database (Wu ). To timely use such vast amount of structure-unknown protein sequences for basic research and drug development, it is highly desired to determine their 3D structures and functions by means of homology approaches (Chou, 2004). Unfortunately, protein remote homology detection is still a challenging problem in bioinformatics. The early methods in dealing with this problem were based on the pairwise sequence comparison approaches, such as BLAST (Altschul ) and Smith–Waterman local alignment algorithm (Smith and Waterman, 1981). However, in many cases, this kind of sequence alignment method failed to detect remote homologies due to the low sequence similarities. Later methods to challenge this problem were based on the generative models to induce a probability distribution over the protein family, and then to generate the unknown proteins as new members of the family from the stochastic model. For example, the hidden Markov model (HMM) (Karplus ) can be trained iteratively in a semi-supervised manner by using both positively labeled and unlabeled samples of a particular family to generate the positive set (Qian and Goldstein, 2004). Recently, the discriminative methods, such as support vector machine (SVM) (Vapnik, 1998), were used to address this problem by focusing on the differences between protein families. The key of the SVM methods is the kernel function by which to compute the inner product between two samples in the feature space. The most straightforward approach to generate the kernels was based on the features extracted from protein sequences. SVM-Ngram (Dong ), SVM-pairwise (Liao and Noble, 2003) and SVM-LA (Saigo ) were three of the most successful sequence-based kernels. SVM-Ngram (Dong ) was based on the feature space that contains all short subsequence of length N. In SVM-pairwise (Liao and Noble, 2003), a protein sequence was represented as a vector of pairwise similarities to all protein sequences in the training set, and then inner product between these vector-space representations was taken as the kernel. SVM-LA (Saigo ) measured the similarity between a pair of proteins by taking all the optimal local alignment scores with gaps between all possible subsequences into account. Besides these kernels, several other sequence-based kernels were also proposed, such as Mismatch (Leslie ) and SVM-BALSA (Webb-Robertson ). The profile-based kernels could further improve the performance by using the evolutional information extracted from the profiles. For example, Top-n-grams (Liu ) extracted the profile-based patterns by considering the most frequent elements in the profiles; profile kernel (Kuang ) extracted the short substrings according to the profile-based ungapped alignment scores; some profile-based methods improved the predictive performance by developing more sensitive profiles. HHsearch method (Söding, 2005) was based on a novel profile using the HMM. In COMPASS (Sadreyev ), numerical profiles were generated to construct optimal profile–profile alignments and to estimate the statistical significance of the corresponding alignment scores. In the meantime, some other features and techniques have been applied to this field to further improve the predictive performance. For instance, the kernel combination methodology (VBKC) (Damoulas and Girolami, 2008) used a single multi-class kernel machine to combine various kernels based on different feature spaces; SVM-physicochemical distance transformation (PDT) (Liu ) combined the amino acid physicochemical properties and the profile features via PDT to incorporate the local sequence-order information of the entire protein sequences. Also, based on the similarities between protein sequences and natural languages, the natural language processing techniques were applied to this field. It was shown that the performance of building-block-based methods could be improved by using the latent semantic analysis (LSA) (Dong ). Moreover, PROTEMBED (Melvin ) detected protein remote homology by embedding protein sequences into a low-dimensional semantic space. As we can see from the aforementioned introduction, most of the top-performing methods were developed based on the features extracted from profiles. This is consistent with the fact that a profile is much richer than an individual sequence in encoding information. Also, biology is a natural science with historic dimension. All biological species have developed beginning from a limited number of ancestral species. It is true for protein sequence as well (Chou, 2004). Their evolution involves changes of single residues, insertions and deletions of several residues, gene doubling and gene fusion (Chou, 1995). With these changes accumulated for a long period, many similarities between initial and resultant amino acid sequences are gradually eliminated, but the corresponding proteins may still share many common features, such as having basically the same biological function (Loewenstein ), folding topology, subcellular location and other attributes (Chou, 2013). Accordingly, the key to improve the performance of these methods is to find a suitable approach to extract the evolutionary information from the profiles. In view of this, the current study was initiated in an attempt to propose a profile-based protein representation by extracting the evolutionary information from the frequency profiles.

2 MATERIALS AND METHODS

As shown by a series of publications (Chen ; Liu ; Xiao ; Xu ) and summarized in a comprehensive review (Chou, 2011), to develop a useful statistical prediction method or model for a biological system, one needs to engage the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; and (v) provide the downloadable source code or a web-server for the prediction method. Below, let us describe how to deal these procedures.

2.1 SCOP benchmark

Suppose is a remote homology system investigated in the current study that contains 4352 protein sequences, which were taken from (Liao and Noble, 2003) at http://noble.gs.washington.edu/proj/svm-pairwise/. These proteins were selected from SCOP version 1.53 and extracted from the Astral database (Brenner ). None of the 4352 proteins has sequence pairwise similarity to any other with higher than in the E-value [for more about the E-value and its implication in reducing homology bias and redundancy, see (Brenner )]. These proteins were also used by others (Dong ; Lingner and Meinicke, 2006; Saigo ) to study remote homology detection. The 4352 proteins in can be classified into 853 superfamilies and 1356 families; i.e. where is the superfamily, is the family and the symbol represents the ‘union’ in the set theory. For readers’ convenience, the codes of the 4352 proteins and their sequence as well as the attributes of their families and superfamilies are given in Supplementary Material S1. Because some families and superfamilies in do not contain significant number of protein sequences, and also because the negative dataset for each protein family can be any proteins except those belonging to its own superfamily, it is not so straightforward but a little more complicated and subtle for how to select protein samples from to define the training and testing datasets. To provide a clear description, let us consider a different manner to address this. As demonstrated by many previous studies on a series of important biological topics, such as enzyme-catalyzed reactions (Zhou and Deng, 1984), inhibition of human immunodeficiency virus-1 reverse transcriptase (Althaus ), drug metabolism systems (Chou, 2010) and applying wenxiang diagram or graph (Chou ) to study protein–protein interactions (Zhou, 2011; Zhou and Huang, 2013), using graphical approaches to study complicated problems can provide an intuitive picture and useful insights for in-depth studying and analyzing various complicated relations in these systems (Lin and Lapointe, 2013). In view of this, let us also use graphic approach to describe the feature and relation of the families and superfamilies in , as shown in Figure 1, where the open circles denote the families or superfamilies that have significant number of protein sequences and the gray circles denote those that do not.

Fig. 1.

A tree (or 3-row) graph to show the remote homology system on the SCOP benchmark. Only the open circles are in the target of the 23 superfamilies and 54 families, while the circles in gray are outside of the target. See the text for further explanation Of the 1356 families in (cf. Equation 1), 54 contain significant number of proteins (see the third row of Fig. 1) and form a target family set ; i.e. Of the 853 superfamilies in , 23 contain at least one target family (see the open circles in the second row of Fig. 1) and form a target superfamily set ; i.e. Thus, we have meaning that is the subset of , and is the subset of ; each of the three contains 857, 1508 and 4352 proteins, respectively. Now, for each of the 54 families in the target family set , we can define a training dataset and testing dataset given by where the positive training dataset contains at least 10 of its superfamily members, none of which belongs to the family, and the positive testing dataset contains at least five protein domains within the family. The proteins in the negative training and testing datasets, and , were picked from by excluding the superfamily of the family and randomly split between the two in the same ratio as the positive ones. The 54 training and testing datasets thus obtained are given in the Supplementary Materials S2 and Supplementary Data, respectively.

2.2 Protein frequency profile

The frequency profile for protein with L amino acids can be represented by where 20 is the total number of standard amino acids; m (0 ≤ m ≤ 1) is the target frequency, which reflects the probability of amino acid i (i = 1,2, … ,20) occurring at the sequence position j (j = 1,2, … ,L) in protein during evolutionary processes. For each column in , the elements add up to 1. The target frequency is calculated from the multiple sequence alignments generated by running PSI-BLAST (Altschul ) against the NCBI’s NR with default parameters except that the number of iterations was not set at 1 but was set at 10 in the current study. The target frequency of amino acid i in sequence position j is calculated as: where f is the observed frequency of amino acid i in column j; β is a free parameter set to a constant value of 10, which is initially used by PSI-BLAST; α is the number of different amino acids in column j − 1; and g is the pseudo-count for amino acid i in protein sequence position j, which can be calculated as: where p is the background frequency of amino acid k, and q is the score of amino acid i being aligned to amino acid k in BLOSUM62 substitution matrix, which is the default score matrix of PSI-BLAST (Altschul ).

2.3 Profile-based protein representation

Although the methods by using amino acid sequence information achieve certain degree of success, only using sequence information cannot accurately detect protein remote homology. Recent studies demonstrated that the profile-based methods would show better performance because a profile is richer than an individual sequence as far as the encoding information is concerned. However, a profile is a 2D matrix, whereas a protein sequence is an amino acid string. Therefore, the 2D evolutionary profile information cannot be directly incorporated into the sequence-based methods for prediction. To deal with this problem, we propose an approach to convert the frequency profiles into a series of profile-based proteins. Thus, the existing sequence-based methods can be directly performed on these proteins for the prediction. The target frequencies in the frequency profiles reflect the probabilities of the corresponding amino acids appearing in the specific sequence positions. The higher the frequency is, the more likely the corresponding amino acid occurs. It is reasonable to use the nth most frequent amino acids in the frequency profiles to represent the protein sequences. Below is the description on how to convert frequency profiles into profile-based proteins. Given the frequency profile for protein (Equation 6), we can sort the amino acids in each column according to a descending order. The frequency profile thus obtained by the sorting operation is called the sorted frequency profile and denoted by . For each row in , the amino acids are combined to produce the profile-based protein. By following this approach, the frequency profile is converted into 20 profile-based proteins p1, p2, … , p20 (Supplementary Fig. S1 in Supplementary Material S4), which contain the evolutionary information in the frequency profile. These 20 proteins have different importance. During evolutionary process, protein is preferred to transform into p1, but not preferred to transform into p20. For reader’s convenience, the source code for generating the profile-based proteins is accessible by clicking the link at http://bioinformatics.hitsz.edu.cn/main/∼binliu/remote/.

2.4 Sequence-based kernels

Three state-of-the-art sequence-based kernels [SVM-Ngram (Dong ), SVM-pairwise (Liao and Noble, 2003) and SVM-LA (Saigo )] were used to validate whether the proposed approach could improve their performance. In SVM-Ngram (Dong ) method, the Ngrams were the set of all possible subsequences of amino acids of a fixed length. A protein sequence was mapped to a feature vector by the occurrence frequency of each Ngram. The value of N was set at 3 as suggested by the authors (Dong ), and therefore the dimension of the vector is 8000. At the heart of the SVM is a kernel function that acts as a similarity score between pairs of vectors. The kernel was normalized so that each vector had length 1 in the feature space: where X and Y are two proteins in the dataset. This normalized step was also used by SVM-pairwise (Liao and Noble, 2003) and SVM-LA (Saigo ). The normalized kernel thus obtained was then transformed into a radial basis kernel. In the SVM-pairwise (Liao and Noble, 2003) method, the feature vector was a list of pairwise sequence similarity scores, computed with respect to all of the sequences in the training set. The radial basis function was used as the kernel. The rest steps were the same as the ones used in SVM-Ngram (Dong ). In the SVM-LA (Saigo ), the kernel was calculated by summing up scores obtained from the local alignments with gaps between the two sequences, computed by Smith–Waterman dynamic programming algorithm. Such kernel might not be a positive definite kernel and the authors (Saigo ) provided two solutions for this problem. Owing to its performance and simplicity, we implemented one of the two methods, namely, the LA-ekm kernel. The parameters of LA-ekm kernel took the optimal values (β = 0.5, d = −11, e = −4).

2.5 Multiple kernel learning

The kernel described in the previous section can be used by kernel methods to train the SVM classifier. Each kernel contains different discriminative information, and therefore combining the kernels automatically is a promising way to improve the performance. In machine learning field, this approach is called multiple kernel learning (MKL) (Cortes ; Varma and Babu, 2009), which has attracted a lot of attention recently. The MKL technique aimed to combine different kernels to improve the performance, and showed the state-of-the-art results on image classification field (Varma and Babu, 2009). In this article, we focused on the weighted linear combination of kernels. The weight of each kernel can be optimized based on different criterion, which can be categorized by two groups. One group is the one-stage kernel learning methods, which optimize the weight and the SVM objective function simultaneously (Varma and Babu, 2009). These methods suffer from the high training complexity. The other group is two-stage kernel learning methods, which optimize the weight by using a criterion first and then train the SVM classifier using the kernel combined by the learned weight of each kernel. Compared with one-stage learning methods, the two-stage kernel learning methods showed better performance with reduced training cost. Therefore, in this study, we adopted the two-stage kernel learning method. Specifically, the kernel target alignment (KTA) objective function was used to optimize the weight of each kernel, which showed theoretical guarantees and can improve the performance in practice (Cortes ; Varma and Babu, 2009). Given m training samples x1, x2, … , x and their corresponding labels y1, y2, … , y, the ideal kernel matrix can be formulated as K = y, where y is the vector of labels [y1, y2, … , y]. For the given n kernels K, the aim is to learn the weight of each kernel. To avoid the kernel scaling problem, we center kernel K and the corresponding ideal kernel K in feature space by the following equation: where K is normalized by: Following the above steps, each kernel is normalized, and then these n kernels are linearly combined by the following equation: where w (, ) is the weight of kernel . The weight is learned by KTA objective function, which maximizes the alignment between and the centered ideal kernel . This leads to a quadratic program problem and can be solved quite efficiently. For implementation details, please refer to (Cortes ). In this study, three kernels (K, K1 and K2) for each selected sequence-based method were linearly combined by using the above KTA approach to further improve the performance. For reader’s convenience, the source code of the MKL is accessible by clicking the link at http://bioinformatics.hitsz.edu.cn/main/∼binliu/remote/.

2.6 SVM

SVM is a class of supervised learning algorithms first introduced by Vapnik (1998). SVM-based machine learning algorithm has been successfully used to investigate various problems in molecular biology, such as identifying DNA recombination spots (Chen ), membrane protein types (Cai ) and heat-shock protein functions (Feng ), among many others. In this study, the publicly available Gist SVM package (http://www.chibi.ubc.ca/gist/) was used.

2.7 Evaluation methodology

Because the test sets have many more negative than positive samples, simply measuring error-rates will not give a good evaluation of performance. For the cases in which the positive and negative samples are not evenly distributed, the best way to evaluate the trade-off between the specificity and sensitivity is to use a receiver operating characteristic (ROC) score (Gribskov and Robinson, 1996). An ROC score is the normalized area under a curve that plots true positives against false positives for different classification thresholds. A score of 1 means perfect separation of positive samples from negative ones, whereas a score of 0 means that none of the sequences selected by the algorithm is positive. Another performance measure is ROC50 score, which is the area under the ROC curve up to the first 50 false positives.

3 RESULTS AND DISCUSSION

3.1 Profile-based protein representation can improve the performance of methods based on sequence composition

The frequency profile of a protein can be converted into 20 profile-based proteins (p1, p2, … , p20) by using the proposed approach (see Section 2 for details). These 20 proteins have different importance. p1 is the most important protein, as it is the combination of the top frequent amino acids in frequency profile, whereas p20 is the profile-based protein to which protein is the most unlikely to convert because it is the combination of the amino acids with lowest frequencies in frequency profile. If all the 20 profile-based proteins are used in the prediction, the computational cost is relatively high. In this study, only the top n most important profile-based proteins (p1, … , pn) were used in the prediction. To select the value of n, the following experiment was conducted. The frequencies of 20 standard amino acids in each column of a frequency profiles add up to 1. Therefore, the average frequency is 0.05 (1/20 = 0.05). If an amino acid with frequency >0.05, it is likely to occur during evolutionary process; otherwise, it is not likely to occur. The percentage of the amino acids with frequencies >0.05 in each profile-based protein on the SCOP benchmark was calculated, and the results are shown in Figure 2. As we can see from the figure, such amino acids are abundant in profile-based proteins p1, p2 and p3 (99.99%, 99.60% and 98.13%, respectively), but for the other 17 profile-based proteins, the percentage decreases significantly (from 89.28 to 0%). Therefore, in this study, only the top three profile-based proteins were used in the prediction. These profile-based proteins were combined with three state-of-the-art methods based on sequence composition, including SVM-Ngram (Dong ), SVM-pairwise (Liao and Noble, 2003) and SVM-LA (Saigo ), and the results are shown in Supplementary Table S1 of Supplementary Material S4. For each of the three methods, the best performance was achieved for the top important protein p1. Compared with the methods performed on the raw protein sequence , the performance of the proposed methods can be improved by 3.7∼7.5% and 9.6∼13.7% in terms of average ROC and ROC50 scores, respectively, indicating that the proposed profile-based protein representation is useful for protein remote homology detection. The performance of the methods performed on p2 is similar as that of the methods performed on the raw protein . The predictive results of the methods performed on p3 were the lowest. These results are consistent with the different importance of the three profile-based proteins p1, p2 and p3.

Fig. 2.

Illustration to show the feature of frequency profile. Percentage of amino acids with frequencies > 0.05 in the 20 profile-based proteins derived from SCOP benchmark

3.2 Comparison with closely related methods

Besides the current method, there are some other methods for predicting protein remote homologies based on profiles, such as SVM-Top-n-gram-combine-LSA (Liu ), SVM-PDT-Profile (Liu ), Profile (Kuang ), BioSVM-2L (Muda ) and HHSearch (Söding, 2005). SVM-Top-n-gram-combine-LSA (Liu ) extracted the building blocks of proteins from the frequency profiles, which could be treated as the ‘words’ of protein language. The LSA (Dong ) was applied to further improve the performance of this method. SVM-PDT-Profile (Liu ) combined the amino acid physicochemical properties in the Amino Acid Index (AAIndex) (Kawashima ) with the frequency profiles for the prediction. The feature vector of Profile method (Kuang ) was constructed by the short subsequences whose PSSM-based ungapped alignment score was above a predefined threshold. BioSVM-2L constructed two-layer SVM classifiers with profile-based kernels (Muda ). All the above three methods were based on SVM, and the difference among them was in the extracted features. HHSearch (Söding, 2005) was one of the best protein remote homology detection methods, which used a novel profile-based HMM. The results obtained by these four methods on the SCOP benchmark are listed in Supplementary Table S1 of Supplementary Material S4, from which we can see that the current method outperforms SVM-Top-n-gram-combine-LSA (Liu ), SVM-PDT-Profile (Liu ) and BioSVM-2L (Muda ) and is highly comparable with Profile (Kuang ) and HHSearch (Söding, 2005), indicating that the profile-based protein representation is a promising approach to extract the evolutionary information from frequency profiles for protein remote homology detection.

3.3 Combining different methods via MKL

As mentioned above, the approaches based on the top two profile-based proteins p1, p2 and the raw protein are among the top performing methods. It is interesting to investigate whether these methods can be combined to further improve the performance. In this study, the MKL framework was used to combine these methods. The KTA method was used to automatically optimize the weight of each kernel on the training set, and then these kernels are combined with weights into a single kernel for the SVM-based prediction. The results are shown in Supplementary Table S2 of Supplementary Material S4 as well as Supplementary Materials S5–S7. The MKL approach can improve the performance of SVM-Ngram (Dong ), but only has minor impact on the SVM-pairwise (Liao and Noble, 2003) and SVM-LA (Saigo ). To uncover the reason, the weight of each kernel was analyzed. For each kernel, the average weight on all the 54 protein families is shown in Supplementary Table S2 of Supplementary Material S4. For these three methods, the p1-based kernel was weighted most heavily. For SVM-pairwise (Liao and Noble, 2003) and SVM-LA (Saigo ), the weight values of their corresponding and p2 kernels are <0.1, indicating these kernels only have minor impact on the final results, and hence the performance improvement is modest. VBKC (Damoulas and Girolami, 2008) is another method based on the MKL, which combined four string kernels: SVM-pairwise (Liao and Noble, 2003), SVM-LA (Saigo ), SVM-MM (Leslie ) and SVM-Mono (Lingner and Meinicke, 2006). Our proposed SVM-pairwise-KTA and SVM-LA-KTA outperform VBKC (Damoulas and Girolami, 2008) by 1.2 ∼ 2.2% and 29.9 ∼ 31.3% according to the average ROC and ROC50 scores, respectively. The obvious performance improvement is mainly due to the proposed profile-based protein representation and MKL approach.

3.4 Correlations between discriminative power of Ngrams and protein families

The SVM-Ngram (Dong ) method is based on the explicit feature space representation, which provides the possibility to measure the correlations between Ngrams and protein families. The sequence-specific weight learnt from the SVM training process can be used to calculate the discriminant weight for each Ngram, which indicates the importance of the corresponding Ngram. By following Lingner and Meinicke’s approach (Lingner and Meinicke, 2008), given the weight vector of a set of M sequences obtained from the kernel-based training process α = [α1, α2, α3, … , α], the discriminant weight vector w in the feature space can be calculated by the following equation: where is the matrix of sequence representatives. The magnitude of the element in w represents the discriminative power of the corresponding feature. In most protein families, kernel p1 is weighted more heavily than kernel and kernel p1. Two such protein families (SCOP ID: 2.1.1.4 and 3.2.1.5) were selected from the SCOP benchmark for further study, and the results are shown in Supplementary Tables S3 and Supplementary Data of Supplementary Material S4, respectively. For each kernel, the top 10 most discriminative Ngram features calculated by Equation 14 are shown in the tables too. For protein family 2.1.1.4, kernel and kernel p1 share some common most discriminative Ngrams, such as ‘mtm’, ‘yty’, ‘mtf’ and ‘wwf’, indicating these Ngrams remain stable during evolutionary process and therefore these Ngrams would be the important sequence patterns for maintaining the structure and function of this protein family (Supplementary Table S3 of Supplementary Material S4). However, there are a few common most discriminative Ngrams between kernel and kernel p1 in protein family 3.2.1.5. The top 10 most discriminative Ngrams of kernel p1 are all different from those in kernel (Supplementary Table S4 of Supplementary Material S4). These Ngrams would contribute to the higher discriminative power of kernel p1 for this protein family. Although in most cases, kernel p1 was weighted most heavily, some exceptions were observed. For example, for protein family 7.3.6.1, kernel p2 is the most discriminative kernel with weight value of nearly 1, while the other two kernels only have little contribution to the MKL (Supplementary Table S5 of Supplementary Material S4). The top 10 most discriminative Ngrams for each kernel were investigated, and the results are shown in Supplementary Table S5 of Supplementary Material S4, from which some interesting patterns can be observed. The Ngrams containing amino acids ‘n’ and ‘f’ tended to show strong discriminative power in both kernel and kernel p1, whereas amino acid ‘a’ was abundant in the top discriminative Ngrams in kernel p2, indicating the Ngrams with amino acid ‘a’ could better describe the prosperities of protein family 7.3.6.1 in the evolutionary process.

3.5 Application of the proposed remote homology detection methods for studying the 3D structure of Nck5a

In addition to provide useful insights for evolution study, protein remote homology detection is useful for drug development as well. As is well known, many drug-targeted proteins are still without X-ray or nuclear magnetic resonance structure. Pharmaceutical scientists have to resort to the homology modeling technique or structural bioinformatics tools (Chou, 2004) to timely develop their 3D structures, so as to be able to conduct molecular docking study (Chou ; Wang ), one of the key steps in structure-based drug design. However, a reliable template, or a structure-known protein homologous to the target protein, is the necessary prerequisite in this regard (Chou, 2004). Unfortunately, many target proteins did not have significant sequence similarity with any structure-known proteins, and hence it was hard to find a proper template to develop their 3D structures. Actually, many of them did have structure-known homologous proteins, but the problem was how to detect them. For example, the sequence similarity between Nck5a and CyclinA was <20% (Chou ) and hence their homologous relationship could not be detected by the simple sequence alignment technique (Mohabatkar, 2010). Now let us see what will happen if the current remote homology detection technique is applied. To realize this, a dataset was constructed based on SCOP, from which 11 proteins were selected as the positive samples in the cyclin family (SCOP ID: a.74.1.1), while 3605 negative samples were selected from the SCOP version 1.67 by excluding all the proteins within the cyclin-like superfamily. None of these proteins shares >95% sequence similarity. Trained with such 11 positive proteins and 3605 negative proteins, the proposed best performing method SVM-LA (p1) was used to predict Nck5a. It was found that Nck5a is homologous to CyclinA, fully consistent with the experimental results obtained by the site-directed mutagenesis studies (Tang ). Actually, Chou did use CyclinA as a template to construct the 3D structure of activation domain of Nck5a, one of the important parts of tau protein kinase II, an important therapeutic target against Alzheimer’s disease. Furthermore, based on the computed structure thus obtained, the molecular truncation experiments (Zhang ) were conducted with an outcome that confirmed and validated the structure computed by using such a remote homologous protein as a template. Therefore, it is anticipated that the proposed method for detecting remote homology proteins will certainly enhance the power of homology modeling, and hence have impacts on drug development as well.

4 CONCLUSION

Discriminative methods based on SVM are the most effective and accurate methods for protein remote homology detection. The performance of the SVM-based methods depends on the kernel function, which measures the similarity between the samples in any pair. Varieties of kernels based on sequence composition have been proposed. However, these methods often fail to accurately predict the proteins sharing low sequence similarity. Recently, methods using the evolutionary information extracted from profiles achieved great success, such as Profile (Kuang ), SW-PSSM (Rangwala and Karypis, 2005), SVM-Top-Ngram (Liu ) and SVM-ACC (Liu ). A key step to improve the performance of these methods is in how to find a suitable approach to incorporate the evolutionary information extracted from the profiles for prediction. In this article, we proposed a method that can convert the frequency profile into a series of profile-based proteins. Three state-of-the-art sequence-based kernels, i.e. SVM-Ngram (Dong ), SVM-pairwise (Liao and Noble, 2003) and SVM-LA (Saigo ), were selected for demonstration on a well-known benchmark. It was shown that the methods based on the profile-based proteins p1 and p2 achieved the best performance, outperforming the original three string kernels by 3.7 ∼ 7.5% and 9.6 ∼ 13.7%, respectively, according to the average ROC and ROC50 scores. These results are fully consistent with our previous findings that the top two most frequent amino acids show stronger discriminative power than the other low frequent amino acids in the frequency profiles (Liu ), further confirming that the proposed profile-based protein representation is a promising approach in extracting the evolutionary information from frequency profiles for protein remote homology detection. It has not escaped our notice that the current approach can be easily combined with sequence-based methods, and hence, with the development of the sequence-based kernels, the currently proposed method can be further improved accordingly. It is instructive to point out that since the concept of pseudo amino acid composition, or Chou’s PseAAC (Lin and Lapointe, 2013), was introduced in 2001 (Chou, 2001), it has been successfully used to predict various attributes of proteins (e.g. Chen and Li, 2013; Chou, 2005; Georgiou ; Huang and Yuan, 2013; Khosravian ; Mohabatkar, 2010; Mohabatkar , 2013; Mohammad Beigi ; Nanni ; Sahu and Panda, 2010; Zhang ; Zhou ; Liu ). Accordingly, the potential would be high to develop a powerful method for protein remote homology detection by combing PseAAC with profile-based protein representation. In the original PseAAC, it only uses three indices, including the hydrophobicity index, hydrophilicity index and side-chain mass index. Because protein remote homology detection is a more difficult problem, proteins in the dataset only share low sequence similarity. Only these three indices would not be enough to capture the different properties of various proteins. Therefore, our further research will focus on incorporating new amino acid indices into PseAAC and applying it to protein remote homology detection. Funding: National Natural Science Foundation of China [numbers 61300112, 61370165, 61173075, 61272383 and 61370010), the Natural Science Foundation of Guangdong Province [numbers S2012040007390 and S2013010014475], the Scientific Research Innovation Foundation in Harbin Institute of Technology (project number HIT.NSRIF.2013103), the Shanghai Key Laboratory of Intelligent Information Processing, China [grant number IIPL-2012-002]. Conflict of Interest: none declared Click here for additional data file.

63 in total

1. Support vector machines for predicting membrane protein types by using functional domain composition.

Authors: Yu-Dong Cai; Guo-Ping Zhou; Kuo-Chen Chou
Journal: Biophys J Date: 2003-05 Impact factor: 4.033

2. Mismatch string kernels for discriminative protein classification.

Authors: Christina S Leslie; Eleazar Eskin; Adiel Cohen; Jason Weston; William Stafford Noble
Journal: Bioinformatics Date: 2004-01-22 Impact factor: 6.937

3. Profile-based direct kernels for remote homology detection and fold recognition.

Authors: Huzefa Rangwala; George Karypis
Journal: Bioinformatics Date: 2005-09-27 Impact factor: 6.937

4. Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection.

Authors: Theodoros Damoulas; Mark A Girolami
Journal: Bioinformatics Date: 2008-03-31 Impact factor: 6.937

5. The convergence-divergence duality in lectin domains of selectin family and its implications.

Authors: K C Chou
Journal: FEBS Lett Date: 1995-04-17 Impact factor: 4.124

6. Hidden Markov models for detecting remote protein homologies.

Authors: K Karplus; C Barrett; R Hughey
Journal: Bioinformatics Date: 1998 Impact factor: 6.937

7. Steady-state kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-87201E.

Authors: I W Althaus; J J Chou; A J Gonzales; M R Deibel; K C Chou; F J Kezdy; D L Romero; P A Aristoff; W G Tarpley; F Reusser
Journal: J Biol Chem Date: 1993-03-25 Impact factor: 5.157

8. Identification of the N-terminal functional domains of Cdk5 by molecular truncation and computer modeling.

Authors: Jianwen Zhang; Chi-Hao Luan; Kuo-Chen Chou; Gail V W Johnson
Journal: Proteins Date: 2002-08-15

9. Performance of an iterated T-HMM for homology detection.

Authors: Bin Qian; Richard A Goldstein
Journal: Bioinformatics Date: 2004-03-25 Impact factor: 6.937

10. Prediction of protein binding sites in protein structures using hidden Markov support vector machine.

Authors: Bin Liu; Xiaolong Wang; Lei Lin; Buzhou Tang; Qiwen Dong; Xuan Wang
Journal: BMC Bioinformatics Date: 2009-11-20 Impact factor: 3.169

97 in total

1. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition.

Authors: Hao Lin; En-Ze Deng; Hui Ding; Wei Chen; Kuo-Chen Chou
Journal: Nucleic Acids Res Date: 2014-10-31 Impact factor: 16.971

2. FALCON@home: a high-throughput protein structure prediction server based on remote homologue recognition.

Authors: Chao Wang; Haicang Zhang; Wei-Mou Zheng; Dong Xu; Jianwei Zhu; Bing Wang; Kang Ning; Shiwei Sun; Shuai Cheng Li; Dongbo Bu
Journal: Bioinformatics Date: 2015-10-10 Impact factor: 6.937

3. repRNA: a web server for generating various feature vectors of RNA sequences.

Authors: Bin Liu; Fule Liu; Longyun Fang; Xiaolong Wang; Kuo-Chen Chou
Journal: Mol Genet Genomics Date: 2015-06-18 Impact factor: 3.291

4. iAFP-Ense: An Ensemble Classifier for Identifying Antifreeze Protein by Incorporating Grey Model and PSSM into PseAAC.

Authors: Xuan Xiao; Mengjuan Hui; Zi Liu
Journal: J Membr Biol Date: 2016-11-03 Impact factor: 1.843

5. Evolutionary mechanism and biological functions of 8-mers containing CG dinucleotide in yeast.

Authors: Yan Zheng; Hong Li; Yue Wang; Hu Meng; Qiang Zhang; Xiaoqing Zhao
Journal: Chromosome Res Date: 2017-02-09 Impact factor: 5.239

6. Sequence-specific flexibility organization of splicing flanking sequence and prediction of splice sites in the human genome.

Authors: Yongchun Zuo; Pengfei Zhang; Li Liu; Tao Li; Yong Peng; Guangpeng Li; Qianzhong Li
Journal: Chromosome Res Date: 2014-04-12 Impact factor: 5.239

7. Protein remote homology detection by combining Chou's distance-pair pseudo amino acid composition and principal component analysis.

Authors: Bin Liu; Junjie Chen; Xiaolong Wang
Journal: Mol Genet Genomics Date: 2015-04-21 Impact factor: 3.291

8. BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches.

Authors: Bin Liu; Xin Gao; Hanyu Zhang
Journal: Nucleic Acids Res Date: 2019-11-18 Impact factor: 16.971

9. Probabilistic inference of biological networks via data integration.

Authors: Mark F Rogers; Colin Campbell; Yiming Ying
Journal: Biomed Res Int Date: 2015-03-22 Impact factor: 3.411

10. Comparative analysis and prediction of nucleosome positioning using integrative feature representation and machine learning algorithms.

Authors: Guo-Sheng Han; Qi Li; Ying Li
Journal: BMC Bioinformatics Date: 2021-06-02 Impact factor: 3.307