
Profile-based string kernels for remote homology detection and motif extraction.

Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund, Christina Leslie.

Abstract

We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We further examine how to incorporate predicted secondary structure information into the profile kernel to obtain a small but significant performance improvement. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs"--short regions of the original profile that contribute almost all the weight of the SVM classification score--and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results also outperform cluster kernels while providing much better scalability to large datasets.
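As an illustration of the "position-dependent mutation neighborhood" idea described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that a k-mer belongs to the neighborhood of a profile window when the sum of its negative log emission probabilities falls below a threshold `sigma`; the abstract does not state this rule, and the names `profile_feature_map`, `profile_kernel`, and `sigma` are hypothetical. It also enumerates all k-mers by brute force over a toy alphabet, whereas the abstract credits an efficient data structure for the real computation.

```python
import math
from collections import Counter
from itertools import product


def profile_feature_map(profile, k, sigma, alphabet):
    """Map a profile to counts of k-mers lying in its mutation neighborhoods.

    `profile` is a list of dicts, one per sequence position, mapping each
    residue to its emission probability (for example, from PSI-BLAST).
    Assumed rule: a k-mer is in the neighborhood of a window if the sum of
    -log p over its residues is strictly below `sigma`.
    """
    counts = Counter()
    for start in range(len(profile) - k + 1):
        window = profile[start:start + k]
        # Brute-force enumeration of all |alphabet|^k candidate k-mers;
        # only practical for toy inputs.
        for beta in product(alphabet, repeat=k):
            cost = 0.0
            for column, residue in zip(window, beta):
                p = column.get(residue, 0.0)
                if p <= 0.0:
                    cost = math.inf  # impossible residue at this position
                    break
                cost -= math.log(p)
            if cost < sigma:
                counts["".join(beta)] += 1
    return counts


def profile_kernel(profile_a, profile_b, k, sigma, alphabet):
    """Kernel value: inner product of the two k-mer count vectors."""
    fa = profile_feature_map(profile_a, k, sigma, alphabet)
    fb = profile_feature_map(profile_b, k, sigma, alphabet)
    return sum(value * fb[kmer] for kmer, value in fa.items())


# Toy usage over a two-letter alphabet (real profiles cover 20 amino acids).
sharp = [{"A": 1.0}, {"B": 1.0}]
fuzzy = [{"A": 0.5, "B": 0.5}, {"B": 1.0}]
value = profile_kernel(sharp, fuzzy, k=2, sigma=1.0, alphabet="AB")
```

In this toy case the uncertain first position of `fuzzy` admits both "AB" and "BB", so the two profiles share the k-mer "AB" even though only one of them is deterministic. The resulting kernel values could then be passed to any SVM implementation that accepts a precomputed kernel matrix.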

Year:  2005        PMID: 16108083     DOI: 10.1142/s021972000500120x

Source DB:  PubMed          Journal:  J Bioinform Comput Biol        ISSN: 0219-7200            Impact factor:   1.122


  39 in total

Review 1.  Homology and phylogeny and their automated inference.

Authors:  Georg Fuellen
Journal:  Naturwissenschaften       Date:  2008-02-21

2.  Improved prediction of malaria degradomes by supervised learning with SVM and profile kernel.

Authors:  Rui Kuang; Jianying Gu; Hong Cai; Yufeng Wang
Journal:  Genetica       Date:  2008-12-06       Impact factor: 1.082

3.  A coverage criterion for spaced seeds and its applications to support vector machine string kernels and k-mer distances.

Authors:  Laurent Noé; Donald E K Martin
Journal:  J Comput Biol       Date:  2014-12       Impact factor: 1.479

4.  Predicting protein-protein interactions through sequence-based deep learning.

Authors:  Somaye Hashemifar; Behnam Neyshabur; Aly A Khan; Jinbo Xu
Journal:  Bioinformatics       Date:  2018-09-01       Impact factor: 6.937

5.  Protein-ligand interaction prediction: an improved chemogenomics approach.

Authors:  Laurent Jacob; Jean-Philippe Vert
Journal:  Bioinformatics       Date:  2008-08-01       Impact factor: 6.937

Review 6.  Machine learning for in silico virtual screening and chemical genomics: new strategies.

Authors:  Jean-Philippe Vert; Laurent Jacob
Journal:  Comb Chem High Throughput Screen       Date:  2008-09       Impact factor: 1.339

7.  High resolution models of transcription factor-DNA affinities improve in vitro and in vivo binding predictions.

Authors:  Phaedra Agius; Aaron Arvey; William Chang; William Stafford Noble; Christina Leslie
Journal:  PLoS Comput Biol       Date:  2010-09-09       Impact factor: 4.475

8.  Graphlet kernels for prediction of functional residues in protein structures.

Authors:  Vladimir Vacic; Lilia M Iakoucheva; Stefano Lonardi; Predrag Radivojac
Journal:  J Comput Biol       Date:  2010-01       Impact factor: 1.479

9.  DescFold: a web server for protein fold recognition.

Authors:  Ren-Xiang Yan; Jing-Na Si; Chuan Wang; Ziding Zhang
Journal:  BMC Bioinformatics       Date:  2009-12-14       Impact factor: 3.169

10.  A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis.

Authors:  Bin Liu; Xiaolong Wang; Lei Lin; Qiwen Dong; Xuan Wang
Journal:  BMC Bioinformatics       Date:  2008-12-01       Impact factor: 3.169
