ProClust: improved clustering of protein sequences with an extended graph-based approach.

P Pipenbacher, A Schliep, S Schneckener, A Schönhuth, D Schomburg, R Schrader.

Abstract

MOTIVATION: The problem of finding remote homologues of a given protein sequence via alignment methods is not fully solved. In fact, the task seems to become more difficult with more data. As the size of the database increases, so does the noise level; the highest alignment scores due to random similarities increase and can be higher than the alignment score between true homologues. Comparing two sequences with an arbitrary alignment method yields a similarity value which may indicate an evolutionary relationship between them. A threshold value is usually chosen to distinguish between true homologue relationships and random similarities. To compensate for the higher probability of spurious hits in larger databases, this threshold is increased. Increasing specificity, however, leads to decreased sensitivity as a matter of principle. Sensitivity can be recovered by utilizing refined protocols. A number of approaches to this challenge have made use of the fact that proteins are often members of some larger protein family. This can be exploited by using position-specific substitution matrices or profiles, or by making use of transitivity of homology. Transitivity refers to the concept of concluding homology between proteins A and C based on homology between A and a third protein B and between B and C. It has been demonstrated that transitivity can lead to substantial improvement in recognition of remote homologues, particularly in cases where the alignment score of A and C is below the noise level. A natural limit to the use of transitivity is imposed by domains. Domains, compact independent sub-units of proteins, are often shared between otherwise distinct proteins, and can cause substantial problems by incorrectly linking otherwise unrelated proteins.
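The idea of transitivity described above can be illustrated with a minimal sketch. It is not the ProClust implementation: the function name, the score table, the threshold, and the `max_intermediates` bound are all hypothetical and chosen only to show how A and C can be linked through B even though their direct score falls below the threshold.

```python
from collections import deque

def transitive_homologues(query, scores, threshold, max_intermediates):
    """Return {protein: chain length} for proteins linked to `query`.

    `scores` maps (a, b) name pairs to a symmetric alignment score.
    All inputs are hypothetical illustration data, not ProClust's format.
    """
    # Keep only pairs whose score reaches the threshold (undirected graph).
    neighbours = {}
    for (a, b), score in scores.items():
        if score >= threshold:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)

    # Breadth-first search: a chain of length d uses d - 1 intermediates.
    depth = {query: 0}
    queue = deque([query])
    while queue:
        node = queue.popleft()
        if depth[node] == max_intermediates + 1:
            continue  # bound on the degree of transitivity reached
        for nxt in neighbours.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    del depth[query]
    return depth

# A-C scores below the threshold, but B links them.
example_scores = {("A", "B"): 80, ("B", "C"): 75, ("A", "C"): 20, ("C", "D"): 90}
direct_only = transitive_homologues("A", example_scores, 50, 0)
one_step = transitive_homologues("A", example_scores, 50, 1)
```

With no intermediates only B is found; allowing one intermediate also recovers C. The same mechanism would link unrelated proteins through a shared domain, which is the limit the abstract points out.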
RESULTS: We extend a graph-based clustering algorithm which uses an asymmetric distance measure, scaling similarity values based on the length of the protein sequences compared. Additionally, the significance of alignment scores is taken into account and used for a filtering step in the algorithm. Post-processing to merge further clusters based on profile HMMs is proposed. SCOP sequences and their superfamily-level classification are used as a test set for a clustering computed with our method for the joint data set containing both SCOP and SWISS-PROT. Note that the joint data set includes all multi-domain proteins, which contain the SCOP domains that are a potential source of incorrect links. At high specificities, our method compares very favorably with PSI-Blast, which is probably the most widely used tool for finding remote homologues. We demonstrate that using transitivity with as many as twelve intermediate sequences is crucial to achieving this level of performance. Moreover, from analysis of false positives we conclude that our method seems to correctly bound the degree of transitivity used. This analysis also yields explicit guidance in choosing parameters. The heuristics of the asymmetric distance measure used neither solve the multi-domain problem from a theoretical point of view, nor do they avoid all types of problems we have observed in real data. Nevertheless, they do provide a substantial improvement over existing approaches.
AVAILABILITY: The complete software source is freely available to all users under the GNU General Public License (GPL) from http://www.bioinformatik.uni-koeln.de/~proclust/download/
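The abstract does not give the formula for the asymmetric measure, so the following is only a sketch under stated assumptions: the raw score is divided by the length of the sequence the edge starts from, and clusters are groups of sequences that can reach each other along directed edges. Both choices, and every name and number below, are hypothetical illustrations rather than the published method.

```python
def directed_edges(raw_scores, lengths, cutoff):
    """Build directed edges a -> b where score / len(a) reaches `cutoff`.

    Assumption: scaling by the source sequence's length; the abstract only
    says similarity values are scaled based on sequence length.
    """
    edges = {name: set() for name in lengths}
    for (a, b), score in raw_scores.items():
        if score / lengths[a] >= cutoff:
            edges[a].add(b)
        if score / lengths[b] >= cutoff:
            edges[b].add(a)
    return edges

def reachable(start, edges):
    """Return all nodes reachable from `start` along directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def mutual_clusters(edges):
    """Group nodes that can reach each other (assumed clustering rule)."""
    reach = {n: reachable(n, edges) for n in edges}
    clusters, assigned = [], set()
    for n in sorted(edges):
        if n not in assigned:
            cluster = {m for m in reach[n] if n in reach[m]}
            clusters.append(cluster)
            assigned |= cluster
    return clusters

# Hypothetical data: "multi" is a long protein sharing one domain with the
# kinases and another with "other1".
lengths = {"kinase1": 100, "kinase2": 110, "multi": 400, "other1": 120}
raw_scores = {("kinase1", "kinase2"): 90, ("kinase1", "multi"): 85,
              ("kinase2", "multi"): 88, ("other1", "multi"): 100}
clusters = mutual_clusters(directed_edges(raw_scores, lengths, 0.5))
```

In this toy data the short sequences point to the long multi-domain protein, but it does not point back, so the kinases and `other1` are not merged through it. An undirected graph over the same raw scores would join all four sequences into one cluster.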

Year:  2002        PMID: 12386002     DOI: 10.1093/bioinformatics/18.suppl_2.s182

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  21 in total

1.  Detection of homologous proteins by an intermediate sequence search.

Authors:  Bino John; Andrej Sali
Journal:  Protein Sci       Date:  2004-01       Impact factor: 6.725

2.  Performance comparison of gene family clustering methods with expert curated gene family data set in Arabidopsis thaliana.

Authors:  Kuan Yang; Liqing Zhang
Journal:  Planta       Date:  2008-05-21       Impact factor: 4.116

3.  Genome cluster database. A sequence family analysis platform for Arabidopsis and rice.

Authors:  Kevin Horan; Josh Lauricha; Julia Bailey-Serres; Natasha Raikhel; Thomas Girke
Journal:  Plant Physiol       Date:  2005-05       Impact factor: 8.340

4.  The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families.

Authors:  Shibu Yooseph; Granger Sutton; Douglas B Rusch; Aaron L Halpern; Shannon J Williamson; Karin Remington; Jonathan A Eisen; Karla B Heidelberg; Gerard Manning; Weizhong Li; Lukasz Jaroszewski; Piotr Cieplak; Christopher S Miller; Huiying Li; Susan T Mashiyama; Marcin P Joachimiak; Christopher van Belle; John-Marc Chandonia; David A Soergel; Yufeng Zhai; Kannan Natarajan; Shaun Lee; Benjamin J Raphael; Vineet Bafna; Robert Friedman; Steven E Brenner; Adam Godzik; David Eisenberg; Jack E Dixon; Susan S Taylor; Robert L Strausberg; Marvin Frazier; J Craig Venter
Journal:  PLoS Biol       Date:  2007-03       Impact factor: 8.029

5.  SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale.

Authors:  Tamás Nepusz; Rajkumar Sasidharan; Alberto Paccanaro
Journal:  BMC Bioinformatics       Date:  2010-03-09       Impact factor: 3.169

6.  Genome-wide comparative gene family classification.

Authors:  Christian Frech; Nansheng Chen
Journal:  PLoS One       Date:  2010-10-15       Impact factor: 3.240

7.  Large scale clustering of protein sequences with FORCE - A layout based heuristic for weighted cluster editing.

Authors:  Tobias Wittkop; Jan Baumbach; Francisco P Lobo; Sven Rahmann
Journal:  BMC Bioinformatics       Date:  2007-10-17       Impact factor: 3.169

8.  Ultrafast clustering algorithms for metagenomic sequence analysis.

Authors:  Weizhong Li; Limin Fu; Beifang Niu; Sitao Wu; John Wooley
Journal:  Brief Bioinform       Date:  2012-07-06       Impact factor: 11.622

9.  GFam: a platform for automatic annotation of gene families.

Authors:  Rajkumar Sasidharan; Tamás Nepusz; David Swarbreck; Eva Huala; Alberto Paccanaro
Journal:  Nucleic Acids Res       Date:  2012-07-11       Impact factor: 16.971

10.  Comprehensive computational analysis of bacterial CRP/FNR superfamily and its target motifs reveals stepwise evolution of transcriptional networks.

Authors:  Motomu Matsui; Masaru Tomita; Akio Kanai
Journal:  Genome Biol Evol       Date:  2013       Impact factor: 3.416
