Large scale clustering of protein sequences with FORCE - A layout based heuristic for weighted cluster editing.

Tobias Wittkop, Jan Baumbach, Francisco P Lobo, Sven Rahmann.

Abstract

BACKGROUND: Detecting groups of functionally related proteins from their amino acid sequences alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that have to be clustered efficiently: at increased speed and with constant or increased accuracy.
RESULTS: We argue that the model of weighted cluster editing, also known as transitive graph projection, is well suited to protein clustering. We present the FORCE heuristic, which is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192 187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet.
CONCLUSION: FORCE is an applicable alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets are available at http://gi.cebitec.uni-bielefeld.de/comet/force/.
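
ILLUSTRATION: The abstract names weighted cluster editing without spelling out the objective, so the following is a minimal sketch under stated assumptions, not the FORCE heuristic itself. It assumes the common formulation in which two objects are joined by an edge when their similarity exceeds a threshold, deleting an edge costs the similarity minus the threshold, inserting a missing edge costs the threshold minus the similarity, and the goal is the cheapest set of edits that yields disjoint cliques. All names (`editing_cost`, `partitions`, `best_clustering`) and the toy similarity values are hypothetical, and the exhaustive search is only feasible for a handful of objects.

```python
from itertools import combinations


def editing_cost(similarity, threshold, clusters):
    """Cost of editing the thresholded similarity graph into the given clusters."""
    label = {node: i for i, cluster in enumerate(clusters) for node in cluster}
    cost = 0.0
    for u, v in combinations(sorted(label), 2):
        # Pairs missing from the dictionary are treated as similarity 0 (assumption).
        s = similarity.get((u, v), similarity.get((v, u), 0.0))
        has_edge = s > threshold
        same_cluster = label[u] == label[v]
        if has_edge and not same_cluster:
            cost += s - threshold      # edge must be deleted
        elif not has_edge and same_cluster:
            cost += threshold - s      # edge must be inserted
    return cost


def partitions(items):
    """Yield every partition of a list (exponential; toy inputs only)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part


def best_clustering(nodes, similarity, threshold):
    """Exhaustively pick the partition with minimal editing cost."""
    return min(partitions(list(nodes)),
               key=lambda p: editing_cost(similarity, threshold, p))


# Toy data: four hypothetical proteins with made-up similarity scores.
similarity = {("a", "b"): 9.0, ("b", "c"): 8.0, ("a", "c"): 4.0, ("c", "d"): 6.0}
best = best_clustering("abcd", similarity, threshold=5.0)
print(sorted(sorted(c) for c in best), editing_cost(similarity, 5.0, best))
```

With these toy values, grouping a, b and c while leaving d alone requires inserting the weak a-c edge and deleting the c-d edge. A heuristic such as FORCE is needed because this exhaustive enumeration cannot scale to datasets of the size described above.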

Year:  2007        PMID: 17941985      PMCID: PMC2147039          DOI: 10.1186/1471-2105-8-396

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


