Performance comparison of gene family clustering methods with expert curated gene family data set in Arabidopsis thaliana.

Kuan Yang, Liqing Zhang.

Abstract

With the exponential growth of genomics data, the demand for reliable clustering methods is increasing every day. Despite the wide usage of many clustering algorithms, the accuracy of these algorithms has been evaluated mostly on simulated data sets and seldom on real biological data for which a "correct answer" is available. In order to address this issue, we use the manually curated high-quality Arabidopsis thaliana gene family database as a "gold standard" to conduct a comprehensive comparison of the accuracies of four widely used clustering methods including K-means, TribeMCL, single-linkage clustering and complete-linkage clustering. We compare the results from running different clustering methods on two matrices: the E-value matrix and the k-tuple distance matrix. The E-value matrix is computed based on BLAST E-values. The k-tuple distance matrix is computed based on the difference in tuple frequencies. The TribeMCL with the E-value matrix performed best, with the Inflation parameter (=1.15) tuned considerably lower than what has been suggested previously (=2). The single-linkage clustering method with the E-value matrix was second best. Single-linkage clustering, K-means clustering, complete-linkage clustering, and TribeMCL with a k-tuple distance matrix performed reasonably well. Complete-linkage clustering with the k-tuple distance matrix performed the worst.
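As a rough illustration of the k-tuple route described above, the following sketch builds a distance from differences in k-tuple frequencies and then groups sequences by single linkage. The abstract gives neither the exact distance formula nor the cutoff, so the squared-difference form, the value k=2, the cutoff of 0.05, the function names, and the toy sequences are all assumptions made for this example, not the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def ktuple_freqs(seq, k=2):
    """Relative frequency of each overlapping k-tuple in seq."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {t: c / total for t, c in counts.items()}


def ktuple_distance(a, b, k=2):
    """Sum of squared differences in k-tuple frequencies (assumed form)."""
    fa, fb = ktuple_freqs(a, k), ktuple_freqs(b, k)
    return sum((fa.get(t, 0.0) - fb.get(t, 0.0)) ** 2 for t in set(fa) | set(fb))


def single_linkage(names, dist, cutoff):
    """Single linkage cut at `cutoff`: join every pair whose distance is
    at most the cutoff and return the resulting connected components."""
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if dist(a, b) <= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(sorted(c) for c in clusters.values())


# Toy sequences invented for illustration; they are not from the paper.
seqs = {
    "g1": "MKVLAAGIVMKVLAAGIV",
    "g2": "MKVLAAGIVMKVLSAGIV",
    "g3": "PQRSTWYPQRSTWYPQRS",
    "g4": "PQRSTWYPQRSTWFPQRS",
}
families = single_linkage(
    list(seqs),
    lambda a, b: ktuple_distance(seqs[a], seqs[b]),
    cutoff=0.05,  # hypothetical threshold
)
print(families)  # → [['g1', 'g2'], ['g3', 'g4']]
```

Cutting single linkage at a fixed threshold is equivalent to taking connected components of the "closer than cutoff" graph, which is why a union-find structure is enough here. Swapping `ktuple_distance` for a score derived from BLAST E-values would give the other input matrix the abstract compares.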

Year: 2008    PMID: 18493791    DOI: 10.1007/s00425-008-0748-7

Source DB: PubMed    Journal: Planta    ISSN: 0032-0935    Impact factor: 4.116


