
TreeCluster: Clustering biological sequences using phylogenetic trees.

Metin Balaban, Niema Moshiri, Uyen Mai, Xingfan Jia, Siavash Mirarab.

Abstract

Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Given recent advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given an arbitrary tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints, limiting (1) the diameter of each cluster, (2) the sum of its branch lengths, or (3) chains of pairwise distances. All three problems can be solved in time that increases linearly with the size of the tree, and for two of the three criteria, the algorithms were already known in the theoretical computer science literature. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU clustering for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.
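
To make the first constraint (a limit on cluster diameter) concrete, the following is a minimal sketch of a bottom-up greedy procedure that cuts as few edges as possible so that no two leaves in the same component are farther apart than a threshold. It is an illustration under stated assumptions, not TreeCluster's actual code: the function name `max_diameter_clusters`, the dictionary-based tree representation, and the example tree are all hypothetical. The sort at each node keeps the sketch short; on a binary tree the work per node is constant, so the total time is linear in the tree size, but on trees with high-degree nodes the sort adds a logarithmic factor.

```python
def max_diameter_clusters(children, root, threshold):
    """Cut edges greedily so that, in every remaining component, no two
    leaves are more than `threshold` apart along the tree.

    children: dict mapping each internal node to a list of
              (child, branch_length) pairs; leaves have no entry.
    Returns a list of leaf sets, one per cluster.
    (Hypothetical helper for illustration; not part of TreeCluster.)
    """
    # Pre-order listing: every node appears after its parent.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(child for child, _ in children.get(node, []))

    far = {}     # far[u]: distance from u to its farthest still-connected leaf
    cut = set()  # child endpoints of the edges that were removed

    # Reverse pre-order visits children before their parent.
    for node in reversed(order):
        kids = children.get(node, [])
        if not kids:
            far[node] = 0.0
            continue
        # Depth of each child subtree as seen from `node`, deepest first.
        depths = sorted(((far[c] + w, c) for c, w in kids),
                        key=lambda t: t[0], reverse=True)
        # While the two deepest subtrees form a leaf-to-leaf path that is
        # too long, detach the deepest one.
        i = 0
        while i + 1 < len(depths) and depths[i][0] + depths[i + 1][0] > threshold:
            cut.add(depths[i][1])
            i += 1
        far[node] = depths[i][0]

    # Label components top-down: a cut child starts a new component.
    label = {root: root}
    clusters = {}
    for node in order:
        kids = children.get(node, [])
        for child, _ in kids:
            label[child] = child if child in cut else label[node]
        if not kids:
            clusters.setdefault(label[node], set()).add(node)
    return list(clusters.values())


# Hypothetical four-leaf tree: ((A:1,B:1)x:1,(C:1,D:1)y:4)root
tree = {
    "root": [("x", 1.0), ("y", 4.0)],
    "x": [("A", 1.0), ("B", 1.0)],
    "y": [("C", 1.0), ("D", 1.0)],
}
result = max_diameter_clusters(tree, "root", threshold=3.0)
print(sorted(sorted(c) for c in result))  # [['A', 'B'], ['C', 'D']]
```

In this example the long branch leading to `y` makes every path between {A, B} and {C, D} exceed the threshold of 3.0, so a single cut on that branch yields two clusters, each with diameter 2.0.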

Year:  2019        PMID: 31437182      PMCID: PMC6705769          DOI: 10.1371/journal.pone.0221068

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


