Minimizing the average distance to a closest leaf in a phylogenetic tree.

Frederick A Matsen, Aaron Gallagher, Connor O McCoy.

Abstract

When performing an analysis on a collection of molecular sequences, it can be convenient to reduce the number of sequences under consideration while maintaining some characteristic of a larger collection of sequences. For example, one may wish to select a subset of high-quality sequences that represent the diversity of a larger collection of sequences. One may also wish to specialize a large database of characterized "reference sequences" to a smaller subset that is as close as possible on average to a collection of "query sequences" of interest. Such a representative subset can be useful whenever one wishes to find a set of reference sequences that is appropriate to use for comparative analysis of environmentally derived sequences, such as for selecting "reference tree" sequences for phylogenetic placement of metagenomic reads. In this article, we formalize these problems in terms of the minimization of the Average Distance to the Closest Leaf (ADCL) and investigate algorithms to perform the relevant minimization. We show that the greedy algorithm is not effective, show that a variant of the Partitioning Around Medoids (PAM) heuristic gets stuck in local minima, and develop an exact dynamic programming approach. Using this exact program we note that the performance of PAM appears to be good for simulated trees, and is faster than the exact algorithm for small trees. On the other hand, the exact program gives solutions for all numbers of leaves less than or equal to the given desired number of leaves, whereas PAM only gives a solution for the prespecified number of leaves. Via application to real data, we show that the ADCL criterion chooses chimeric sequences less often than random subsets, whereas the maximization of phylogenetic diversity chooses them more often than random. These algorithms have been implemented in publicly available software.
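To make the ADCL criterion concrete, the following is a minimal sketch, not the authors' implementation. It assumes the leaf-to-leaf tree distances are already available as a plain symmetric matrix, and the names `adcl`, `exhaustive_min_adcl`, `greedy_adcl`, `POSITIONS`, and `DIST` are hypothetical. The toy input places five leaves at points on a line (a valid tree metric) and compares an exhaustive search, which is feasible only for tiny inputs and is not the dynamic program described above, with a simple greedy pass.

```python
from itertools import combinations


def adcl(dist, selected):
    """Average, over all leaves, of the distance to the closest selected leaf."""
    if not selected:
        raise ValueError("need at least one selected leaf")
    n = len(dist)
    return sum(min(dist[i][j] for j in selected) for i in range(n)) / n


def exhaustive_min_adcl(dist, k):
    """Try every k-subset of leaves and return the first one with minimal ADCL."""
    return min(combinations(range(len(dist)), k), key=lambda s: adcl(dist, s))


def greedy_adcl(dist, k):
    """Add one leaf at a time, each time choosing the leaf that lowers ADCL most."""
    chosen = []
    for _ in range(k):
        rest = [j for j in range(len(dist)) if j not in chosen]
        chosen.append(min(rest, key=lambda j: adcl(dist, chosen + [j])))
    return tuple(sorted(chosen))


# Hypothetical toy data: five leaves at these positions on a line.
POSITIONS = [0, 1, 5, 9, 10]
DIST = [[abs(a - b) for b in POSITIONS] for a in POSITIONS]

if __name__ == "__main__":
    best = exhaustive_min_adcl(DIST, 2)
    greedy = greedy_adcl(DIST, 2)
    print("exhaustive:", best, adcl(DIST, best))
    print("greedy:    ", greedy, adcl(DIST, greedy))
```

On this toy input the exhaustive search reaches an ADCL of 1.2, whereas the greedy pass first commits to the central leaf and ends at 2.0. This is one small illustration of how a greedy choice can be suboptimal for ADCL; it is not an example taken from the article.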

Year:  2013        PMID: 23843314      PMCID: PMC3797636          DOI: 10.1093/sysbio/syt044

Source DB:  PubMed          Journal:  Syst Biol        ISSN: 1063-5157            Impact factor:   15.683


