Fabrizio Menardo, Chloé Loiseau, Daniela Brites, Mireia Coscolla, Sebastian M Gygli, Liliana K Rutaihwa, Andrej Trauner, Christian Beisel, Sonia Borrell, Sebastien Gagneux.
Abstract
BACKGROUND: Large sequence datasets are difficult to visualize and handle. Additionally, they often do not represent a random subset of the natural diversity but are instead the result of uncoordinated convenience sampling. Consequently, they can suffer from redundancy and sampling biases.
Keywords: Biogeography; Clone elimination; Influenza; Large phylogenetic trees; Redundancy reduction; Representative sample; Sampling bias; Size reduction; Tuberculosis
Year: 2018 PMID: 29716518 PMCID: PMC5930393 DOI: 10.1186/s12859-018-2164-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The core routine of Treemmer (with -r = 1): at each iteration the pair of closest leaves is identified and one of the two leaves is pruned from the tree, minimizing the loss of diversity.
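The routine described in Fig. 1 can be illustrated with a minimal sketch. This is not the Treemmer source code: the toy tree, the function names, the brute-force search over all leaf pairs, and the deterministic choice of which leaf to drop are all assumptions made for illustration. The sketch also reports the relative tree length (RTL), taken here to be the summed branch lengths of the reduced tree divided by those of the original tree.

```python
from itertools import combinations

# Hypothetical toy tree: node -> (parent, branch length to parent).
TOY_TREE = {
    "root": (None, 0.0),
    "n1": ("root", 0.1), "n2": ("root", 0.2),
    "A": ("n1", 0.01), "B": ("n1", 0.02),
    "C": ("n2", 0.3), "D": ("n2", 0.4),
}

def children_of(tree):
    """Map each internal node to the list of its children."""
    kids = {}
    for node, (parent, _) in tree.items():
        if parent is not None:
            kids.setdefault(parent, []).append(node)
    return kids

def leaves(tree):
    """Return the sorted names of all nodes without children."""
    kids = children_of(tree)
    return sorted(n for n in tree if n not in kids)

def path_to_root(tree, node):
    """Return {ancestor: distance from node}, including the node itself."""
    dist, out = 0.0, {}
    while node is not None:
        out[node] = dist
        parent, length = tree[node]
        dist += length
        node = parent
    return out

def patristic(tree, a, b):
    """Path length between two leaves through their closest common ancestor."""
    pa, pb = path_to_root(tree, a), path_to_root(tree, b)
    return min(pa[n] + pb[n] for n in pa if n in pb)

def prune_leaf(tree, leaf):
    """Remove a leaf in place and collapse a parent left with one child."""
    parent, _ = tree.pop(leaf)
    remaining = children_of(tree).get(parent, [])
    if len(remaining) == 1:
        only = remaining[0]
        grandparent, parent_len = tree.pop(parent)
        _, child_len = tree[only]
        # Merge the two branches; a child promoted to root gets length 0.
        new_len = parent_len + child_len if grandparent is not None else 0.0
        tree[only] = (grandparent, new_len)

def trim(tree, n_leaves):
    """Prune one leaf of the closest pair until n_leaves remain."""
    if n_leaves < 2:
        raise ValueError("n_leaves must be at least 2")
    tree = dict(tree)  # work on a copy, leave the input untouched
    while len(leaves(tree)) > n_leaves:
        pairs = combinations(leaves(tree), 2)
        _, drop = min(pairs, key=lambda p: patristic(tree, *p))
        # Assumption: always drop the second leaf of the pair so that
        # the example is reproducible.
        prune_leaf(tree, drop)
    return tree

def tree_length(tree):
    """Sum of all branch lengths in the tree."""
    return sum(length for _, length in tree.values())

if __name__ == "__main__":
    for k in (4, 3, 2):
        reduced = trim(TOY_TREE, k)
        rtl = tree_length(reduced) / tree_length(TOY_TREE)
        print(k, leaves(reduced), round(rtl, 4))
```

The all-pairs search is quadratic in the number of leaves and is only meant to make the logic explicit on a small tree; it would not scale to datasets of the size shown in Figs. 3 and 4.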
Fig. 2 Plot of the relative tree length (RTL) decay for the MTB dataset. Four different analyses were run with -r = 1 (black dots), -r = 10 (red dots), -r = 100 (blue dots) and -r = 1000 (green dots). The slow decay of the RTL is due to the high redundancy of the dataset. The RTL decays for -r = 1, 10 and 100 overlap and are indistinguishable.
Fig. 3 Comparison of the original (a) and reduced (b) trees of the MTB dataset, with 10,303 and 4,919 leaves, respectively. The scale bar indicates expected substitutions per position (only polymorphic nucleotide positions were included in the alignment). The different lineages of TB are labeled (Maf: Mycobacterium africanum (L5 and L6) plus animal lineages).
Fig. 4 Comparison of the original (a) and reduced (b) trees of the influenza A virus dataset, with 2,080 and 250 leaves, respectively. The scale bar indicates years.
Fig. 5 Plot of the relative tree length (RTL) decay for the influenza A virus dataset. Three different analyses were run with -r = 1 (black dots), -r = 10 (red dots) and -r = 100 (blue dots). For this dataset the decay was faster than for the MTB dataset. This is due to the different structure of the phylogenetic trees and to the lower redundancy of the viral dataset.