Andreas Rempel, Roland Wittler.
Abstract
SUMMARY: SANS serif is novel software for alignment-free, whole-genome-based phylogeny estimation that follows a pangenomic approach to efficiently calculate a set of splits in a phylogenetic tree or network.
Year: 2021 PMID: 34132752 PMCID: PMC8665756 DOI: 10.1093/bioinformatics/btab444
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. SANS serif methodology and evaluation. (A) One sequence file (FASTA/FASTQ) per genome/assembly is provided as input. For each file, the k-mers are extracted, transformed into bit vectors using a 2-bit encoding per character, and stored in a hash table. (B) For each k-mer, another bit vector is stored to encode its presence or absence: a one (or zero) at position i indicates its presence (resp. absence) in the ith input file. Presence/absence patterns are combined into splits and stored in a second hash table together with two counts: the number of k-mers having that pattern and the number of k-mers having the complementary pattern. (C) Both counts are combined into an overall weight per split, e.g. using the geometric mean. Splits can be filtered and visualized as a network. As an example, a subnet of a weakly compatible split set of the Salmonella dataset (Zhou ) is shown. Splits are represented by parallel edges. (D and E) Runtime (user time) and peak memory usage of SANS on the Salmonella dataset on a single 2 GHz processor. For random subsamples of n assemblies and , the 10n highest-weighted splits were output. Values were averaged over three random subsamples each. (F) Accuracy w.r.t. the reference tree. Precision: (number of called splits also in the reference tree)/(number of all called splits). Recall: (number of reference splits also in the call set)/(number of all reference splits). Trivial splits, i.e. splits separating single leaves, are not considered. Filters: only considering the x highest-weighted splits for increasing x (line); greedily extracting a subset that corresponds to a tree (strict); additionally keeping a second subset that corresponds to a tree (2-tree); greedily extracting a subset that is weakly compatible (weakly). For comparison, the accuracy of Mashtree (Katz ) with default parameters as well as a reconciled tree of 3002 core genes (Zhou , Fig. 2A, supertree 2) is included, emphasizing the complexity of the dataset and the sensibility of the applied measure.
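The steps described in panels (A) to (C) and the accuracy measure of panel (F) can be illustrated with a minimal Python sketch. It is not the SANS serif implementation: the function names are hypothetical, genomes are passed as plain strings rather than FASTA/FASTQ files, reverse complements are ignored, Python dictionaries stand in for the hash tables, and the plain geometric mean is assumed as the weighting function.

```python
from collections import defaultdict
from math import sqrt

# Assumed 2-bit code per nucleotide; the mapping used by SANS serif may differ.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a k-mer into one integer using 2 bits per character (panel A)."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENC[base]
    return value


def kmer_patterns(genomes, k):
    """Map each encoded k-mer to a presence/absence bit vector (panel B).

    Bit i of the pattern is 1 if the k-mer occurs in the i-th genome.
    """
    patterns = defaultdict(int)
    for i, seq in enumerate(genomes):
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if all(base in ENC for base in kmer):  # skip k-mers containing N etc.
                patterns[encode_kmer(kmer)] |= 1 << i
    return patterns


def split_weights(patterns, n):
    """Count k-mers per split and combine both counts into a weight (panel C).

    A pattern and its complement describe the same split, so the smaller
    of the two integers serves as the key (a hypothetical convention).
    """
    full = (1 << n) - 1
    counts = defaultdict(lambda: [0, 0])
    for pattern in patterns.values():
        if pattern == full:
            continue  # k-mer present in every genome: separates nothing
        complement = full ^ pattern
        if pattern < complement:
            counts[pattern][0] += 1
        else:
            counts[complement][1] += 1
    # Geometric mean of the two counts, as mentioned in the caption.
    return {split: sqrt(a * b) for split, (a, b) in counts.items()}


def is_trivial(split, n):
    """A split is trivial if it separates a single leaf from all others."""
    ones = bin(split).count("1")
    return ones in (1, n - 1)


def precision_recall(called, reference):
    """Precision and recall of called splits w.r.t. reference splits (panel F)."""
    called, reference = set(called), set(reference)
    hits = len(called & reference)
    return hits / len(called), hits / len(reference)


if __name__ == "__main__":
    genomes = ["ACGTA", "ACGTT", "TTGCA"]  # toy input, not real data
    weights = split_weights(kmer_patterns(genomes, k=3), n=len(genomes))
    for split, weight in sorted(weights.items(), key=lambda item: -item[1]):
        print(format(split, "03b"), round(weight, 3), is_trivial(split, 3))
```

With the plain geometric mean, a split supported by k-mers on only one side receives weight zero, so a split scores highly only when k-mers support both sides of it. Filtering the resulting splits for (weak) compatibility, as described for panel (F), is not covered by this sketch.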