Cheng Ye1, Bryan Thornlow2,3, Angie Hinrichs3, Alexander Kramer2,3, Cade Mirchandani2,3, Devika Torvi4, Robert Lanfear5, Russell Corbett-Detig2,3, Yatish Turakhia1.
Abstract
MOTIVATION: Phylogenetic tree optimization is necessary for precise analysis of evolutionary and transmission dynamics, but existing tools are inadequate for handling the scale and pace of data produced during the COVID-19 pandemic. One transformative approach, online phylogenetics, aims to incrementally add samples to an ever-growing phylogeny, but no previously existing approach can efficiently optimize this vast phylogeny under the time constraints of the pandemic.
Year: 2022 PMID: 35731204 PMCID: PMC9344837 DOI: 10.1093/bioinformatics/btac401
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. An illustration of the matOptimize algorithm. (A) A flowchart of the different algorithmic stages in matOptimize. (B) An example of how matOptimize estimates the parsimony score improvement achievable from a single SPR move using a small number of steps without redoing the entire Fitch algorithm (Fitch, 1971). Each node of this tree has an integer label (1–12) and is annotated with its Fitch and boundary allele sets (Section 2). This example evaluates a move in which the subtree rooted at node 12 is pruned from node 9 and regrafted at node 4. For this move, the alleles in the Fitch set of node 12 (i.e. the single allele C) must be decremented at node 9 (during pruning) and incremented at node 4 (during regrafting). Since C is the only allele in the Fitch set of node 9, the alleles from the boundary allele set of node 9 get added to its Fitch set, resulting in an updated Fitch set {A, C}, with a lower major allele count. The change in Fitch set of node 9 is propagated upwards to its parent, i.e. to root node 1. During the pruning step, the decrement of the major allele count at node 9 has no effect on the parsimony score since it is offset by the decrement in the children count. During regrafting, since allele C, which is already present in the Fitch set of node 4, is incremented, its major allele count is now higher than the remaining alleles in the Fitch set, i.e. allele A, which is decremented. This change is propagated upwards, i.e. to parent node 2, but has no effect on it since allele A is not present in its Fitch set. Since the regrafting step also does not change the parsimony, the net parsimony score change of this move is 0. (C) Storage requirements for a mutation in the MAT data structure of matOptimize. This is a modified version of the original MAT proposed in UShER (Turakhia et al., 2021b) in order to maintain auxiliary information (such as Fitch and boundary allele sets) for performing optimization. Each mutation in matOptimize is stored compactly using only 8 bytes, which helps it maintain a small memory footprint overall. (D) An example phylogenetic tree (left) and its corresponding index tree (right). The index tree is used to speed up the search for promising destination nodes for SPR moves from a single source node via search space pruning (Section 2). Each node in the phylogenetic tree is annotated with (i) a pre-order traversal index, (ii) the depth of the node, (iii) the final allele assignment and (iv) the sensitive allele (Section 2). The index tree uses a B-Tree (Cormen, 2009; Knuth, 2011) to store the nodes at which the sensitive alleles are found (Section 2). Each node in the index tree corresponds to one node in the phylogenetic tree and stores the DFS index range of the subtree that the node covers and the minimum depth within the subtree at which a sensitive allele may be encountered. ‘N/A’ implies that the allele is not present in the subtree.
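Panel (B) contrasts the incremental update in matOptimize with rerunning the entire Fitch algorithm. For reference, the following is a minimal sketch of that full bottom-up Fitch (1971) pass for a single site, assuming a rooted binary tree. It is not the matOptimize implementation, which works on multifurcating trees with allele counts and boundary allele sets. The function name `fitch_score`, the dictionary-based tree encoding and the four-leaf example tree are all hypothetical.

```python
def fitch_score(tree, leaf_alleles, root):
    """Return (parsimony score, Fitch sets) for one site.

    tree: dict mapping each internal node to its (left, right) children.
    leaf_alleles: dict mapping each leaf to its observed allele.
    Assumes a rooted, strictly binary tree (an illustrative simplification).
    """
    fitch_sets = {}
    score = 0

    def visit(node):
        nonlocal score
        if node not in tree:
            # Leaf: the Fitch set is just the observed allele.
            fitch_sets[node] = {leaf_alleles[node]}
            return fitch_sets[node]
        left, right = tree[node]
        a, b = visit(left), visit(right)
        common = a & b
        if common:
            # Children agree on at least one allele: no mutation needed here.
            fitch_sets[node] = common
        else:
            # Children disagree: take the union and count one mutation.
            fitch_sets[node] = a | b
            score += 1
        return fitch_sets[node]

    visit(root)
    return score, fitch_sets


# Hypothetical four-leaf tree ((t1, t2), (t3, t4)) with one A and three C alleles.
tree = {"root": ("u", "v"), "u": ("t1", "t2"), "v": ("t3", "t4")}
leaf_alleles = {"t1": "A", "t2": "C", "t3": "C", "t4": "C"}
score, sets_ = fitch_score(tree, leaf_alleles, "root")
print(score, sets_["root"])
```

Evaluating an SPR move by rerunning this pass touches every node of the tree. The caption describes avoiding exactly that cost by updating only the nodes on the paths above the prune and regraft points. The recursive traversal here is only suitable for small examples; a tree with millions of samples would need an iterative traversal.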
Fig. 2. Comparison of parsimony score improvement and peak memory requirement of matOptimize and TNT starting from a SARS-CoV-2-based UShER-derived (A) 100K-sample tree, (B) 1M-sample tree and (C) 3M-sample tree. For (C), the peak memory requirement is not shown since TNT had not begun the optimization phase by the time it was terminated after 24 h of execution.
Fig. 3. Performance and memory scaling efficiency of matOptimize using the 1M-sample tree. (A) Strong multi-node scaling efficiency of matOptimize. Each node is a GCP instance e2-highcpu-32 consisting of 32 vCPUs and 32 GB memory. The number above each data point corresponds to the actual runtime in minutes:seconds format. (B) The peak memory requirement of matOptimize is small (below 10 GB) and remains roughly constant with the number of CPU threads. This allows matOptimize to exploit all available parallelism on a multicore CPU instance without being limited by the available memory. (C) In comparison, the peak memory requirement of TNT is large (>500 GB) and increases linearly as the parallelism is increased. This limits the amount of parallelism that TNT can exploit: in the example shown, TNT could use only 8 of the 40 vCPUs available on the memory-optimized GCP instance m1-ultramem-40 before running out of memory.
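Panel (A) reports strong scaling efficiency alongside runtimes in minutes:seconds. The caption does not give the formula, so the sketch below assumes the conventional definition: speedup over a baseline run divided by the increase in node count, for a fixed problem size. The helper names and the two runtimes are hypothetical and are not values from the figure.

```python
def parse_mmss(text):
    """Convert a 'minutes:seconds' string into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def strong_scaling_efficiency(base_runtime_s, runtime_s, nodes, base_nodes=1):
    """Efficiency = speedup / (nodes / base_nodes) for a fixed problem size.

    A value of 1.0 means perfectly linear scaling; lower values mean that
    part of the added hardware is lost to overheads.
    """
    speedup = base_runtime_s / runtime_s
    return speedup / (nodes / base_nodes)


# Hypothetical runtimes, NOT taken from the figure.
t_one_node = parse_mmss("40:00")    # 2400 s on 1 node
t_four_nodes = parse_mmss("12:30")  # 750 s on 4 nodes
efficiency = strong_scaling_efficiency(t_one_node, t_four_nodes, nodes=4)
print(f"{efficiency:.2f}")
```

With these made-up numbers the speedup is 3.2 on four nodes, which corresponds to an efficiency of 0.8.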