Cheng Ye1, Bryan Thornlow2,3, Alexander Kramer2,3, Jakob McBroome2,3, Angie Hinrichs3, Russell Corbett-Detig2,3, Yatish Turakhia1.
Abstract
Phylogenetics has been central to the genomic surveillance, epidemiology and contact tracing efforts during the COVID-19 pandemic. But the massive scale of genomic sequencing has rendered the pre-pandemic tools inadequate for comprehensive phylogenetic analyses. Here, we discuss the phylogenetic package that we developed to address the needs imposed by this pandemic. The package incorporates several pandemic-specific optimization and parallelization techniques and comprises four programs: UShER, matOptimize, RIPPLES and matUtils. Using high-performance computing, UShER and matOptimize maintain and refine daily a massive mutation-annotated phylogenetic tree consisting of all SARS-CoV-2 sequences available in online repositories. With UShER and RIPPLES, individual labs - even with modest compute resources - incorporate newly sequenced SARS-CoV-2 genomes into this phylogeny and discover evidence for recombination in real time. With matUtils, they rapidly query and visualize massive SARS-CoV-2 phylogenies. These tools have empowered scientists worldwide to study SARS-CoV-2 evolution and transmission at an unprecedented scale, resolution and speed.
Year: 2021 PMID: 34927180 PMCID: PMC8679213 DOI: 10.1101/2021.12.03.470766
Source DB: PubMed Journal: bioRxiv
Figure 1: Innovative optimizations realized in (A) UShER, (B) matOptimize and (C) RIPPLES for phylogenetic placement, tree optimization and recombination detection, respectively. The left side shows a representative illustration of the prior approaches and the right side illustrates the approach used in our tools.
Figure 3: Comparison of our phylogenetic package with previous state-of-the-art tools for (A) phylogenetic placement, (B) tree optimization and (C) recombination detection. Our tools achieve large improvements in runtime (left) as well as peak memory requirements (right).
Figure 2: For parallelizing phylogenetic placement over multiple CPU nodes, we split up the VCF containing new samples uniformly and distribute them over independent CPU nodes, each executing UShER to place the corresponding samples on the base mutation-annotated tree (MAT). The resulting MAT files are then merged using a parallel reduction tree of matUtils merge into a single output MAT containing new samples.
Figure 4: Strong scaling analysis for (A) UShER, (B) matOptimize and (C) RIPPLES.
Figure 5: Total time (in orange) to place 100K new samples on the 1M-sample tree using UShER followed by parallel reduction using matUtils merge, with the merge component shown separately (in blue), for different levels of parallelism.
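The split-and-merge scheme described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: `split_uniform`, `tree_reduce`, `place` and `merge` are hypothetical names, Python sets stand in for MAT files, and a thread pool stands in for independent CPU nodes. It only shows the shape of the computation: uniform splitting, independent placement, then pairwise merging level by level.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def split_uniform(items: Sequence[T], n_parts: int) -> List[List[T]]:
    """Split items into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


def tree_reduce(parts: List[T], merge: Callable[[T, T], T]) -> T:
    """Merge parts pairwise, level by level; merges within one level run concurrently."""
    if not parts:
        raise ValueError("nothing to merge")
    level = list(parts)
    with ThreadPoolExecutor() as pool:
        while len(level) > 1:
            pairs = [(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
            merged = list(pool.map(lambda pair: merge(*pair), pairs))
            if len(level) % 2:
                # An unpaired last element is carried over to the next level.
                merged.append(level[-1])
            level = merged
    return level[0]


# Stand-ins (assumptions): a "tree" is just the set of sample names it contains.
base_tree = frozenset({"base_sample"})
new_samples = [f"s{i}" for i in range(10)]


def place(chunk: List[str]) -> frozenset:
    """Stand-in for one node placing its chunk of samples on the base tree."""
    return base_tree | frozenset(chunk)


def merge(a: frozenset, b: frozenset) -> frozenset:
    """Stand-in for merging two trees that share the same base."""
    return a | b


chunks = split_uniform(new_samples, 4)
placed = [place(chunk) for chunk in chunks]
merged = tree_reduce(placed, merge)
```

With `n` partial results, the pairwise scheme needs about `log2(n)` merge levels rather than `n - 1` sequential merges, which is why the merge step can stay a small part of the total as parallelism grows.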
Weak scaling analysis for (A) UShER, (B) matOptimize and (C) RIPPLES.

A. UShER

| vCPU | Samples placed | Time |
|---|---|---|
| 64 | 6.25K | 26m 48s |
| 128 | 12.5K | 28m 22s |
| 256 | 25K | 30m 41s |
| 512 | 50K | 33m 36s |
| 1024 | 100K | 37m 07s |

B. matOptimize

| vCPU | Source nodes explored | Time |
|---|---|---|
| 64 | 39789 | 10m 45s |
| 128 | 79577 | 11m 54s |
| 256 | 159154 | 11m 51s |
| 512 | 318308 | 11m 58s |
| 1024 | 636616 | 11m 30s |

C. RIPPLES

| vCPU | Long branches explored | Time |
|---|---|---|
| 64 | 587 | 49m 29s |
| 128 | 1174 | 53m 33s |
| 256 | 2348 | 54m 33s |
| 512 | 4696 | 56m 18s |
| 1024 | 9391 | 55m 52s |

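One common way to read a weak scaling table is to compute efficiency as the baseline runtime divided by the runtime at each larger scale, since the work per vCPU is held roughly constant. The short Python sketch below does this for the times reported above. The metric and the names `to_seconds` and `weak_scaling_efficiency` are our assumptions for illustration; the source reports only the raw timings.

```python
import re
from typing import Dict, List

VCPUS = [64, 128, 256, 512, 1024]

# Wall-clock times copied from the weak scaling tables.
TIMES: Dict[str, List[str]] = {
    "UShER": ["26m 48s", "28m 22s", "30m 41s", "33m 36s", "37m 07s"],
    "matOptimize": ["10m 45s", "11m 54s", "11m 51s", "11m 58s", "11m 30s"],
    "RIPPLES": ["49m 29s", "53m 33s", "54m 33s", "56m 18s", "55m 52s"],
}


def to_seconds(text: str) -> int:
    """Convert a duration such as '26m 48s' to seconds."""
    match = re.fullmatch(r"(\d+)m (\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


def weak_scaling_efficiency(times: List[str]) -> List[float]:
    """Efficiency relative to the smallest run: T(baseline) / T(N)."""
    seconds = [to_seconds(t) for t in times]
    return [seconds[0] / s for s in seconds]


efficiency = {tool: weak_scaling_efficiency(t) for tool, t in TIMES.items()}

for tool, values in efficiency.items():
    row = ", ".join(f"{n} vCPU: {e:.2f}" for n, e in zip(VCPUS, values))
    print(f"{tool}: {row}")
```

Under this definition, a value of 1.0 means the runtime did not grow at all as both the workload and the vCPU count increased 16-fold. At 1024 vCPUs the reported times correspond to roughly 0.72 for UShER, 0.93 for matOptimize and 0.89 for RIPPLES.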
| Performance Attribute | Our Submission |
|---|---|
| Category of achievement | peak performance, scalability |
| Type of method used | N/A |
| Results reported on the basis of | whole application including I/O |
| Precision reported | N/A |
| System scale | results measured on full-scale system |
| Measurement mechanism | timers |