Jakob McBroome1,2, Bryan Thornlow1,2, Angie S Hinrichs2, Alexander Kramer1,2, Nicola De Maio3, Nick Goldman3, David Haussler1,2, Russell Corbett-Detig1,2, Yatish Turakhia1,2.
Abstract
The vast scale of SARS-CoV-2 sequencing data has made it increasingly challenging to comprehensively analyze all available data using existing tools and file formats. To address this, we present a database of SARS-CoV-2 phylogenetic trees inferred with unrestricted public sequences, which we update daily to incorporate new sequences. Our database uses the recently proposed mutation-annotated tree (MAT) format to efficiently encode the tree with branches labeled with parsimony-inferred mutations, as well as Nextstrain clade and Pango lineage labels at clade roots. As of June 9, 2021, our SARS-CoV-2 MAT consists of 834,521 sequences and provides a comprehensive view of the virus' evolutionary history using public data. We also present matUtils, a command-line utility for rapidly querying, interpreting, and manipulating MATs. Our daily-updated SARS-CoV-2 MAT database and matUtils software are available at http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ and https://github.com/yatisht/usher, respectively.
Keywords: COVID-19; SARS-CoV-2 phylogenetics; genomic surveillance
Year: 2021 PMID: 34469548 PMCID: PMC8662617 DOI: 10.1093/molbev/msab264
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 8.800
Fig. 1. matUtils functions enable fast, user-friendly analysis of MATs. (A) An example MAT, with the tree topology shown on the left and the mutation annotations on each node shown on the right. (B) matUtils annotate allows the user to annotate internal nodes with clade names. In this example, nodes 1 and 3 are annotated with clade names 19A and 19B, respectively. This MAT serves as an input to the commands shown in panels C–F. (C) matUtils summary outputs sample-, clade-, and tree-level statistics for the input MAT. (D) matUtils extract allows users to convert an MAT to Newick format (left) and to subset the MAT for a specified clade (center) or mutation (right), among other functions. (E) matUtils uncertainty outputs parsimony scores, equally parsimonious placements, and neighborhood sizes for each sample of an input MAT. Sample B has two equally parsimonious placements, as it could also be placed as a descendant of node 5 with terminal mutations C2G, A4U, and G5C. (F) matUtils introduce takes a list of samples of interest as input and outputs the largest monophyletic clade and regional association index associated with the input population, along with their predicted introduction nodes and paths. In all panels, user input commands are shown in large fonts (e.g., “matUtils annotate”) and output text from these commands is shown in monospaced fonts.
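The operations in panels A, B, and D can be illustrated with a minimal Python sketch of a mutation-annotated tree. This is not the matUtils implementation or its file format: the `Node` class, the function names, and the toy tree (including its sample names and mutations) are hypothetical and do not reproduce the data in the figure. The sketch also assumes that a mutation is never reverted further down the tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a toy mutation-annotated tree (hypothetical structure)."""
    name: str
    mutations: List[str] = field(default_factory=list)  # on the branch above this node
    clade: Optional[str] = None                         # clade label at a clade root
    children: List["Node"] = field(default_factory=list)


def to_newick(node: Node) -> str:
    """Serialize the topology, using the mutation count as branch length."""
    label = f"{node.name}:{len(node.mutations)}"
    if not node.children:
        return label
    return "(" + ",".join(to_newick(c) for c in node.children) + ")" + label


def find_clade(node: Node, clade: str) -> Optional[Node]:
    """Return the root node of the subtree annotated with `clade`, if any."""
    if node.clade == clade:
        return node
    for child in node.children:
        hit = find_clade(child, clade)
        if hit is not None:
            return hit
    return None


def samples_with_mutation(node: Node, mutation: str, inherited: bool = False) -> List[str]:
    """List leaves whose root-to-leaf path carries `mutation` (reversions ignored)."""
    found = inherited or mutation in node.mutations
    if not node.children:
        return [node.name] if found else []
    names: List[str] = []
    for child in node.children:
        names.extend(samples_with_mutation(child, mutation, found))
    return names


# Hypothetical toy tree with two clade annotations.
root = Node("node_1", clade="19A", children=[
    Node("sample_A", ["G1C"]),
    Node("node_2", ["A3T"], clade="19B", children=[
        Node("sample_B", ["C2G"]),
        Node("sample_C"),
    ]),
])

print(to_newick(root) + ";")
print(to_newick(find_clade(root, "19B")) + ";")
print(samples_with_mutation(root, "A3T"))
```

Storing mutations on branches rather than full sequences at the leaves is what keeps the representation compact: a sample's genotype is recovered by collecting the mutations along its path from the root.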
Fig. 2. matUtils can generate informative visuals with Auspice. The above trees represent a clade of related B.1.1.7 samples from the United States that secondarily acquired the potentially important spike protein mutation E484K, which is caused by the nucleotide mutation G23012A. These trees were obtained by running the command “matUtils extract -i public-2021-06-09.all.masked.nextclade.pangolin.pb.gz -c B.1.1.7 -m G23012A -H '(USA.*)' -N 500 -j clade_trees -d clade_out”, which selects all samples from clade B.1.1.7 that acquired this mutation and are from the United States, then identifies the minimum set of 500-sample subtrees that contain all of these samples, creating an Auspice v2 format JSON for each subtree (Hadfield et al. 2018). This results in 35 distinct subtree JSON files of 500 samples each in the output directory. Panel A represents the entirety of subtree six as viewed with Auspice (Hadfield et al. 2018), including blue highlights and a branch label where our mutation of interest occurred. Panel B is zoomed in on this subtree and its sister clade; at this scale, we can read individual sample names and observe that this specific strain was actively spreading in the United States during April 2021.
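The selection step of this command combines three filters: clade membership, presence of a mutation, and a regular expression on the sample name. The Python sketch below illustrates that intersection only. The `Sample` record, the `select` function, and the sample names are hypothetical, and the sketch assumes that the expression must match the whole sample name, which may differ from how matUtils applies it.

```python
import re
from typing import FrozenSet, List, NamedTuple


class Sample(NamedTuple):
    """Hypothetical flat record for one sample in the tree."""
    name: str
    clade: str
    mutations: FrozenSet[str]


def select(samples: List[Sample], clade: str, mutation: str, name_regex: str) -> List[str]:
    """Keep samples that pass all three filters (clade, mutation, name)."""
    pattern = re.compile(name_regex)
    return [
        s.name
        for s in samples
        if s.clade == clade
        and mutation in s.mutations
        and pattern.fullmatch(s.name)  # assumption: whole-name match
    ]


# Hypothetical sample names and genotypes, for illustration only.
samples = [
    Sample("USA/CA-0001/2021", "B.1.1.7", frozenset({"G23012A"})),
    Sample("USA/NY-0002/2021", "B.1.1.7", frozenset()),
    Sample("England/X-0003/2021", "B.1.1.7", frozenset({"G23012A"})),
    Sample("USA/TX-0004/2021", "B.1.351", frozenset({"G23012A"})),
]

print(select(samples, "B.1.1.7", "G23012A", r"(USA.*)"))
```

Only the first record passes all three filters; each of the others fails exactly one of them.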
Fig. 3. matUtils uncertainty statistics reveal low-quality sample placements. This Auspice view of an example subtree is annotated with both the number of equally parsimonious placements (in color) and the neighborhood size (branch label integers). Eighteen of the 23 samples in the subtree have a single placement and a neighborhood size of 0, indicating high placement certainty for those samples. Of the five samples with multiple equally parsimonious placements, one has five equally parsimonious placements and a neighborhood size score (NSS) of 19, indicating a high level of placement uncertainty for this sample, spanning a relatively large neighborhood.
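The two statistics in this caption can be illustrated with a simplified Python sketch. It is not the matUtils algorithm. It assumes that a sample is attached as a child of an existing node, that the placement cost is the number of mutations separating the sample from that node's genotype, and that the neighborhood size is the longest path, counted in edges, between any two equally parsimonious placement nodes. The tree, node names, and mutations are hypothetical.

```python
from itertools import combinations

# Hypothetical toy tree: child -> parent, plus the mutations on each node's branch.
PARENT = {"n1": None, "n2": "n1", "n3": "n1", "n4": "n2", "n5": "n2"}
BRANCH_MUTS = {
    "n1": set(), "n2": {"A3T"}, "n3": {"G1C"}, "n4": {"C2G"}, "n5": {"G5C"},
}


def path_to_root(node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def genotype(node):
    """Collect all mutations on the path from the root to `node`."""
    muts = set()
    for n in path_to_root(node):
        muts |= BRANCH_MUTS[n]
    return muts


def best_placements(sample_muts):
    """Return the minimum cost and every node that achieves it."""
    # Cost = mutations present in exactly one of sample and node genotype.
    costs = {n: len(sample_muts ^ genotype(n)) for n in PARENT}
    best = min(costs.values())
    return best, sorted(n for n, c in costs.items() if c == best)


def distance(a, b):
    """Number of edges on the path between nodes a and b."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = set(pa) & set(pb)
    da = next(i for i, n in enumerate(pa) if n in common)
    db = next(i for i, n in enumerate(pb) if n in common)
    return da + db


def neighborhood_size(nodes):
    """Longest path between any two placement nodes (0 for a single node)."""
    return max((distance(a, b) for a, b in combinations(nodes, 2)), default=0)


cost, nodes = best_placements({"A3T", "C2G", "G5C"})
print(cost, nodes, neighborhood_size(nodes))
```

In this toy case the sample fits equally well under `n4` and `n5`, so it has two equally parsimonious placements separated by two edges. A sample with a single best placement has a neighborhood size of 0, matching the interpretation given in the caption.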