
A Daily-Updated Database and Tools for Comprehensive SARS-CoV-2 Mutation-Annotated Trees.

Jakob McBroome^1,2, Bryan Thornlow^1,2, Angie S Hinrichs^2, Alexander Kramer^1,2, Nicola De Maio^3, Nick Goldman^3, David Haussler^1,2, Russell Corbett-Detig^1,2, Yatish Turakhia^1,2.

Abstract

The vast scale of SARS-CoV-2 sequencing data has made it increasingly challenging to comprehensively analyze all available data using existing tools and file formats. To address this, we present a database of SARS-CoV-2 phylogenetic trees inferred with unrestricted public sequences, which we update daily to incorporate new sequences. Our database uses the recently proposed mutation-annotated tree (MAT) format to efficiently encode the tree with branches labeled with parsimony-inferred mutations, as well as Nextstrain clade and Pango lineage labels at clade roots. As of June 9, 2021, our SARS-CoV-2 MAT consists of 834,521 sequences and provides a comprehensive view of the virus's evolutionary history using public data. We also present matUtils, a command-line utility for rapidly querying, interpreting, and manipulating the MATs. Our daily-updated SARS-CoV-2 MAT database and matUtils software are available at http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ and https://github.com/yatisht/usher, respectively.


Keywords: COVID-19; SARS-CoV-2 phylogenetics; genomic surveillance


Year: 2021 PMID: 34469548 PMCID: PMC8662617 DOI: 10.1093/molbev/msab264

Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 8.800

The COVID-19 pandemic has inspired unprecedented levels of genome sequencing for a single pathogen (Hodcroft et al. 2021). Over a million SARS-CoV-2 genomes have been sequenced worldwide so far, and tens of thousands of new genomes are being uploaded daily (Maxmen 2021). These data have enabled scientists to closely track the evolution and transmission dynamics of the virus at global and local scales (Deng et al. 2020; Chaillon and Smith 2021; da Silva Filipe et al. 2021). However, the scale of these data poses serious computational challenges for comprehensive phylogenetic analyses (Hodcroft et al. 2021). Platforms like Nextstrain (Hadfield et al. 2018) have been invaluable in studying viral transmission networks and genomic surveillance efforts, but they only provide subsampled SARS-CoV-2 trees consisting of a tiny fraction of available data, omitting phylogenetic relationships with most available sequences. A single, comprehensive SARS-CoV-2 reference tree of all available data could not only facilitate detailed and unambiguous phylogenetic analyses at global, country, and local levels but may also help promote consistency of results across different research groups (Turakhia et al. 2020). The massive volume of SARS-CoV-2 data also poses numerous data sharing challenges with existing file formats, such as FASTA or Variant Call Format (VCF), which are bulky and necessitate network speeds and computational capabilities that are beyond the reach of many research and scientific groups.

New Approaches

In this work, we simultaneously address the issue of maintaining a comprehensive SARS-CoV-2 reference tree and its associated data processing, sharing, and analysis challenges. Specifically, we are maintaining and openly sharing a daily-updated database of mutation-annotated trees (MATs) containing global SARS-CoV-2 sequences from public databases, without any downsampling (other than for quality control, see supplementary methods, Supplementary Material online), including annotations for Nextstrain clades (Hadfield et al. 2018) and Pango lineages (Rambaut et al. 2020; supplementary fig. 1, Supplementary Material online). The MAT is an extremely efficient data format proposed recently (Turakhia, Thornlow, Hinrichs, De Maio, et al. 2021) which uses a form of phylogenetic compression (Ané and Sanderson 2005) to facilitate sharing of extremely large genome sequence data sets. An uncompressed MAT of 834,521 SARS-CoV-2 public sequences requires only 65 MB to store and encodes more information than is contained in a 43 GB VCF holding the single-nucleotide variation of all sequences (the MAT format does not handle insertions and deletions [Turakhia, Thornlow, Hinrichs, De Maio, et al. 2021]) together with a 38 MB Newick file containing the phylogenetic tree topology. To accompany this database, we present matUtils, a toolkit for rapidly querying, interpreting, and manipulating the MATs included in our database or constructed with UShER (Turakhia, Thornlow, Hinrichs, De Maio, et al. 2021). Using matUtils, common operations in genomic surveillance and contact tracing efforts, including annotating an MAT with new clades, extracting specific subtrees, or converting the MAT to standard Newick or VCF format, can be performed in a matter of seconds to minutes even on a laptop. We also provide a web interface for matUtils through the UCSC SARS-CoV-2 Genome Browser (Fernandes et al. 2020). Together, our SARS-CoV-2 database and matUtils toolkit can simultaneously democratize and accelerate pandemic-related research.
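
To illustrate why storing mutations on branches is compact, the following minimal Python sketch models a toy MAT and recovers a sample's sequence by replaying the mutations on its root-to-sample path. The reference string, node names, and mutations are hypothetical, and the dictionary layout is an assumption made for illustration; it is not the on-disk MAT encoding.

```python
# Toy mutation-annotated tree (hypothetical data, not the real MAT encoding).
# Each node stores only the mutations on the branch leading to it.
REFERENCE = "ACGTA"  # toy reference genome; positions are 1-based

PARENT = {"root": None, "n1": "root", "n2": "root",
          "A": "n1", "B": "n1", "C": "n2"}
BRANCH_MUTATIONS = {
    "root": [], "n1": ["C2G"], "n2": ["A5T"],
    "A": [], "B": ["G3A"], "C": ["C2T"],
}

def path_from_root(node):
    """Return the list of nodes from the root down to `node`."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]

def genotype(node):
    """Rebuild a node's sequence by applying branch mutations in order."""
    seq = list(REFERENCE)
    for n in path_from_root(node):
        for mut in BRANCH_MUTATIONS[n]:
            position, alt = int(mut[1:-1]), mut[-1]
            seq[position - 1] = alt
    return "".join(seq)

if __name__ == "__main__":
    for sample in ("A", "B", "C"):
        print(sample, genotype(sample))
```

In this toy layout, a mutation shared by many samples (such as C2G on the branch to n1) is stored once on the ancestral branch rather than once per sample, which is the intuition behind the size difference relative to a per-sample VCF.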

Results and Discussion

A Daily-Updated MAT Database of Global SARS-CoV-2 Sequences

To aid the scientific community studying the mutational and transmission dynamics of the SARS-CoV-2 virus and its different variants, we are maintaining a daily-updated database of SARS-CoV-2 MATs composed of public data. Starting with the final Newick tree release dated November 13, 2020, of Rob Lanfear’s sarscov2phylo (https://github.com/roblanf/sarscov2phylo, last accessed September 6, 2021) that is rerooted to Wuhan/Hu-1 (GenBank MN908947.3, RefSeq NC_045512.2), we have set up an automated pipeline to aggregate public sequences available through GenBank (Clark et al. 2007), COG-UK (Nicholls et al. 2021), and the China National Center for Bioinformation on a daily basis and incorporate them into our MAT using UShER (supplementary methods, Supplementary Material online). GISAID data (Shu and McCauley 2017) are not included in our MATs because its usage terms do not allow redistribution. Similar to GISAID, our database is subject to the sampling bias resulting from the vast disparity in the sequencing efforts of various countries (Cyranoski 2021, supplementary fig. 1, Supplementary Material online). We also use the matUtils annotate command (supplementary methods, Supplementary Material online) to add Nextstrain clade and Pango lineage annotations to individual branches of our MAT. As of June 9, 2021, our MAT consists of 834,521 sequences, includes 14 Nextstrain clade and 895 Pango lineage annotations for all samples, and is only 65 MB, or 14 MB when gzip-compressed (supplementary fig. 1 and table S1, Supplementary Material online). To our knowledge, this is the most comprehensive representation of the SARS-CoV-2 evolutionary history using publicly available sequences. It can be freely used to study evolutionary and transmission dynamics of the virus at global, country, and local levels, and can be visualized using the Cov2Tree tool (https://cov2tree.org/, last accessed September 6, 2021) developed by Theo Sanderson.

matUtils Provides a Wide Range of Functions to Analyze and Manipulate MATs

We have created a high-performance command-line utility called matUtils for performing a wide range of operations on MATs for rapid interpretation and analysis in genomic surveillance and contact tracing efforts. matUtils is distributed with the UShER package (Turakhia, Thornlow, Hinrichs, De Maio, et al. 2021) and uses the same MAT format as UShER. matUtils is organized into five different subcommands: annotate, summary, extract, uncertainty, and introduce (fig. 1), described briefly below. We provide detailed instructions for the usage of each module on our wiki (https://usher-wiki.readthedocs.io/en/latest/matUtils.html, last accessed September 6, 2021).

Fig. 1

matUtils functions enable fast, user-friendly analysis of MATs. (A) An example MAT, with the tree topology shown on the left and the mutation annotations on each node shown on the right. (B) matUtils annotate allows the user to annotate internal nodes with clade names. In this example, nodes 1 and 3 are annotated with clade names 19A and 19B, respectively. This MAT serves as an input to commands shown in panels C–F. (C) matUtils summary outputs sample-, clade-, and tree-level statistics for the input MAT. (D) matUtils extract allows users to convert an MAT to Newick format (left) or to subset the MAT for a specified clade (center) or mutation (right), among other functions. (E) matUtils uncertainty outputs parsimony scores, equally parsimonious placements, and neighborhood sizes for each sample of an input MAT. Sample B has two equally parsimonious placements, as it could also be placed as a descendant of node 5 with terminal mutations C2G, A4U, and G5C. (F) matUtils introduce can take a list of samples of interest as input and output the largest monophyletic clade and regional association index associated with the input population, along with their predicted introduction nodes and paths. In all panels, user input commands are shown in large fonts (e.g., “matUtils annotate”) and output text from these commands is shown in monospaced fonts.

Annotate

This function annotates clades in an MAT. One of the central uses of phylogenetics during the pandemic is to trace the emergence and spread of new viral lineages. Nextstrain (Hadfield et al. 2018), Pango (Rambaut et al. 2020), and GISAID (Shu and McCauley 2017) provide different nomenclatures for SARS-CoV-2 variants that have been used widely in genomic surveillance. Our MAT format provides the ability to annotate internal branches of the tree with an array of clade names, one for each clade nomenclature. matUtils annotate provides two methods for annotation: 1) directly providing the mappings of each clade name to its corresponding node or 2) providing a set of representative sample names for each clade from which the clade roots can be automatically inferred (supplementary methods, Supplementary Material online). Both methods ensure that the clades remain monophyletic, but we use the second approach to label Nextstrain clades and Pango lineages in our SARS-CoV-2 MAT database since it can be automated using available data (supplementary methods, Supplementary Material online). matUtils annotate has high congruence with Nextstrain clades and Pango lineage annotations (supplementary table S1, Supplementary Material online). Once clades are annotated on an MAT, the UShER placement tool (Turakhia, Thornlow, Hinrichs, De Maio, et al. 2021) can assign each newly placed sequence to its corresponding Pango lineage. This is being used as a feature in Pangolin 3.0 (https://github.com/cov-lineages/pangolin/releases/tag/v3.0, last accessed September 6, 2021) to perform clade assignments in a fully phylogenetic framework.
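
The second annotation method can be pictured with a small Python sketch. It assumes, as a simplification, that the clade root is the most recent common ancestor (MRCA) of the representative samples; the inference actually used by matUtils annotate is described in the supplementary methods and may differ. The tree, sample names, and clade assignments are hypothetical.

```python
# Simplified clade-root inference: take the MRCA of representative samples.
# This MRCA rule is an assumption for illustration, not the matUtils method.
PARENT = {"root": None, "n1": "root", "n2": "root",
          "A": "n1", "B": "n1", "C": "n2", "D": "n2"}

# One set of representative samples per clade, per nomenclature (toy data).
REPRESENTATIVES = {
    "nextstrain": {"19A": ["A", "B"], "19B": ["C", "D"]},
    "pango": {"A": ["A", "D"], "B": ["C", "D"]},
}

def ancestors(node):
    """Return `node` and all of its ancestors, deepest first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain

def mrca(samples):
    """Return the deepest node that is an ancestor of every sample."""
    common = ancestors(samples[0])
    for sample in samples[1:]:
        other = set(ancestors(sample))
        common = [n for n in common if n in other]
    return common[0]

def annotate(representatives, nomenclatures):
    """Map each clade root to an array of names, one slot per nomenclature."""
    annotations = {}
    for column, scheme in enumerate(nomenclatures):
        for clade, samples in representatives[scheme].items():
            node = mrca(samples)
            slots = annotations.setdefault(node, [""] * len(nomenclatures))
            slots[column] = clade
    return annotations

if __name__ == "__main__":
    print(annotate(REPRESENTATIVES, ["nextstrain", "pango"]))
```

Because each clade name is attached to a single internal node, every clade in this sketch is monophyletic by construction, mirroring the property stated above.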

Summary

This function provides a brief summary of the available data in the input MAT file and is meant to serve as a typical first step in any MAT-based analysis. It provides a count of the total number of samples in the MAT, the size of each annotated clade, the total parsimony score (i.e., the sum of mutation events on all branches of the MAT), the number of distinct mutations, phylogenetically informed translation of mutations, and other similar statistics.
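
As a rough illustration of these statistics, the Python sketch below computes them for a toy tree. The tree, mutations, and clade roots are hypothetical, and the output layout is an assumption; it does not reproduce the format written by matUtils summary.

```python
# Toy summary statistics for a mutation-annotated tree (hypothetical data).
CHILDREN = {"root": ["n1", "n2"], "n1": ["A", "B"], "n2": ["C", "D"]}
BRANCH_MUTATIONS = {
    "root": [], "n1": ["C2G"], "n2": ["A5T"],
    "A": [], "B": ["G3A"], "C": ["C2G"], "D": [],
}
CLADE_ROOTS = {"19A": "n1", "19B": "n2"}  # clade name -> annotated node

def leaves_below(node):
    """Return all samples (leaves) descended from `node`."""
    kids = CHILDREN.get(node, [])
    if not kids:
        return [node]
    found = []
    for kid in kids:
        found.extend(leaves_below(kid))
    return found

def summarize(root="root"):
    """Collect sample count, parsimony score, and clade sizes."""
    all_mutations = [m for muts in BRANCH_MUTATIONS.values() for m in muts]
    return {
        "samples": len(leaves_below(root)),
        # Total parsimony score: every mutation event on every branch.
        "total_parsimony": len(all_mutations),
        # A recurrent mutation (C2G here) counts once as a distinct mutation.
        "distinct_mutations": len(set(all_mutations)),
        "clade_sizes": {c: len(leaves_below(n)) for c, n in CLADE_ROOTS.items()},
    }

if __name__ == "__main__":
    print(summarize())
```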

Extract

Many SARS-CoV-2 phylodynamic studies involve restricting the analysis to a smaller tree of interest. Although it can be computationally challenging to identify samples most closely related to a given sample or cluster from over a million other sequences, it is straightforward to retrieve subtrees from a comprehensive phylogeny. matUtils extract provides an efficient and robust suite of options for subtree selection from an MAT. A user can use matUtils extract to subsample an MAT to find samples that contain a mutation of interest, are members of a specific clade, or have a name matching a specific regular expression pattern (such as the expression “[IND*|India*]” to select samples from India), among other criteria (supplementary methods, Supplementary Material online). matUtils extract also includes options to identify, from an MAT, sequences that have descended from long internal branches in the tree, which can sometimes arise from recombination (Jackson et al. 2021; Turakhia, Thornlow, Hinrichs, McBroome, et al. 2021), or those with an unusually high parsimony score, which is indicative of low-quality sequences (Mai and Mirarab 2018). Notably, matUtils extract can produce an output Auspice v2 JSON that is compatible with the Auspice tree visualization tool (Hadfield et al. 2018; fig. 2, supplementary methods, Supplementary Material online). matUtils extract can also convert an MAT into other file formats, such as a Newick for its corresponding phylogenetic tree and a VCF for its corresponding genome variation data. matUtils extract also provides an option to resolve all polytomies in an MAT arbitrarily, similar to the multi2di functionality in ape (Paradis and Schliep 2019), for compatibility with phylogenetic tools that do not allow polytomies.
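
The selection and conversion steps described above can be sketched in Python on a toy tree. The sample names, mutations, and helper functions are hypothetical, and the sketch deliberately ignores reversions (a descendant branch undoing an ancestral mutation); it illustrates the idea rather than the matUtils extract implementation.

```python
import re

# Toy tree with hypothetical sample names and branch mutations.
CHILDREN = {"root": ["n1", "n2"], "n1": ["USA/1", "IND/2"], "n2": ["USA/3"]}
BRANCH_MUTATIONS = {
    "root": [], "n1": ["G23012A"], "n2": [],
    "USA/1": [], "IND/2": ["C241T"], "USA/3": ["G23012A"],
}

def select_samples(node, mutation, name_pattern, inherited=False):
    """Return samples carrying `mutation` whose names match `name_pattern`.

    Simplification: a mutation seen on any ancestral branch is assumed to
    persist in all descendants (reversions are not modeled).
    """
    has_mutation = inherited or mutation in BRANCH_MUTATIONS[node]
    kids = CHILDREN.get(node, [])
    if not kids:
        keep = has_mutation and re.search(name_pattern, node)
        return [node] if keep else []
    selected = []
    for kid in kids:
        selected.extend(select_samples(kid, mutation, name_pattern, has_mutation))
    return selected

def to_newick(node, keep):
    """Newick string for the subtree induced by the samples in `keep`."""
    kids = CHILDREN.get(node, [])
    if not kids:
        return node if node in keep else None
    parts = [p for p in (to_newick(k, keep) for k in kids) if p is not None]
    if not parts:
        return None
    if len(parts) == 1:
        return parts[0]  # collapse nodes left with a single child
    return "(" + ",".join(parts) + ")"

if __name__ == "__main__":
    chosen = select_samples("root", "G23012A", r"^USA")
    print(chosen)
    print(to_newick("root", set(chosen)) + ";")
```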

Fig. 2

matUtils can generate informative visuals with Auspice. The above trees represent a clade of related B.1.1.7 samples from the United States which secondarily acquired the potentially important spike protein mutation E484K, which is caused by the nucleotide mutation G23012A. These trees were obtained by running the command “matUtils extract -i public-2021-06-09.all.masked.nextclade.pangolin.pb.gz -c B.1.1.7 -m G23012A -H ‘(USA.*)’ -N 500 -j clade_trees -d clade_out”, which selects all samples from clade B.1.1.7 which acquired this mutation and are from the United States, then identifies the minimum set of 500 sample subtrees which contain all of these samples, creating an Auspice v2 format JSON for each subtree (Hadfield et al. 2018). This results in 35 distinct subtree JSON files of 500 samples each in the output directory. Panel A represents the entirety of subtree six as viewed with Auspice (Hadfield et al. 2018), including blue highlights and a branch label where our mutation of interest occurred. Panel B is zoomed in on this subtree and its sister clade; at this scale, we can read individual sample names and observe that this specific strain has been actively spreading in the United States during April 2021.

Uncertainty

A fundamental concern in SARS-CoV-2 phylogenetics is topological uncertainty (Hodcroft et al. 2021), which may result from contaminated sequences or sample mixtures (Turakhia, Thornlow, Hinrichs, De Maio, et al. 2021). The impact of this concern depends on the biological context of the analysis. matUtils uncertainty provides a topological uncertainty statistic that computes the number of equally parsimonious placements that exist for each specified sample in the input MAT. Importantly, matUtils also allows the user to calculate equally parsimonious positions for already placed samples. This is accomplished by pruning the sample from the tree and placing the sample back to the tree using the placement module of UShER (Turakhia, Thornlow, Hinrichs, De Maio, et al. 2021; supplementary methods, Supplementary Material online). matUtils uncertainty additionally records the number of mutations separating the two most distant equally parsimonious placements, reflecting the distribution of placements across the tree (supplementary methods, Supplementary Material online). The output file is compatible as “drag-and-drop” metadata with the Auspice platform, which allows for a rapid visualization of potentially problematic placements (fig. 3).
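
The two statistics can be illustrated with a simplified Python sketch. It assumes that a sample may attach as a child of any existing node and that the placement cost is the number of sites at which the sample differs from that node's reconstructed sequence; UShER's placement module handles further cases (for example, splitting branches) that are not modeled here. All data and function names are hypothetical.

```python
# Simplified placement uncertainty on a toy MAT (hypothetical data).
REFERENCE = "ACGTA"
PARENT = {"root": None, "n1": "root", "n2": "root",
          "A": "n1", "B": "n1", "C": "n2"}
BRANCH_MUTATIONS = {
    "root": [], "n1": ["C2G"], "n2": ["A5T"],
    "A": ["T4G"], "B": ["G3A"], "C": ["T4C"],
}

def path_to_root(node):
    """Return `node` followed by its ancestors up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def genotype(node):
    """Rebuild a node's sequence from the reference and branch mutations."""
    seq = list(REFERENCE)
    for n in reversed(path_to_root(node)):
        for mut in BRANCH_MUTATIONS[n]:
            seq[int(mut[1:-1]) - 1] = mut[-1]
    return "".join(seq)

def equally_parsimonious_placements(sample_seq):
    """Return (nodes with minimal cost, that cost) for a pruned sample."""
    cost = {n: sum(a != b for a, b in zip(genotype(n), sample_seq))
            for n in PARENT}
    best = min(cost.values())
    return sorted(n for n, c in cost.items() if c == best), best

def branch_distance(a, b):
    """Mutations on the path between two nodes (through their MRCA)."""
    pa, pb = path_to_root(a), path_to_root(b)
    shared = set(pa) & set(pb)
    return sum(len(BRANCH_MUTATIONS[n]) for n in pa + pb if n not in shared)

def neighborhood_size(placements):
    """Mutations separating the two most distant placements."""
    return max(branch_distance(a, b) for a in placements for b in placements)

if __name__ == "__main__":
    nodes, score = equally_parsimonious_placements("AGGTT")
    print(nodes, score, neighborhood_size(nodes))
```

In this toy case the query carries both C2G and A5T, so it fits equally well below either internal node; a sample with a single best placement would have a neighborhood size of 0 under this definition.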

Fig. 3

matUtils uncertainty statistics reveal low-quality sample placements. This Auspice view of an example subtree is annotated with both equally parsimonious placements (in color) and neighborhood size (branch label integers). Eighteen of our 23 samples in the subtree have a single placement and a neighborhood size of 0, indicating high placement certainty for those samples. Of the five samples with multiple equally parsimonious placements, one sample has five equally parsimonious placements with a neighborhood size score (NSS) of 19, indicating a high level of placement uncertainty for this sample spanning a relatively large neighborhood.

Introduce

Public health officials are often concerned about the number of new introductions of the virus genome in a given country or local area. To aid this analysis, matUtils introduce can calculate the association index (Wang et al. 2001) or the maximum monophyletic clade size statistic (Salemi et al. 2005; Parker et al. 2008) for arbitrary sets of samples, along with simple heuristics for approximating points of introduction into a region (supplementary methods, Supplementary Material online).
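
Both statistics can be sketched in Python on a toy tree. The association index below follows the commonly cited form AI = Σ (1 − f_i) / 2^(n_i − 1), summed over internal nodes, where n_i is the number of samples below node i and f_i is the frequency of the most common state among them; this formula and the two-state (in region / out of region) setup are assumptions of the sketch, and the introduction heuristics of matUtils introduce are not modeled. The tree and sample labels are hypothetical.

```python
# Toy association index and largest monophyletic clade (hypothetical data).
CHILDREN = {"root": ["n1", "n2"], "n1": ["A", "B"], "n2": ["C", "D"]}
IN_REGION = {"A", "B", "C"}  # samples of interest; D is from elsewhere

def tips(node):
    """Return all samples descended from `node`."""
    kids = CHILDREN.get(node, [])
    if not kids:
        return [node]
    found = []
    for kid in kids:
        found.extend(tips(kid))
    return found

def association_index(root, in_region):
    """Sum (1 - f_i) / 2**(n_i - 1) over all internal nodes."""
    total = 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        kids = CHILDREN.get(node, [])
        if not kids:
            continue  # leaves do not contribute
        stack.extend(kids)
        below = tips(node)
        n_in = sum(1 for t in below if t in in_region)
        f = max(n_in, len(below) - n_in) / len(below)
        total += (1 - f) / 2 ** (len(below) - 1)
    return total

def max_monophyletic_clade(root, in_region):
    """Size of the largest clade whose samples are all in the region."""
    best = 0
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(CHILDREN.get(node, []))
        below = tips(node)
        if all(t in in_region for t in below):
            best = max(best, len(below))
    return best

if __name__ == "__main__":
    print(association_index("root", IN_REGION))
    print(max_monophyletic_clade("root", IN_REGION))
```

Lower association index values indicate stronger clustering of the in-region samples on the tree; a value of 0 would mean that every internal node has descendants from only one state.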

matUtils Enables Rapid Analysis of a Comprehensive SARS-CoV-2 Global Tree and Its Web Interface

The matUtils toolkit is designed to scale efficiently to SARS-CoV-2 phylogenies containing millions of samples. Using matUtils, the common pandemic-relevant operations described in the earlier section can be performed on the order of seconds to minutes with the current scale of SARS-CoV-2 data (supplementary tables S2–S9, Supplementary Material online). For example, it takes only 5 s to summarize the information contained in our June 9, 2021 SARS-CoV-2 MAT of 834,521 samples and only 15 s to extract the mutation paths from the root to every sample in the MAT (supplementary table S2, Supplementary Material online). Since matUtils is primarily designed to work with the newly proposed and information-rich MAT format, it does not currently have direct counterparts in other bioinformatic software packages, but its efficiency is similar to or better than that of state-of-the-art tools that offer comparable functionality (supplementary tables S2–S9, Supplementary Material online). For example, matUtils is able to resolve polytomies in an 834,521 sample tree in 9 s, a task which takes over 37 min using ape (Paradis and Schliep 2019; supplementary table S3, Supplementary Material online). matUtils is also very memory-efficient, requiring less than 1.4 GB of main memory for most tasks, making it possible to run even on laptop devices. Certain functions of matUtils (such as extracting subtrees of provided sample names or identifiers) have also been ported to the UCSC SARS-CoV-2 Genome Browser (Fernandes et al. 2020) and are available from https://genome.ucsc.edu/cgi-bin/hgPhyloPlace (last accessed September 6, 2021). Our database and utility fill a critical need for open, public, rapid analysis of the global SARS-CoV-2 phylogeny by health departments and research groups across the world, with highly efficient file formats that do not require high-speed internet connectivity or large storage devices, and tools capable of rapidly performing large-scale analyses on laptops.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

21 in total

1. Missing the forest for the trees: phylogenetic compression and its implications for inferring complex evolutionary histories.

Authors: Cécile Ané; Michael Sanderson
Journal: Syst Biol Date: 2005-02 Impact factor: 15.683

2. Correlating viral phenotypes with phylogeny: accounting for phylogenetic uncertainty.

Authors: Joe Parker; Andrew Rambaut; Oliver G Pybus
Journal: Infect Genet Evol Date: 2007-08-21 Impact factor: 3.342

3. Stability of SARS-CoV-2 phylogenies.

Authors: Yatish Turakhia; Nicola De Maio; Bryan Thornlow; Landen Gozashti; Robert Lanfear; Conor R Walker; Angie S Hinrichs; Jason D Fernandes; Rui Borges; Greg Slodkowicz; Lukas Weilguny; David Haussler; Nick Goldman; Russell Corbett-Detig
Journal: PLoS Genet Date: 2020-11-18 Impact factor: 5.917

4. Identification of shared populations of human immunodeficiency virus type 1 infecting microglia and tissue macrophages outside the central nervous system.

Authors: T H Wang; Y K Donaldson; R P Brettle; J E Bell; P Simmonds
Journal: J Virol Date: 2001-12 Impact factor: 5.103

5. Generation and transmission of interlineage recombinants in the SARS-CoV-2 pandemic.

Authors: Ben Jackson; Maciej F Boni; Matthew J Bull; Amy Colleran; Rachel M Colquhoun; Alistair C Darby; Sam Haldenby; Verity Hill; Anita Lucaci; John T McCrone; Samuel M Nicholls; Áine O'Toole; Nicole Pacchiarini; Radoslaw Poplawski; Emily Scher; Flora Todd; Hermione J Webster; Mark Whitehead; Claudia Wierzbicki; Nicholas J Loman; Thomas R Connor; David L Robertson; Oliver G Pybus; Andrew Rambaut
Journal: Cell Date: 2021-08-17 Impact factor: 41.582

6. Phylodynamic analysis of human immunodeficiency virus type 1 in distinct brain compartments provides a model for the neuropathogenesis of AIDS.

Authors: Marco Salemi; Susanna L Lamers; Stephanie Yu; T de Oliveira; Walter M Fitch; Michael S McGrath
Journal: J Virol Date: 2005-09 Impact factor: 5.103

7. The UCSC SARS-CoV-2 Genome Browser.

Authors: Jason D Fernandes; Angie S Hinrichs; Hiram Clawson; Jairo Navarro Gonzalez; Brian T Lee; Luis R Nassar; Brian J Raney; Kate R Rosenbloom; Santrupti Nerli; Arjun A Rao; Daniel Schmelter; Alastair Fyfe; Nathan Maulding; Ann S Zweig; Todd M Lowe; Manuel Ares; Russ Corbet-Detig; W James Kent; David Haussler; Maximilian Haeussler
Journal: Nat Genet Date: 2020-10 Impact factor: 38.330

8. Phylogenetic Analyses of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) B.1.1.7 Lineage Suggest a Single Origin Followed by Multiple Exportation Events Versus Convergent Evolution.

Authors: A Chaillon; D M Smith
Journal: Clin Infect Dis Date: 2021-12-16 Impact factor: 9.079

9. Evolution of genes and genomes on the Drosophila phylogeny.

Authors: Andrew G Clark; Michael B Eisen; Douglas R Smith; Casey M Bergman; Brian Oliver; Therese A Markow; Thomas C Kaufman; Manolis Kellis; William Gelbart; Venky N Iyer; Daniel A Pollard; Timothy B Sackton; Amanda M Larracuente; Nadia D Singh; Jose P Abad; Dawn N Abt; Boris Adryan; Montserrat Aguade; Hiroshi Akashi; Wyatt W Anderson; Charles F Aquadro; David H Ardell; Roman Arguello; Carlo G Artieri; Daniel A Barbash; Daniel Barker; Paolo Barsanti; Phil Batterham; Serafim Batzoglou; Dave Begun; Arjun Bhutkar; Enrico Blanco; Stephanie A Bosak; Robert K Bradley; Adrianne D Brand; Michael R Brent; Angela N Brooks; Randall H Brown; Roger K Butlin; Corrado Caggese; Brian R Calvi; A Bernardo de Carvalho; Anat Caspi; Sergio Castrezana; Susan E Celniker; Jean L Chang; Charles Chapple; Sourav Chatterji; Asif Chinwalla; Alberto Civetta; Sandra W Clifton; Josep M Comeron; James C Costello; Jerry A Coyne; Jennifer Daub; Robert G David; Arthur L Delcher; Kim Delehaunty; Chuong B Do; Heather Ebling; Kevin Edwards; Thomas Eickbush; Jay D Evans; Alan Filipski; Sven Findeiss; Eva Freyhult; Lucinda Fulton; Robert Fulton; Ana C L Garcia; Anastasia Gardiner; David A Garfield; Barry E Garvin; Greg Gibson; Don Gilbert; Sante Gnerre; Jennifer Godfrey; Robert Good; Valer Gotea; Brenton Gravely; Anthony J Greenberg; Sam Griffiths-Jones; Samuel Gross; Roderic Guigo; Erik A Gustafson; Wilfried Haerty; Matthew W Hahn; Daniel L Halligan; Aaron L Halpern; Gillian M Halter; Mira V Han; Andreas Heger; LaDeana Hillier; Angie S Hinrichs; Ian Holmes; Roger A Hoskins; Melissa J Hubisz; Dan Hultmark; Melanie A Huntley; David B Jaffe; Santosh Jagadeeshan; William R Jeck; Justin Johnson; Corbin D Jones; William C Jordan; Gary H Karpen; Eiko Kataoka; Peter D Keightley; Pouya Kheradpour; Ewen F Kirkness; Leonardo B Koerich; Karsten Kristiansen; Dave Kudrna; Rob J Kulathinal; Sudhir Kumar; Roberta Kwok; Eric Lander; Charles H Langley; Richard Lapoint; Brian P Lazzaro; So-Jeong Lee; Lisa Levesque; 
Ruiqiang Li; Chiao-Feng Lin; Michael F Lin; Kerstin Lindblad-Toh; Ana Llopart; Manyuan Long; Lloyd Low; Elena Lozovsky; Jian Lu; Meizhong Luo; Carlos A Machado; Wojciech Makalowski; Mar Marzo; Muneo Matsuda; Luciano Matzkin; Bryant McAllister; Carolyn S McBride; Brendan McKernan; Kevin McKernan; Maria Mendez-Lago; Patrick Minx; Michael U Mollenhauer; Kristi Montooth; Stephen M Mount; Xu Mu; Eugene Myers; Barbara Negre; Stuart Newfeld; Rasmus Nielsen; Mohamed A F Noor; Patrick O'Grady; Lior Pachter; Montserrat Papaceit; Matthew J Parisi; Michael Parisi; Leopold Parts; Jakob S Pedersen; Graziano Pesole; Adam M Phillippy; Chris P Ponting; Mihai Pop; Damiano Porcelli; Jeffrey R Powell; Sonja Prohaska; Kim Pruitt; Marta Puig; Hadi Quesneville; Kristipati Ravi Ram; David Rand; Matthew D Rasmussen; Laura K Reed; Robert Reenan; Amy Reily; Karin A Remington; Tania T Rieger; Michael G Ritchie; Charles Robin; Yu-Hui Rogers; Claudia Rohde; Julio Rozas; Marc J Rubenfield; Alfredo Ruiz; Susan Russo; Steven L Salzberg; Alejandro Sanchez-Gracia; David J Saranga; Hajime Sato; Stephen W Schaeffer; Michael C Schatz; Todd Schlenke; Russell Schwartz; Carmen Segarra; Rama S Singh; Laura Sirot; Marina Sirota; Nicholas B Sisneros; Chris D Smith; Temple F Smith; John Spieth; Deborah E Stage; Alexander Stark; Wolfgang Stephan; Robert L Strausberg; Sebastian Strempel; David Sturgill; Granger Sutton; Granger G Sutton; Wei Tao; Sarah Teichmann; Yoshiko N Tobari; Yoshihiko Tomimura; Jason M Tsolas; Vera L S Valente; Eli Venter; J Craig Venter; Saverio Vicario; Filipe G Vieira; Albert J Vilella; Alfredo Villasante; Brian Walenz; Jun Wang; Marvin Wasserman; Thomas Watts; Derek Wilson; Richard K Wilson; Rod A Wing; Mariana F Wolfner; Alex Wong; Gane Ka-Shu Wong; Chung-I Wu; Gabriel Wu; Daisuke Yamamoto; Hsiao-Pei Yang; Shiaw-Pyng Yang; James A Yorke; Kiyohito Yoshida; Evgeny Zdobnov; Peili Zhang; Yu Zhang; Aleksey V Zimin; Jennifer Baldwin; Amr Abdouelleil; Jamal Abdulkadir; Adal Abebe; Brikti 
Abera; Justin Abreu; St Christophe Acer; Lynne Aftuck; Allen Alexander; Peter An; Erica Anderson; Scott Anderson; Harindra Arachi; Marc Azer; Pasang Bachantsang; Andrew Barry; Tashi Bayul; Aaron Berlin; Daniel Bessette; Toby Bloom; Jason Blye; Leonid Boguslavskiy; Claude Bonnet; Boris Boukhgalter; Imane Bourzgui; Adam Brown; Patrick Cahill; Sheridon Channer; Yama Cheshatsang; Lisa Chuda; Mieke Citroen; Alville Collymore; Patrick Cooke; Maura Costello; Katie D'Aco; Riza Daza; Georgius De Haan; Stuart DeGray; Christina DeMaso; Norbu Dhargay; Kimberly Dooley; Erin Dooley; Missole Doricent; Passang Dorje; Kunsang Dorjee; Alan Dupes; Richard Elong; Jill Falk; Abderrahim Farina; Susan Faro; Diallo Ferguson; Sheila Fisher; Chelsea D Foley; Alicia Franke; Dennis Friedrich; Loryn Gadbois; Gary Gearin; Christina R Gearin; Georgia Giannoukos; Tina Goode; Joseph Graham; Edward Grandbois; Sharleen Grewal; Kunsang Gyaltsen; Nabil Hafez; Birhane Hagos; Jennifer Hall; Charlotte Henson; Andrew Hollinger; Tracey Honan; Monika D Huard; Leanne Hughes; Brian Hurhula; M Erii Husby; Asha Kamat; Ben Kanga; Seva Kashin; Dmitry Khazanovich; Peter Kisner; Krista Lance; Marcia Lara; William Lee; Niall Lennon; Frances Letendre; Rosie LeVine; Alex Lipovsky; Xiaohong Liu; Jinlei Liu; Shangtao Liu; Tashi Lokyitsang; Yeshi Lokyitsang; Rakela Lubonja; Annie Lui; Pen MacDonald; Vasilia Magnisalis; Kebede Maru; Charles Matthews; William McCusker; Susan McDonough; Teena Mehta; James Meldrim; Louis Meneus; Oana Mihai; Atanas Mihalev; Tanya Mihova; Rachel Mittelman; Valentine Mlenga; Anna Montmayeur; Leonidas Mulrain; Adam Navidi; Jerome Naylor; Tamrat Negash; Thu Nguyen; Nga Nguyen; Robert Nicol; Choe Norbu; Nyima Norbu; Nathaniel Novod; Barry O'Neill; Sahal Osman; Eva Markiewicz; Otero L Oyono; Christopher Patti; Pema Phunkhang; Fritz Pierre; Margaret Priest; Sujaa Raghuraman; Filip Rege; Rebecca Reyes; Cecil Rise; Peter Rogov; Keenan Ross; Elizabeth Ryan; Sampath Settipalli; Terry Shea; Ngawang 
Sherpa; Lu Shi; Diana Shih; Todd Sparrow; Jessica Spaulding; John Stalker; Nicole Stange-Thomann; Sharon Stavropoulos; Catherine Stone; Christopher Strader; Senait Tesfaye; Talene Thomson; Yama Thoulutsang; Dawa Thoulutsang; Kerri Topham; Ira Topping; Tsamla Tsamla; Helen Vassiliev; Andy Vo; Tsering Wangchuk; Tsering Wangdi; Michael Weiand; Jane Wilkinson; Adam Wilson; Shailendra Yadav; Geneva Young; Qing Yu; Lisa Zembek; Danni Zhong; Andrew Zimmer; Zac Zwirko; David B Jaffe; Pablo Alvarez; Will Brockman; Jonathan Butler; CheeWhye Chin; Sante Gnerre; Manfred Grabherr; Michael Kleber; Evan Mauceli; Iain MacCallum
Journal: Nature Date: 2007-11-08 Impact factor: 49.962

10. Genomic surveillance reveals multiple introductions of SARS-CoV-2 into Northern California.

Authors: Xianding Deng; Wei Gu; Scot Federman; Louis du Plessis; Oliver G Pybus; Nuno R Faria; Candace Wang; Guixia Yu; Brian Bushnell; Chao-Yang Pan; Hugo Guevara; Alicia Sotomayor-Gonzalez; Kelsey Zorn; Allan Gopez; Venice Servellita; Elaine Hsu; Steve Miller; Trevor Bedford; Alexander L Greninger; Pavitra Roychoudhury; Lea M Starita; Michael Famulare; Helen Y Chu; Jay Shendure; Keith R Jerome; Catie Anderson; Karthik Gangavarapu; Mark Zeller; Emily Spencer; Kristian G Andersen; Duncan MacCannell; Clinton R Paden; Yan Li; Jing Zhang; Suxiang Tong; Gregory Armstrong; Scott Morrow; Matthew Willis; Bela T Matyas; Sundari Mase; Olivia Kasirye; Maggie Park; Godfred Masinde; Curtis Chan; Alexander T Yu; Shua J Chai; Elsa Villarino; Brandon Bonin; Debra A Wadford; Charles Y Chiu
Journal: Science Date: 2020-06-08 Impact factor: 47.728
