Sudhir Kumar, Qiqing Tao, Steven Weaver, Maxwell Sanderford, Marcos A Caraballo-Ortiz, Sudip Sharma, Sergei L K Pond, Sayaka Miura.
Abstract
Global sequencing of genomes of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has continued to reveal new genetic variants that are the key to unraveling its early evolutionary history and tracking its global spread over time. Here we present the heretofore cryptic mutational history and spatiotemporal dynamics of SARS-CoV-2 from an analysis of thousands of high-quality genomes. We report the likely most recent common ancestor of SARS-CoV-2, reconstructed through a novel application and advancement of computational methods initially developed to infer the mutational history of tumor cells in a patient. This progenitor genome differs from genomes of the first coronaviruses sampled in China by three variants, implying that none of the earliest patients represent the index case or gave rise to all the human infections. However, multiple coronavirus infections in China and the United States harbored the progenitor genetic fingerprint in January 2020 and later, suggesting that the progenitor was spreading worldwide months before and after the first reported cases of COVID-19 in China. Mutations of the progenitor and its offshoots have produced many dominant coronavirus strains that have spread episodically over time. Fingerprinting based on common mutations reveals that the same coronavirus lineage has dominated North America for most of the pandemic in 2020. There have been multiple replacements of predominant coronavirus strains in Europe and Asia as well as continued presence of multiple high-frequency strains in Asia and North America. We have developed a continually updating dashboard of global evolution and spatiotemporal trends of SARS-CoV-2 spread (http://sars2evo.datamonkey.org/).
Keywords: coronavirus; phylogeny; web tool
Year: 2021 PMID: 33942847 PMCID: PMC8135569 DOI: 10.1093/molbev/msab118
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Fig. 1. Counts of SNVs and genomes in the 29KG data set. (a) Cumulative count of SNVs present in the 29KG genome data set at different frequencies. (b) The number of genomes in the 29KG collection that were isolated weekly during the pandemic. (c) The number of base differences from proCoV2 (see fig. 2) for genomes sampled in December 2019 and January 2020. The 18 genomes sampled in December 2019 in China (red) have three common SNVs different from proCoV2. In contrast, six genomes sampled in January 2020 in China (Asia, red) and the United States (North America, blue) show no base differences. Multiple genomes (2 and 15) were sampled on two different days. (d) Temporal and spatial distribution of strains identical to proCoV2 at the protein sequence level, that is, they have only μ mutations. The color scheme used to mark sampling locations is shown in panel b.
Table 1. SARS-CoV-2 variants in the 29KG data set.
| Mutant (major) | Mutant (minor) | Gene | Genomic position | Nucleotide change | Amino acid change | Time (days) | Variant frequency | Genomes mapped | First location |
|---|---|---|---|---|---|---|---|---|---|
| μ1 | | ORF1ab | 2416 | U>C | | 0 | 98.1% | 0 | China, Asia |
| μ2 | | ORF1ab | 19524 | U>C | | 0 | 98.6% | 0 | China, Asia |
| μ3 | | S | 23929 | U>C | | 0 | 98.4% | 18 | China, Asia |
| α1 | | ORF1ab | 18060 | U>C | | 0 | 95.1% | 849 | China, Asia |
| | α1a | N | 28657 | C>U | | 63 | 1.3% | 2 | France, Europe |
| | α1b | ORF1ab | 9477 | U>A | F>Y | 63 | 1.2% | 3 | France, Europe |
| | α1c | N | 28863 | C>U | S>L | 63 | 1.2% | 5 | France, Europe |
| | α1d | ORF3a | 25979 | G>U | G>V | 63 | 1.2% | 344 | France, Europe |
| α2 | | ORF1ab | 8782 | U>C | | 0 | 91.0% | 47 | China, Asia |
| α3 | | ORF8 | 28144 | C>U | S>L | 0 | 90.8% | 1115 | China, Asia |
| | α3a | ORF1ab | 1606 | U>C | | 43 | 1.7% | 501 | United Kingdom, Europe |
| | α3b | ORF1ab | 11083 | G>U | L>F | 24 | 9.2% | 376 | China, Asia |
| | α3c | N | 28311 | C>U | P>L | 64 | 1.9% | 3 | South Korea, Asia |
| | α3d | ORF1ab | 13730 | C>U | A>V | 71 | 1.8% | 3 | Taiwan/Malaysia, Asia |
| | α3e | ORF1ab | 6312 | C>A | T>K | 71 | 1.7% | 483 | Taiwan/Malaysia, Asia |
| | α3f | ORF3a | 26144 | G>U | G>V | 28 | 5.1% | 121 | China, Asia |
| | α3g | ORF1ab | 14805 | C>U | | 54 | 6.0% | 334 | United Kingdom, Europe |
| | α3h | ORF1ab | 17247 | U>C | | 64 | 2.0% | 580 | Switzerland, Europe |
| | α3i | ORF1ab | 2558 | C>U | P>S | 54 | 1.7% | 26 | United Kingdom, Europe |
| | α3j | ORF1ab | 2480 | A>G | I>V | 54 | 1.6% | 462 | United Kingdom, Europe |
| β1 | | ORF1ab | 3037 | C>U | | 31 | 77.0% | 11 | China, Asia |
| β2 | | S | 23403 | A>G | D>G | 31 | 77.1% | 36 | China, Asia |
| β3 | | ORF1ab | 14408 | C>U | P>L | 41 | 76.9% | 3032 | Saudi Arabia, Middle East |
| | β3a | ORF1ab | 20268 | A>G | | 64 | 5.7% | 1213 | Italy, Europe |
| | β3b | N | 28854 | C>U | S>L | 29 | 3.1% | 527 | China, Asia |
| | β3c | ORF1ab | 15324 | C>U | | 29 | 2.3% | 678 | China, Asia |
| | β3d | ORF3a | 25429 | G>U | V>L | 77 | 1.7% | 485 | United Kingdom, Europe |
| | β3e | N | 28836 | C>U | S>L | 74 | 1.6% | 3 | Switzerland, Europe |
| | β3f | ORF1ab | 13862 | C>U | T>I | 74 | 1.6% | 50 | Switzerland, Europe |
| | β3g | ORF1ab | 10798 | C>A | D>E | 86 | 1.4% | 414 | United Kingdom, Europe |
| γ1 | | ORF3a | 25563 | G>U | Q>H | 41 | 29.8% | 884 | Saudi Arabia, Middle East |
| | γ1a | ORF1ab | 18877 | C>U | | 41 | 4.0% | 757 | Saudi Arabia, Middle East |
| | γ1b | M | 26735 | C>U | | 41 | 1.5% | 439 | Saudi Arabia, Middle East |
| δ1 | | ORF1ab | 1059 | C>U | T>I | 54 | 23.0% | 5157 | Singapore, Asia |
| | δ1a | S | 24368 | G>U | D>Y | 75 | 1.3% | 389 | Sweden, Europe |
| | δ1b | ORF8 | 27964 | C>U | S>L | 76 | 2.7% | 790 | USA, North America |
| | δ1c | ORF1ab | 11916 | C>U | S>L | 72 | 1.6% | 166 | USA, North America |
| | δ1d | ORF1ab | 18998 | C>U | A>V | 72 | 1.0% | 305 | USA, North America |
| ε1 | | N | 28881 | G>A | R>K | 54 | 25.7% | 2 | United Kingdom, Europe |
| ε2 | | N | 28882 | G>A | R>K | 54 | 25.7% | 2 | United Kingdom, Europe |
| ε3 | | N | 28883 | G>C | G>R | 54 | 25.7% | 5365 | United Kingdom, Europe |
| | ε3a | ORF1ab | 313 | C>U | | 66 | 2.1% | 608 | USA, North America |
| | ε3b | ORF1ab | 19839 | U>C | | 64 | 1.5% | 452 | Switzerland, Europe |
| | ε3c | M | 27046 | C>U | T>M | 69 | 1.6% | 453 | Worldwide |
| | ε3d | ORF1ab | 10097 | G>A | G>S | 69 | 2.5% | 5 | Denmark, Europe |
| | ε3e | S | 23731 | C>U | | 69 | 2.5% | 403 | Denmark, Europe |
| | ε3f | N | 28580 | G>U | D>Y | 69 | 1.2% | 353 | Chile, South America |
| ν1 | | ORF1ab | 17858 | A>G | Y>C | 59 | 4.7% | 32 | USA, North America |
| ν2 | | ORF1ab | 17747 | C>U | P>L | 59 | 4.7% | 1374 | USA, North America |
Amino acid changes are shown only for nonsynonymous changes.
Fig. 2. Mutational history graph of SARS-CoV-2 from the 29KG data set. Thick arrows mark the pathway of widespread variants (frequency, vf ≥ 3%), and thin arrows show paths leading to other common mutations (3% > vf > 1%). The pie-chart sizes are proportional to variant frequencies in the 29KG data set, with pie-charts shown for variants with vf > 3% and pie color based on the world’s region where that mutation was first observed. A circle is used for all other variants, with the filled color corresponding to the earliest sampling region. The COI (black font) and the BCL (blue font) of each mutation and its predecessor mutation are shown next to the arrow connecting them. Underlined BCL values mark variant pairs for which BCLs were estimated for groups of variants (see Materials and Methods) because of the episodic nature of variant accumulation within groups resulting in lower BCLs (<80%, dashed arrows). Base changes (n.) are shown for synonymous mutations, and amino acid changes (p.) are shown for nonsynonymous mutations along with the gene/protein names (“ORF” is omitted from gene name abbreviations given in table 1). More details on each mutation are presented in table 1.
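The frequency thresholds in the Fig. 2 caption can be read as a simple rule. The following is a minimal sketch, not the authors' plotting code; the function name `arrow_style` and the label strings are hypothetical, and the input is assumed to be a variant frequency in percent as listed in table 1.

```python
def arrow_style(vf_percent: float) -> str:
    """Classify a variant by frequency, following the Fig. 2 caption.

    vf >= 3%      -> thick arrow (widespread variant)
    3% > vf > 1%  -> thin arrow (other common mutation)
    otherwise     -> below the caption's stated range (label is an assumption)
    """
    if vf_percent >= 3.0:
        return "thick"
    if vf_percent > 1.0:
        return "thin"
    return "not drawn"


# Frequencies taken from table 1 (29KG data set).
examples = {"β2": 77.1, "α3b": 9.2, "α1a": 1.3}
for name, vf in examples.items():
    print(name, arrow_style(vf))
```

Table 1 frequencies are rounded to one decimal place, so a value printed as 1.0% may sit on either side of the 1% cutoff in the underlying data.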
Fig. 3. A waterfall display of genome phylogeny recapitulating the mutation history in figure 2. The numbers of genomes mapped to each node are depicted by open circles (very few genomes), open triangles (few genomes), small gray triangles (many genomes), and large black triangles (very many genomes). The actual number of genomes is given in parentheses. The tip label is the name of the mutation on the connecting branch. Green and red branches are synonymous and nonsynonymous mutations, respectively. Thick branches mark mutations that occur with a frequency >3% in the 29KG data set. The yellow background highlights the diversity of coronavirus lineages that evolved from the genomes leading to Wuhan-1 coronavirus.
Fig. 4. The backbone of SARS-CoV-2 mutational history. The mutational history was inferred from (a) the 29KG and (b) the 68KG data sets. Major variants and their mutational pathways are shown in black, and minor variants and their mutational pathways are shown in gray. Circle color marks the region where variants were sampled first. The 68KG data set contains 12 additional variants and more than twice as many genomes as the 29KG data set.
Fig. 5. Spatiotemporal dynamics of 172,480 SARS-CoV-2 genomes (December 2019–2020). Spatiotemporal patterns of genomes mapped to lineages containing different combinations of major variants in (a) Asia, (b) Europe, and (c) North America. The genomes mapped to a major variant lineage include those of all its offshoots, for example, the α lineage contains all the genomes with α1–α3, α1a–α1d, and α3a–α3j variants only. The stacked graph area is the proportion of genomes mapped to the corresponding lineage. The solid black line shows the count of total genome samples. Spatiotemporal patterns in cities, countries, and other regions are available online at http://sars2evo.datamonkey.org/ (last accessed on March 28, 2021).
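The roll-up of offshoots into major variant lineages described in the Fig. 5 caption can be illustrated with a short sketch. This is an assumption-laden illustration, not the dashboard's implementation: it assumes that a variant's lineage is given by the leading Greek letter of its name in table 1, and the names `LINEAGE_ORDER` and `lineage_label` are hypothetical. How the authors treat the μ variants when labeling lineages is not stated here, so the sketch simply keeps every letter it sees.

```python
# Greek-letter prefixes in the order they appear in table 1 (assumed ordering).
LINEAGE_ORDER = "μαβγδεν"


def lineage_label(variants):
    """Collapse variant names (e.g. "α3b") into a combined lineage label.

    Each variant contributes its leading Greek letter, so offshoots such as
    α1a or α3j roll up into the α lineage.
    """
    letters = {name[0] for name in variants}
    unknown = letters - set(LINEAGE_ORDER)
    if unknown:
        raise ValueError(f"unrecognized variant prefix(es): {sorted(unknown)}")
    return "".join(ch for ch in LINEAGE_ORDER if ch in letters)


# A genome carrying only α variants maps to the α lineage.
print(lineage_label(["α1", "α3", "α3b"]))
# A genome carrying α, β, and ε variants maps to a combined lineage.
print(lineage_label(["α1", "β2", "ε3", "ε3a"]))
```

Counting genomes per label, per region, and per sampling week would then give the kind of proportions shown in the stacked areas.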