Xiaolu Tang1, Ruochen Ying1, Xinmin Yao1, Guanghao Li2,3, Changcheng Wu1, Yiyuli Tang2, Zhida Li4, Bishan Kuang4, Feng Wu4, Changsheng Chi4, Xiaoman Du4, Yi Qin4, Shenghan Gao5, Songnian Hu5, Juncai Ma6, Tiangang Liu7, Xinghuo Pang8, Jianwei Wang9,10, Guoping Zhao11, Wenjie Tan12, Yaping Zhang2, Xuemei Lu2, Jian Lu1.
Abstract
The pandemic due to the SARS-CoV-2 virus, the etiological agent of Coronavirus Disease 2019 (COVID-19), has caused immense global disruption. With the rapid accumulation of SARS-CoV-2 genome sequences, however, thousands of genomic variants of SARS-CoV-2 are now publicly available. To improve the tracing of the viral genomes' evolution during the development of the pandemic, we analyzed single nucleotide variants (SNVs) in 121,618 high-quality SARS-CoV-2 genomes. We divided these viral genomes into two major lineages (L and S) based on variants at sites 8782 and 28144, and further divided the L lineage into two major sublineages (L1 and L2) using SNVs at sites 3037, 14408, and 23403. Subsequently, we categorized them into 130 sublineages (37 in S, 35 in L1, and 58 in L2) based on marker SNVs at 201 additional genomic sites. This lineage/sublineage designation system has a hierarchical structure and reflects the relatedness among the subclades of the major lineages. We also provide a companion website (www.covid19evolution.net) that allows users to visualize sublineage information and upload their own SARS-CoV-2 genomes for sublineage classification. Finally, we discussed the possible roles of compensatory mutations and natural selection during SARS-CoV-2's evolution. These efforts will improve our understanding of the temporal and spatial dynamics of SARS-CoV-2's genome evolution.
Keywords: COVID-19; adaptive evolution; compensatory advantageous mutation; evolutionary analysis; lineage designation
Year: 2021 PMID: 33585048 PMCID: PMC7864783 DOI: 10.1016/j.scib.2021.02.012
Source DB: PubMed Journal: Sci Bull (Beijing) ISSN: 2095-9273 Impact factor: 11.780
Fig. 1. The phylogenetic tree of 10,061 SARS-CoV-2 genomes. The phylogenetic tree was rooted with the bat coronavirus RaTG13 and GD Pangolin-CoV (the SARS-CoV-2-related viruses in Malayan pangolin samples obtained by anti-smuggling operations by the Guangdong (GD) customs). Note that S was clearly delineated from L, and L further separated into L1 and L2 major sublineages. Genomes from each lineage are colored (S: red; L1: green; L2: blue). The genomes that could not be assigned to S or L are in purple, and the L-lineage genomes that could not be assigned to L1 or L2 are in yellow. Long branches are pruned for better visualization.
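The two-level assignment described here (S versus L from sites 8782 and 28144, then L1 versus L2 from sites 3037, 14408, and 23403, with leftover "unassigned" categories) can be sketched as follows. This is a minimal illustration, not the authors' pipeline. The site numbers come from the text, but the specific bases assigned to each lineage are assumptions made for this example, as is the requirement that each genome is already aligned to reference coordinates. The helper `toy_genome` is hypothetical and exists only to build test input.

```python
# Marker sites come from the text; the ALLELES below are ASSUMPTIONS for illustration.
S_L_SITES = (8782, 28144)
L_ALLELES = ("C", "T")        # assumed state for the L lineage
S_ALLELES = ("T", "C")        # assumed state for the S lineage
L1_L2_SITES = (3037, 14408, 23403)
L1_ALLELES = ("C", "C", "A")  # assumed state for L1
L2_ALLELES = ("T", "T", "G")  # assumed state for L2


def alleles_at(genome, sites):
    """Return the bases at 1-based reference positions (U is treated as T)."""
    return tuple(genome[pos - 1].upper().replace("U", "T") for pos in sites)


def classify(genome):
    """Assign a reference-aligned genome to S, L1, L2, or an unassigned class."""
    major = alleles_at(genome, S_L_SITES)
    if major == S_ALLELES:
        return "S"
    if major != L_ALLELES:
        return "unassigned"        # neither S nor L at the two marker sites
    sub = alleles_at(genome, L1_L2_SITES)
    if sub == L1_ALLELES:
        return "L1"
    if sub == L2_ALLELES:
        return "L2"
    return "L (unassigned)"        # L lineage, but mixed or missing L1/L2 markers


def toy_genome(bases, length=30000):
    """Hypothetical helper: an all-N sequence with chosen bases at given positions."""
    seq = ["N"] * length
    for pos, base in bases.items():
        seq[pos - 1] = base
    return "".join(seq)


if __name__ == "__main__":
    example = toy_genome({8782: "C", 28144: "T", 3037: "T", 14408: "T", 23403: "G"})
    print(classify(example))
```

Requiring all marker sites to agree before assigning a lineage mirrors the separate "could not be assigned" groups in the figure; a real workflow would also need to handle alignment gaps and ambiguous bases.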
Pairwise LD analysis for the marker SNVs at sites 8782/28144 (S/L delineation) and sites 3037/14408/23403 (L1/L2 delineation).
| Pair of sites | Dataset 1 (10,061 genomes) | Dataset 2 (121,618 genomes) | Dataset 3 (202,679 genomes) |
|---|---|---|---|
| 8782, 28144 | 0.939 (1207) | 0.953 (14,874) | 0.952 (24,003) |
| 3037, 14408 | 0.965 (2492) | 0.957 (32,084) | 0.953 (51,414) |
| 3037, 23403 | 0.966 (2464) | 0.963 (32,168) | 0.958 (51,531) |
| 14408, 23403 | 0.941 (2396) | 0.947 (31,574) | 0.948 (50,951) |
LD was analyzed for the four pairs of sites in three datasets: (1) the 10,061 genomes used for the construction of the phylogenetic tree; (2) the 121,618 genomes obtained after redundancy filtering and quality control; (3) the 202,679 genomes obtained after initial quality control. The LOD value is presented in parentheses.
Fig. 2. The frequencies of haplotypes for LD pairs. The normalized frequencies of the four haplotypes (namely AB, Ab, aB, and ab; A and B are the ancestral alleles, and a and b are the derived alleles) are shown for the 202 significant LD pairs. Each dot represents the frequency of one haplotype for a pair, and the four haplotypes of an LD pair are connected with lines.
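To make the link between the four haplotype counts (AB, Ab, aB, ab) and a pairwise LD value concrete, here is a minimal sketch using the standard definitions of D, D', and r². The text does not state which LD statistic or which software the authors used, so this is only an illustration of common measures. The function name `ld_stats` and the counts in the usage lines are made up for the example.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D', r^2) for two biallelic sites from the four haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("no haplotypes observed")
    p_A = (n_AB + n_Ab) / n          # frequency of allele A at the first site
    p_B = (n_AB + n_aB) / n          # frequency of allele B at the second site
    p_AB = n_AB / n                  # frequency of the AB haplotype
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("at least one site is monomorphic; LD is undefined")
    d = p_AB - p_A * p_B             # raw disequilibrium coefficient
    # D' rescales D by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max
    r2 = d * d / denom               # squared correlation between the two sites
    return d, d_prime, r2


if __name__ == "__main__":
    # Toy counts, not data from the study.
    d, d_prime, r2 = ld_stats(40, 10, 10, 40)
    print(f"D={d:.3f}  D'={d_prime:.3f}  r2={r2:.3f}")
```

When only the AB and ab haplotypes are observed, r² equals 1, which is the pattern expected for marker pairs that define a lineage.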
The observed numbers of haplotypes for seven pairs of sites.
Global: all the 121,618 genomes were considered. For the pairs other than 8782/28144, the sizes of the haplotypes (the numbers of genomes) in a major clade are also given in parentheses. The inferred ancestral nucleotides are in black, and the derived variants are in red.
Fig. 3. The hierarchical structure of the sublineage designation system based on marker SNVs. (a) The hierarchical structure of S, L1, and L2 lineages/sublineages. (b–d) Hierarchical structures within S, L1, and L2, respectively. In (a–d), a colored triangle represents a subclade, and the width of the triangle is scaled to the number of genomes in that clade. For each sublineage, the number of genomes and its percentage in the major clade ((a) for all the genomes; (b–d) for S, L1, and L2, respectively) are given in parentheses. All the SNVs are in coding regions, and the derived alleles (nonsyn, red; syn, blue) labeled on each branch are shared by all the descendant subclades. Except for sites 8782 and 28144, the nucleotides in the reference genome at all the other 204 marker SNV sites were inferred to be the ancestral states. All the variants are given in the ancestral/position/derived format. The detailed information for the SNVs that specifically define each lineage or sublineage is given in Table S6 (online). Note that these schemes illustrate how the lineages and sublineages are defined based on the derived variants in a hierarchical manner, and they are not presented in the strict format of phylogenetic trees.
Fig. 4. The haplotype network of the 130 sublineages. The 206 marker SNVs were considered in the haplotype network analysis, and the major haplotype of each sublineage was used as the representative of that sublineage. The size of each sublineage is scaled to the number of genomes in that sublineage. The number of variants (out of 206 sites) between two neighboring sublineages is labeled (red, nonsyn; black, syn). Note that (1) although SARS-CoV-2 and RaTG13 differ by at least 1000 nucleotides at the genome level, only 206 orthologous sites were considered in the haplotype network analysis; (2) the haplotype network reflects the relative relatedness between the haplogroups but does not necessarily mean that one haplotype directly evolved from the neighboring ancestral haplotype, because some of the intermediate genomes might be missing from the genomes sequenced so far; and (3) an edge linking RaTG13 and the S7 node (distinct from S2 by the U29095 variant) was manually removed from the haplotype network because it was likely caused by a recurrent mutation at site 29095 in S7, which produced the same state as RaTG13 at the orthologous site.
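The two quantities this caption relies on, a representative (major) haplotype per sublineage and the number of differing marker sites between two sublineages, can be sketched as below. This illustrates only the counting step, not the network construction, whose algorithm the text does not specify. The function names and the three-site toy haplotypes are assumptions for the example; the actual analysis used 206 sites.

```python
from collections import Counter


def major_haplotype(haplotypes):
    """Return the most frequent marker haplotype among a sublineage's genomes."""
    if not haplotypes:
        raise ValueError("sublineage has no genomes")
    return Counter(haplotypes).most_common(1)[0][0]


def n_differences(hap_a, hap_b):
    """Count marker sites at which two equal-length haplotypes differ."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same marker sites")
    return sum(a != b for a, b in zip(hap_a, hap_b))


if __name__ == "__main__":
    # Toy data: each string holds the bases at three hypothetical marker sites.
    sublineages = {
        "X1": ["CCA", "CCA", "TCA"],
        "X2": ["TTG", "TTG", "TTG", "TCG"],
    }
    reps = {name: major_haplotype(haps) for name, haps in sublineages.items()}
    print(reps, n_differences(reps["X1"], reps["X2"]))
```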
Fig. 5. Temporal and spatial distributions of the sublineages in the whole world (a) and individual continents (b–g) based on 119,168 genomes that had detailed date information. The number of genomes was summarized in two-week intervals, and the frequency of each sublineage (S1–S10, L1a–L1j, L2a–L2g, and other sublineages) in each interval was calculated.
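The two-week binning described in this caption can be sketched with the standard library alone. The function name `biweekly_frequencies`, the choice of the bin origin (`start`), and the records in the usage lines are assumptions for this illustration; the text does not say how the authors anchored their intervals.

```python
from collections import Counter, defaultdict
from datetime import date


def biweekly_frequencies(records, start):
    """Group (collection_date, sublineage) records into 14-day bins counted from
    `start` and return {bin_index: {sublineage: frequency}}."""
    counts = defaultdict(Counter)
    for collected, sublineage in records:
        bin_index = (collected - start).days // 14
        counts[bin_index][sublineage] += 1
    freqs = {}
    for bin_index in sorted(counts):
        total = sum(counts[bin_index].values())
        freqs[bin_index] = {
            name: n / total for name, n in counts[bin_index].items()
        }
    return freqs


if __name__ == "__main__":
    # Toy records, not data from the study.
    records = [
        (date(2020, 1, 1), "S1"),
        (date(2020, 1, 10), "L1a"),
        (date(2020, 1, 15), "L2a"),
        (date(2020, 1, 20), "L2a"),
    ]
    for bin_index, freq in biweekly_frequencies(records, date(2020, 1, 1)).items():
        print(bin_index, freq)
```

Running the same function on records filtered by continent would give per-region panels analogous to (b–g).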
Fig. 6. The possible evolutionary paths for the haplotypes in five representative linkage groups (a–e). In each linkage group, the inferred ancestral nucleotides are in black, and the derived variants are in red. The haplotypes that carry the reference alleles are presented in a grey background. The numbers of the haplotypes and the possible evolutionary paths from the ancestral to derived via the transitional haplotypes in a clade are given.