Xumin Ou1,2,3, Zhishuang Yang1,3,4, Dekang Zhu1,3, Sai Mao1,3,4, Mingshu Wang1,3,4, Renyong Jia1,3,4, Shun Chen1,3,4, Mafeng Liu1,3,4, Qiao Yang1,3,4, Ying Wu1,3,4, Xinxin Zhao1,3,4, Shaqiu Zhang1,3,4, Juan Huang1,3,4, Qun Gao1,3,4, Yunya Liu1,3,4, Ling Zhang1,3,4, Maikel Peppelenbosch2, Qiuwei Pan2,5, Anchun Cheng1,3,4.
Abstract
Highly pathogenic coronaviruses, including SARS-CoV-2, SARS-CoV and MERS-CoV, are thought to be transmitted from bats to humans, but the viral genetic signatures that contribute to bat-to-human transmission remain largely obscure. In this study, we identified an identical ribosomal frameshift motif among the three bat-human pairs of viruses and strong purifying selection after the jump from bats to humans. These represent genetic signatures of coronaviruses that are related to bat-to-human transmission. To further trace the early human-to-human transmission of SARS-CoV-2 in North America, a geographically stratified genome-wide association study (North American isolates versus the remaining isolates) and a retrospective study were conducted. We determined that the single nucleotide polymorphisms (SNPs) 1,059.C > T and 25,563.G > T were significantly associated with approximately half of the North American SARS-CoV-2 isolates, which accumulated largely during March 2020. Retrospective tracing of isolates carrying these two SNPs was used to reconstruct the early, reliable transmission history of North American SARS-CoV-2; European isolates (February 26, 2020) showed transmission 3 days earlier than North American isolates and 17 days earlier than Asian isolates. Collectively, we identified the genetic signatures of the three pairs of coronaviruses and reconstructed an early transmission history of North American SARS-CoV-2. We envision that these genetic signatures are possibly diagnostic and predictive markers for public health surveillance.
Keywords: GWAS; SARS-CoV-2; genetic signatures; geographic transmission
Year: 2021 PMID: 33966351 PMCID: PMC8242746 DOI: 10.1111/tbed.14148
Source DB: PubMed Journal: Transbound Emerg Dis ISSN: 1865-1674 Impact factor: 4.521
List of top 21 hits of causative SNPs
| # | SNP | Gene | Effects | p-value | SNP frequency, North America (%) | SNP frequency, lineage B (%) | SNP frequency, lineage B.1 (%) | SNP frequency, n = 951 group (%) | SNP frequency, n = 448 group (%) |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 17,858.A > G | ORF1ab | Missense(Met5865Val) | 3.57E−117 | 194/1063(18%) | 0/818(0%) | 0/691(0%) | 0/951(0%) | 0/448(0%) |
| 4 | 17,747.C > T | ORF1ab | Synonymous | 1.43E−116 | 193/1063(18%) | 0/818(0%) | 0/691(0%) | 0/951(0%) | 4/448(1%) |
| 5 | 18,060.C > T | ORF1ab | Missense(Ser5932Phe) | 2.42E−113 | 196/1063(18%) | 1/818(0%) | 0/691(0%) | 0/951(0%) | 0/448(0%) |
| 6 | 29,553.G > A | ORF10 | Upstream | 7.26E−51 | 80/1063(8%) | 80/818(10%) | 80/691(12%) | 0/951(0%) | 30/448(7%) |
| 7 | 28,882.G > A | N | Synonymous | 6.15E−39 | 53/1063(5%) | 53/818(6%) | 53/691(8%) | 207/951(22%) | 30/448(7%) |
| 8 | 28,883.G > C | N | Missense(Gly204Arg) | 6.15E−39 | 53/1063(5%) | 53/818(6%) | 53/691(8%) | 207/951(22%) | 30/448(7%) |
| 9 | 28,881.G > A | N | Missense(Arg203Lys) | 2.77E−38 | 54/1063(5%) | 54/818(7%) | 54/691(8%) | 207/951(22%) | 87/448(19%) |
| 10 | 28,144.T > C | ORF8 | Missense(Leu84Ser) | 9.73E−33 | 245/1063(23%) | 0/818(0%) | 0/691(0%) | 49/951(5%) | 0/448(0%) |
| 11 | 27,964.C > T | ORF8 | Missense(Ser24Leu) | 2.74E−29 | 47/1063(4%) | 47/818(6%) | 47/691(7%) | 0/951(0%) | 2/448(0%) |
| 12 | 11,916.C > T | ORF1ab | Missense(Ser3884Leu) | 4.87E−29 | 51/1063(5%) | 51/818(6%) | 51/691(7%) | 0/951(0%) | 96/448(21%) |
| 13 | 8,782.C > T | ORF1ab | Synonymous | 4.03E−28 | 245/1063(23%) | 0/818(0%) | 0/691(0%) | 53/951(6%) | 9/448(2%) |
| 14 | 15,324.C > T | ORF1ab | Missense(Thr5020Ile) | 5.83E−27 | 2/1063(0.2%) | 2/818(0%) | 2/691(0%) | 80/951(8%) | 130/448(29%) |
| 15 | 11,083.G > T | ORF1ab | Missense(Leu3606Phe) | 3.99E−25 | 71/1063(7%) | 65/818(8%) | 5/691(1%) | 92/951(10%) | 2/448(0%) |
| 16 | 18,998.C > T | ORF1ab | Missense(His6245Tyr) | 1.10E−22 | 41/1063(4%) | 41/818(5%) | 41/691(6%) | 0/951(0%) | 2/448(0%) |
| 17 | 29,540.G > A | ORF10 | Upstream | 1.10E−22 | 41/1063(4%) | 41/818(5%) | 41/691(6%) | 0/951(0%) | 8/448(2%) |
| 18 | 18,877.C > T | ORF1ab | Synonymous(His6245Tyr) | 2.31E−19 | 65/1063(6%) | 63/818(8%) | 62/691(9%) | 10/951(1%) | 0/448(0%) |
| 19 | 29,711.G > T | 5'UTR | Downstream | 4.30E−19 | 31/1063(3%) | 31/818(4%) | 1/691(0%) | 0/951(0%) | 1/448(0%) |
| 20 | 1604.AATG>A | ORF1ab | Deletion(delTGA) | 2.87E−17 | 4/1063(0.4%) | 4/818(0%) | 0/691(0%) | 69/951(7%) | 2/448(0%) |
| 21 | 27,046.C > T | M | Missense(Thr175Met) | 3.48E−17 | 2/1063(0.2%) | 2/818(0%) | 2/691(0%) | 60/951(6%) | 60/448(13%) |
Bold values refer to the two SNPs used for the reconstruction of the early transmission history.
North America lineage B or B.1.
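The p-values in the table come from comparing SNP carriage between North American isolates and the remaining isolates. The record does not state which test statistic was used, so the following is only a minimal sketch of one common choice, a one-sided Fisher exact test on a 2×2 table, written in Python with the standard library. The function name `neg_log10_fisher_greater` is hypothetical. The example uses the counts for 28,144.T > C from row 10 (245/1063 in North America versus 49/951 in one comparison group). Because it compares against a single group rather than all 1,536 remaining isolates, it is not expected to reproduce the tabulated p-value.

```python
from math import comb, log10


def neg_log10_fisher_greater(a, n1, b, n2):
    """Return -log10 of a one-sided Fisher exact p-value.

    a of n1 genomes in group 1 carry the variant, b of n2 in group 2.
    The alternative hypothesis is that group 1 carries it more often.
    This is an illustrative choice of test, not the study's stated method.
    """
    carriers = a + b
    total = n1 + n2
    # Hypergeometric upper tail: ways to place >= a carriers in group 1
    # when all carriers are distributed at random over both groups.
    k_max = min(carriers, n1)
    tail = sum(
        comb(carriers, k) * comb(total - carriers, n1 - k)
        for k in range(a, k_max + 1)
    )
    denom = comb(total, n1)
    # Work with logarithms of exact integers to avoid floating-point underflow.
    return log10(denom) - log10(tail)


# Row 10 of the table: 245/1063 (North America) vs. 49/951 (one other group).
score = neg_log10_fisher_greater(245, 1063, 49, 951)
# A Manhattan plot shows this score on the Y-axis; the threshold p = 1.00E-15
# corresponds to a score of 15.
print(score > 15)
```

Taking logarithms of the exact integer counts keeps the calculation stable even when the p-value is far smaller than the smallest positive float, which is the regime of the top hits in the table.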
FIGURE 1. Genomic signatures of bat-human SARS-CoV-2, SARS-CoV and MERS-CoV pairs. For the three CoV pairs, the necessary proteins are encoded in the same order in the genome (i.e. ORF1ab-S-E-M-N). The accessory protein-encoding genes vary by locus. SARS-CoV-2 and SARS-CoV have gene insertions between the M protein and N protein-coding genes, such as ORF6, ORF7a(b) and ORF8, while the same locus has no insertion in MERS-CoV. Of note, a novel ORF10 (yellow box) of human SARS-CoV-2 is less related to any human or CoV gene. Importantly, ORF6 and ORF7a of human SARS-CoV-2 are related to the equivalents of SARS-CoV isolated from both bats and humans (lower panel). Two new hypothetical proteins (i.e. HP1 and HP2) (yellow box) of bat SARS-CoV-2 are evolutionarily close to ORF9a and ORF9b of SARS-CoV. Bat-SARS-CoV-2 #1 and #2 represent two related bat CoVs. The phylogenetic tree was constructed by the maximum likelihood method. The evolutionary distances are calculated by base differences per site. Confidence probability was estimated using the bootstrap test (100 replicates).
FIGURE 2. Evolutionary signatures of necessary protein-encoding genes of SARS-CoV-2. The differences between amino acid substitutions were used to construct phylogenetic trees (left panel). The same analysis was also performed using the differences in nucleotide substitutions (middle panel). dN/dS ratio matrices are displayed in the right panel (Tables S12–16). The sequential analysis of necessary proteins was performed from top to bottom (i.e. ORF1ab-S-E-M-N). The phylogenetic tree was constructed by the maximum likelihood method. Bat-SARS-CoV-2 #1 and #2 represent two related bat CoVs. The evolutionary distances are calculated by the differences between amino acid substitutions or nucleotide substitutions per site. Confidence probability was estimated using the bootstrap test (100 replicates).
FIGURE 3. Phylogenetic tree of early global SARS-CoV-2. The 2,599 full-genome sequences were used to construct the phylogenetic tree via the maximum likelihood (ML) method using IQ-TREE 2 software (version 2.1.2, model: GTR+Γ). Accordingly, the early SARS-CoV-2 isolates were rooted in two lineages, lineages A (n = 413) and B (n = 2,186), in which North American isolates dominated lineage B (n = 818) and sub-lineage B.1 (n = 691). The constituents of the main lineages A and B as well as lineage B.1 are displayed by three pie charts. Specifically, North American and European isolates dominate lineage B.1.
FIGURE 4. GWAS and linkage disequilibrium (LD) analysis. (a) Manhattan plot comparing the North American SARS-CoV-2 isolates (n = 1,063) to the isolates from the remaining continents (n = 1,536). Genomic coordinates are displayed along the X-axis, and -log10 of the association p-value for each SNP is displayed on the Y-axis (threshold p-value = 1.00 × 10⁻¹⁵). Different blocks indicate the different protein-encoding regions. (b) Linkage disequilibrium between SNPs in SARS-CoV-2. LD plot of any two SNP pairs among the 21 sites. The numbers near the slashes show the genomic coordinates. The colour in each square is given by the standard (D'/LOD) scheme, and the number in the square is the r² value.
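Panel (b) reports D' and r² for pairs of SNPs. As a minimal sketch of how these pairwise statistics are conventionally defined for two biallelic sites, the Python function below treats each viral genome as a single haplotype. It is not the software used for the figure, which the caption does not name; `ld_stats` and the toy genomes are hypothetical.

```python
def ld_stats(haplotypes):
    """Compute (D, D', r2) for two biallelic sites.

    haplotypes: list of (allele_site1, allele_site2) tuples, where 0 is the
    reference allele and 1 is the variant allele in one genome.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # variant frequency, site 1
    p_b = sum(b for _, b in haplotypes) / n          # variant frequency, site 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    d = p_ab - p_a * p_b                             # raw disequilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")

    r2 = d * d / denom
    # D' rescales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    return d, d_prime, r2


# Toy data (invented): the variant alleles always occur together.
linked = [(1, 1)] * 3 + [(0, 0)] * 5
# Toy data (invented): the alleles at the two sites are independent.
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]

print(ld_stats(linked))
print(ld_stats(unlinked))
```

In the table, the three adjacent SNPs at positions 28,881 to 28,883 show almost identical carrier counts in every group, which is the pattern expected for sites in near-complete linkage.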
FIGURE 5. Retrospective tracing of the early SARS-CoV-2 isolates carrying the SNPs 1,059.C > T and 25,563.G > T across all continents and in North American lineage B.1. (a) Time-dependent accumulation plot of the frequencies of the two SNPs (1,059.C > T and 25,563.G > T) by continent. The continents are labelled by different colours, and the continents of the first occurrence of the two SNPs are indicated. (b) Time-dependent accumulation plot of North American lineage B.1. Of note, the two SNPs largely and concurrently accumulated during mid to late March and occurred most frequently in isolates of North American lineage B.1 (479/691, 69.31%) (Table 1).
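The retrospective tracing behind this figure amounts to finding, per continent, the earliest isolate carrying both SNPs and then accumulating carrier counts over time. The Python sketch below illustrates that bookkeeping under stated assumptions: the record layout and the function names `first_occurrence` and `cumulative_counts` are hypothetical, and the isolate records are invented. Only the three first-occurrence dates follow the abstract (Europe on February 26, 2020, North America 3 days later, Asia 17 days later).

```python
from collections import Counter
from datetime import date


def first_occurrence(records):
    """Earliest collection date per continent among carriers of both SNPs."""
    first = {}
    for continent, day, carries_both in records:
        if carries_both and (continent not in first or day < first[continent]):
            first[continent] = day
    return first


def cumulative_counts(records, continent):
    """(date, running total) pairs of carriers for one continent, by date."""
    per_day = Counter(
        day for c, day, carries_both in records if carries_both and c == continent
    )
    running, curve = 0, []
    for day in sorted(per_day):
        running += per_day[day]
        curve.append((day, running))
    return curve


# Invented records: (continent, collection date, carries both SNPs).
records = [
    ("Europe", date(2020, 2, 26), True),
    ("Europe", date(2020, 3, 5), False),
    ("North America", date(2020, 2, 29), True),
    ("North America", date(2020, 3, 15), True),
    ("North America", date(2020, 3, 15), True),
    ("North America", date(2020, 3, 20), True),
    ("Asia", date(2020, 3, 14), True),
]

first = first_occurrence(records)
lag_na = (first["North America"] - first["Europe"]).days
lag_asia = (first["Asia"] - first["Europe"]).days
curve_na = cumulative_counts(records, "North America")
print(lag_na, lag_asia)
print(curve_na)
```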