Zhenglin Zhu, Kaiwen Meng, Geng Meng.
Abstract
To trace the evolution of coronaviruses and reveal the possible origin of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes the coronavirus disease 2019 (COVID-19), we collected and thoroughly analyzed 29,452 publicly available coronavirus genomes, including 26,312 genomes of SARS-CoV-2 strains. We observed coronavirus recombination events among different hosts including 3 independent recombination events with statistical significance between some isolates from humans, bats and pangolins. Consistent with previous records, we also detected putative recombination between strains similar or related to Bat-CoV-RaTG13 and Pangolin-CoV-2019. The putative recombination region is located inside the receptor-binding domain (RBD) of the spike glycoprotein (S protein), which may represent the origin of SARS-CoV-2. Population genetic analyses provide estimates suggesting that the putative introduced DNA within the RBD is undergoing directional evolution. This may result in the adaptation of the virus to hosts. Unsurprisingly, we found that the putative recombination region in S protein was highly diverse among strains from bats. Bats harbor numerous coronavirus subclades that frequently participate in recombination events with human coronavirus. Therefore, bats may provide a pool of genetic diversity for the origin of SARS-CoV-2.
Year: 2020 PMID: 33303849 PMCID: PMC7728743 DOI: 10.1038/s41598-020-78703-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Three putative recombination events between bat and pangolin coronaviruses.
| Position start | Position end | Major parent | Minor parent | Recombinant | RDP | GENECONV | Bootscan | Maxchi | Chimaera | SiSscan | 3Seq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 16,623 | 17,891 | Some strains similar to Pangolin-CoV-2017 (410,541) | Some strains similar to Bat-SL-CoV (MG772933) | Some ancestral strain of SARS-CoV-2, Bat-CoV-RaTG13 and Pangolin-CoV-2019 (MN908947) | 2.29E−13 | 1.43E−03 | 2.59E−11 | 3.82E−05 | 2.01E−06 | 1.26E−11 | 1.39E−08 |
| 21,187 | 22,368 | Some strain similar to SARS-CoV-2 (MN908947) | Some strains similar to Bat-SL-CoV (MG772933) | Some strains similar to Pangolin-CoV-2019 (412,860) | 6.20E−43 | 1.75E−12 | 6.52E−06 | 2.25E−14 | 7.05E−09 | 1.75E−10 | 1.26E−06 |
| 22,870 | 23,099 | Some strains similar to Bat-CoV-RaTG13 (MN996532) | Some strains similar to Pangolin-CoV-2019 (412,860) | Some strains similar to SARS-CoV-2 (MN908947) | 5.80E−14 | 1.83E−04 | 1.48E−04 | 5.02E−03 | 6.84E−04 | NS | 1.02E−11 |
‘Position’ gives the start and end coordinates of each recombinant region relative to the reference genome MN908947 (SARS-CoV-2). ‘NS’ means not significant. The major parent and minor parent are the presumed parents contributing the larger and the smaller fraction of the sequence, respectively. In each cell, the strain name is followed by a representative sequence ID in parentheses. The last seven columns list the P-values from seven statistical tests. Plots of alignments supporting these recombination events are shown in Fig. S1.
Figure 1. Verification of the three recombination events from phylogenetic trees. (A) Whole genome phylogenetic tree. (B) Phylogenetic tree built from sequences in RI_RNA_ORF1. (C) Phylogenetic tree built from sequences in RI_RNA_Boundary. (D) Phylogenetic tree built from sequences in RI_RNA_S. The trees were built using strains related to recombination and related outgroups. The names of the coronaviruses to which the strains belong are listed to the right of each phylogenetic tree. The numbers marked in red are the marginal likelihoods of the tree. The trees were built with MEGA using the Jukes-Cantor model. Phylogeny tests were performed using the bootstrap method with 5000 replicates.
Estimates of evolutionary divergence between coronavirus sequences obtained using the Tajima-Nei model.
| Species 1 | Species 2 | RI_RNA_S: Dist | RI_RNA_S: Std. Err | 5′ left (2000 bp): Dist | 5′ left (2000 bp): Std. Err | 3′ right (2000 bp): Dist | 3′ right (2000 bp): Std. Err |
|---|---|---|---|---|---|---|---|
| SARS-CoV-2 | Bat-CoV-RaTG13 | *0.341 | 0.061 | *0.061 | 0.006 | *0.058 | 0.005 |
| SARS-CoV-2 | Pangolin-CoV-2019 | *0.139 | 0.032 | *0.247 | 0.014 | *0.100 | 0.007 |
| Bat-CoV-RaTG13 | Pangolin-CoV-2019 | *0.373 | 0.067 | *0.247 | 0.014 | *0.098 | 0.007 |
| SARS-CoV-2 | Pangolin-CoV-2017 | 0.264 | 0.050 | 0.184 | 0.012 | 0.151 | 0.010 |
| Bat-CoV-RaTG13 | Pangolin-CoV-2017 | 0.364 | 0.061 | 0.185 | 0.012 | 0.158 | 0.010 |
| Pangolin-CoV-2019 | Pangolin-CoV-2017 | 0.304 | 0.053 | 0.263 | 0.015 | 0.157 | 0.010 |
| SARS-CoV-2 | Bat-SL-CoV | 0.766 | 0.119 | 0.295 | 0.015 | 0.199 | 0.012 |
| Bat-CoV-RaTG13 | Bat-SL-CoV | 0.920 | 0.153 | 0.294 | 0.014 | 0.188 | 0.012 |
| Pangolin-CoV-2019 | Bat-SL-CoV | 0.742 | 0.113 | 0.190 | 0.011 | 0.206 | 0.013 |
| Pangolin-CoV-2017 | Bat-SL-CoV | 0.817 | 0.129 | 0.298 | 0.016 | 0.203 | 0.011 |
| SARS-CoV-2 | SARS-CoV | 0.514 | 0.079 | 0.340 | 0.017 | 0.260 | 0.012 |
| Bat-CoV-RaTG13 | SARS-CoV | 0.514 | 0.080 | 0.331 | 0.017 | 0.266 | 0.012 |
| Pangolin-CoV-2019 | SARS-CoV | 0.519 | 0.076 | 0.355 | 0.018 | 0.271 | 0.013 |
| Pangolin-CoV-2017 | SARS-CoV | 0.473 | 0.076 | 0.356 | 0.017 | 0.262 | 0.013 |
| Bat-SL-CoV | SARS-CoV | 0.970 | 0.169 | 0.381 | 0.017 | 0.236 | 0.012 |
| SARS-CoV-2 | MERS-CoV | 1.341 | 0.366 | 0.786 | 0.038 | 0.806 | 0.035 |
| Bat-CoV-RaTG13 | MERS-CoV | 1.277 | 0.324 | 0.772 | 0.037 | 0.810 | 0.035 |
| Pangolin-CoV-2019 | MERS-CoV | 1.177 | 0.246 | 0.824 | 0.043 | 0.818 | 0.034 |
| Pangolin-CoV-2017 | MERS-CoV | 1.304 | 0.282 | 0.817 | 0.038 | 0.830 | 0.036 |
| Bat-SL-CoV | MERS-CoV | 1.101 | 0.202 | 0.823 | 0.038 | 0.802 | 0.032 |
| SARS-CoV | MERS-CoV | 1.196 | 0.240 | 0.862 | 0.039 | 0.789 | 0.033 |
Analyses were performed on the sequences of RI_RNA_S (from 22,870 to 23,099 bp corresponding to MN908947), the 5′ left 2000 bp region (from 20,870 to 22,869 bp) and the 3′ right 2000 bp region (from 23,099 to 25,099 bp). The coronavirus genomes being compared are SARS-CoV-2 (MN908947), Bat-CoV-RaTG13 (MN996532), Pangolin-CoV-2019 (410,721), Pangolin-CoV-2017 (410,542), Bat-SL-CoV (MG772933), SARS-CoV (NC_004718) and MERS-CoV (NC_019843). ‘Dist.’ denotes genetic distance. ‘Std. Err’ denotes the standard error estimate(s). For convenience, we underlined the presumed recombinant (SARS-CoV-2). The values between the presumed recombinant and parents are marked by ‘*’.
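As an illustration of how a Tajima-Nei distance of the kind tabulated above can be computed for one pair of aligned sequences, the following is a minimal Python sketch. It follows the standard Tajima-Nei (1984) formulation, d = -b ln(1 - p/b), and is not the authors' pipeline. The function name `tajima_nei_distance` is hypothetical, and the choice to skip columns containing gaps or ambiguous bases is an assumption. Standard errors are not computed.

```python
import math
from collections import Counter
from itertools import combinations

BASES = "ACGT"

def tajima_nei_distance(seq1, seq2):
    """Tajima-Nei distance between two aligned nucleotide sequences.

    Hypothetical helper for illustration; columns with gaps or
    ambiguous bases are skipped (an assumption, not from the source).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only columns where both sequences carry an unambiguous base.
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in BASES and b in BASES]
    n = len(pairs)
    if n == 0:
        raise ValueError("no comparable sites")
    # p: proportion of compared sites that differ.
    p = sum(a != b for a, b in pairs) / n
    if p == 0:
        return 0.0
    # g[i]: frequency of base i averaged over both sequences.
    counts = Counter(a for a, _ in pairs) + Counter(b for _, b in pairs)
    g = {base: counts[base] / (2 * n) for base in BASES}
    # x[{i, j}]: number of sites showing the unordered mismatch pair {i, j}.
    x = Counter(frozenset(pair) for pair in pairs if pair[0] != pair[1])
    c = sum((x[frozenset((i, j))] / n) ** 2 / (2 * g[i] * g[j])
            for i, j in combinations(BASES, 2)
            if g[i] > 0 and g[j] > 0)
    b = 0.5 * (1 - sum(f * f for f in g.values()) + p * p / c)
    if p >= b:
        return math.inf  # substitutions are saturated; distance undefined
    return -b * math.log(1 - p / b)
```

When base frequencies are equal and all six mismatch types are equally common, b reduces to 3/4 and the result coincides with the Jukes-Cantor distance, which gives a convenient sanity check.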
Figure 2. A sketch of the three recombination events and population genetic analysis results for RI_RNA_S. (A) Coordinate positions of the three recombinationally integrated RNA regions (indicated by orange dotted lines) in the genome of SARS-CoV-2 (MN908947), with major proteins marked. ‘a’, ‘b’ and ‘c’ refer to RI_RNA_ORF1, RI_RNA_Boundary and RI_RNA_S, respectively. Yellow represents the RBD in S protein. Red arrows with lines indicate the direction of transcription in SARS-CoV-2. (B) Diagram depicting a possible origin of SARS-CoV-2. (C) Snapshot of sliding window analysis of Fst (between coronaviruses from human and bat, human and pangolin, human and camel, human and cow as well as bat and pangolin). The region of RI_RNA_S is marked by a red rectangle. In the legend to the right, peaks at RI_RNA_S that are statistically significant (with values higher than the 0.05 threshold in the nearby region) are marked with ‘**’, and those with weak significance (with values higher than the 0.1 threshold in the nearby region) are marked with ‘*’. (D) Comparison of the distributions of Fst in RI_RNA_S (red) and the nearby region (background, blue). Each pair of distributions in RI_RNA_S and the flanking region was compared by the Wilcoxon rank sum test, and the P-value is given. Vertical dashed lines denote the 0.05 cutoff (red) and 0.1 cutoff (orange) of the background distribution. (E) Sliding window analysis of CLRs with RI_RNA_S marked by a red rectangle. The result was generated using SARS-CoV-2 strains collected in April. Red triangles denote the two CLR peaks surrounding RI_RNA_S. The two peaks are significant or weakly significant when the nearby region (from 21,000 to 25,000 bp) is used as a background, whose top 0.05 cutoff is denoted by a red dashed line and top 0.1 cutoff by an orange dashed line.
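The ‘**’ and ‘*’ marking rule in the caption above can be read as an empirical cutoff on the background distribution. The Python sketch below shows one way to express that rule, assuming the thresholds are the top 5% and top 10% of the background values. The function names `top_fraction_cutoff` and `significance_mark` are hypothetical, and the exact quantile convention used by the authors is not stated in the source.

```python
import math

def top_fraction_cutoff(background, fraction):
    """Return a cutoff such that at most `fraction` of the background
    values are strictly greater than it (empirical upper-tail cutoff)."""
    ordered = sorted(background)
    if not ordered:
        raise ValueError("background must not be empty")
    # Number of background values allowed to exceed the cutoff.
    allowed = min(math.floor(len(ordered) * fraction), len(ordered) - 1)
    return ordered[len(ordered) - 1 - allowed]

def significance_mark(peak, background):
    """Mark a peak '**' if it exceeds the top-5% background cutoff,
    '*' if it only exceeds the top-10% cutoff, and '' otherwise."""
    if peak > top_fraction_cutoff(background, 0.05):
        return "**"
    if peak > top_fraction_cutoff(background, 0.10):
        return "*"
    return ""
```

In use, `background` would hold the per-window statistic (Fst or CLR) from the nearby region and `peak` the value observed at RI_RNA_S.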
Figure 3. Evidence showing that bats may be a pool of genetic diversity. (A) Comparison of Pi in the nearby region of RI_RNA_S for coronaviruses from 7 different hosts, such as bat, human and pangolin. Pi values were calculated through a sliding window approach in the region from 21,500 to 25,000 bp according to MN908947. (B) Numbers of subclades of coronavirus in different hosts. (C) Pie chart showing the numbers of independent recombination events in different hosts. Bats harbored the highest number and are marked in red. (D) Heatmap showing the numbers of independent recombination events occurring in coronaviruses between pairs of hosts (the x and y axes). We did not consider recombination events between coronaviruses from the same host, which are marked by black squares.
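For readers unfamiliar with the sliding window calculation of Pi mentioned in panel (A), a minimal Python sketch follows. It computes nucleotide diversity as the average number of pairwise differences per site within each window. The names `nucleotide_diversity` and `sliding_pi` are hypothetical, the window and step sizes are free parameters because the source does not state them, and gaps are treated as ordinary characters for simplicity.

```python
from itertools import combinations

def nucleotide_diversity(seqs):
    """Pi: mean pairwise differences per site among aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    if len(seqs) < 2 or length == 0:
        return 0.0
    pairs = list(combinations(seqs, 2))
    # Count mismatching columns over every pair of sequences.
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / (len(pairs) * length)

def sliding_pi(seqs, window, step):
    """Yield (start, end, pi) per window; start is 0-based, end exclusive."""
    length = len(seqs[0])
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        yield start, start + window, nucleotide_diversity(chunk)
```

Running `sliding_pi` separately on the alignment for each host and plotting the resulting curves would give a comparison of the kind described in panel (A).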