Tracing genetic signatures of bat-to-human coronaviruses and early transmission of North American SARS-CoV-2.

Xumin Ou^1,2,3, Zhishuang Yang^1,3,4, Dekang Zhu^1,3, Sai Mao^1,3,4, Mingshu Wang^1,3,4, Renyong Jia^1,3,4, Shun Chen^1,3,4, Mafeng Liu^1,3,4, Qiao Yang^1,3,4, Ying Wu^1,3,4, Xinxin Zhao^1,3,4, Shaqiu Zhang^1,3,4, Juan Huang^1,3,4, Qun Gao^1,3,4, Yunya Liu^1,3,4, Ling Zhang^1,3,4, Maikel Peppelenbosch^2, Qiuwei Pan^2,5, Anchun Cheng^1,3,4.

Abstract

Highly pathogenic coronaviruses, including SARS-CoV-2, SARS-CoV and MERS-CoV, are thought to be transmitted from bats to humans, but the viral genetic signatures that contribute to bat-to-human transmission remain largely obscure. In this study, we identified an identical ribosomal frameshift motif among the three bat-human pairs of viruses and strong purifying selection after the jump from bats to humans. These features represent genetic signatures of coronaviruses that are related to bat-to-human transmission. To further trace the early human-to-human transmission of SARS-CoV-2 in North America, a geographically stratified genome-wide association study (North American isolates versus the remaining isolates) and a retrospective study were conducted. We determined that the single nucleotide polymorphisms (SNPs) 1,059.C > T and 25,563.G > T were significantly associated with approximately half of the North American SARS-CoV-2 isolates, and that these SNPs accumulated largely during March 2020. Retrospective tracing of isolates carrying these two SNPs was used to reconstruct a reliable early transmission history of North American SARS-CoV-2: European isolates (February 26, 2020) appeared 3 days earlier than North American isolates and 17 days earlier than Asian isolates. Collectively, we identified the genetic signatures of the three pairs of coronaviruses and reconstructed an early transmission history of North American SARS-CoV-2. We envision that these genetic signatures may serve as diagnostic and predictive markers for public health surveillance.

Keywords: GWAS; SARS-CoV-2; genetic signatures; geographic transmission

Year: 2021 PMID: 33966351 PMCID: PMC8242746 DOI: 10.1111/tbed.14148

Source DB: PubMed Journal: Transbound Emerg Dis ISSN: 1865-1674 Impact factor: 4.521

INTRODUCTION

Since 2003, coronaviruses (CoVs), specifically, severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2), severe acute respiratory syndrome‐related coronavirus (SARS‐CoV, 2003) and Middle East respiratory syndrome‐related coronavirus (MERS‐CoV, 2012), have caused three epidemics in human populations worldwide, including the ongoing COVID‐19 pandemic caused by SARS‐CoV‐2 (Perlman, 2020; P. Zhou et al., 2020). Interestingly, these CoVs are thought to be associated with bat CoVs; for instance, human SARS‐CoV‐2 shares 96% nucleotide identity with bat CoVs (W. Li et al., 2005; P. Zhou et al., 2020). Human‐to‐human transmission of SARS‐CoV‐2 was confirmed on 14 January 2020 (Wu, Leung, & Leung, 2020). On 11 March 2020, the World Health Organization (WHO) officially declared COVID‐19 caused by SARS‐CoV‐2 a global pandemic. As of 31 May 2020, over 5.93 million human cases had been confirmed globally, of which over 2.74 million were from America, particularly North America (Wu et al., 2020). Thus, it is very important to understand the cross‐species transmission of SARS‐CoV‐2 and its human‐to‐human transmission in America to inform public health measures (Ji et al., 2020). It is well known that viral genetic variations are associated with many aspects of virology, most notably viral infectivity and zoonotic transfer (Ou et al., 2019; J. H. Zhou et al., 2019). Identifying distinctive genetic signatures of CoVs that are common to viruses found in different host species (human and bat) and in different geographic regions (North America and the rest) may provide differential markers to support epidemiological surveillance. However, these distinctive signatures are currently unknown. SARS‐CoV‐2 mutates extensively, which hampers the tracing of its transmission (Tang et al., 2020).
Herein, we aimed to identify the genetic signatures of three pairs of bat‐to‐human CoVs as well as those of the North American SARS‐CoV‐2 isolates associated with the early human‐to‐human transmission history of COVID‐19. The whole genomic sequences of three bat–human pairs of SARS‐CoV‐2 were analysed as well as bat–human pairs of SARS‐CoV and MERS‐CoV; the latter also included a MERS‐CoV strain isolated from camel (Azhar et al., 2014). After a virus jumps to a human host, new mutations are fixed in the viral genome that may be geographically different. Identification of these fixed and common mutations remains a great challenge because of the complexity of these mutable viruses. In human population genetics studies, this type of complexity can be particularly addressed by genome‐wide association studies (GWASs) (Power et al., 2017). Therefore, we aimed to use a geographically stratified GWAS to address the complexity of global SARS‐CoV‐2 isolates. We primarily found that the genomic organization of the three human CoVs was similar to that of their paired bat CoVs within lineages and underwent strong purifying selection after jumping to the human host. For the early human‐to‐human transmission of North American SARS‐CoV‐2, we identified that two SNPs of complete linkage disequilibrium were exclusively present in more than half of the North American SARS‐CoV‐2‐dominated lineage B.1. By retrospectively tracing isolates with these two SNPs, an early transmission history of North American SARS‐CoV‐2 isolates was reconstructed.

MATERIALS AND METHODS

Data acquisition

Human SARS‐CoV‐2 isolated from Wuhan, China, was obtained from the NCBI database (GenBank No.: MN908947.3). To identify the bat CoVs phylogenetically closest to human SARS‐CoV‐2, the BLAST search tool of the NCBI viral database was used. Based on the constructed phylogenetic tree, we identified two bat SARS‐like CoVs (referred to here as bat SARS‐CoV‐2) (GenBank No: MG772933.1 and MG772934.1) that were evolutionarily close to human SARS‐CoV‐2 (Table S1) (Fig. S1). For human SARS‐CoVs and MERS‐CoVs, similar approaches were used to identify their bat CoV pairs; the camel MERS‐CoV‐related literature was also reviewed (Madani et al., 2014). For the GWAS, full‐genome sequences of global SARS‐CoV‐2 collected from 12 December 2019 to 24 April 2020 (8:00 GMT +8) were archived from the database of the GISAID Initiative EpiCoV platform (GISAID; https://www.epicov.org). A total of 8,480 sequences were archived and filtered to retain only high‐coverage genomes (> 29,000 bp, 1X coverage of genome), excluding low‐coverage sequences and sequences containing ambiguous bases (N). The pangolin isolate (EPI_ISL_410539) and the bat CoV RaTG13 isolate were also used for phylogenetic analysis. Identical sequences were then removed by CD‐HIT software (version 4.8.1, parameters: ‐aL 1 ‐aS 1 ‐c 1 ‐s 1) (Huang et al., 2010). A final set of 2,599 sequences was used in this study (data file S1).
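
The filtering and de‐duplication steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: exact string matching stands in for CD‐HIT clustering at 100% identity, and the function names, toy records and reduced length cut‐off are assumptions made for illustration.

```python
def passes_quality_filter(seq: str, min_length: int = 29_000) -> bool:
    """Keep a genome only if it is long enough and has no ambiguous 'N' bases."""
    seq = seq.upper()
    return len(seq) > min_length and "N" not in seq


def filter_and_deduplicate(records: dict, min_length: int = 29_000) -> dict:
    """Apply the quality filter, then keep the first record of each identical sequence.

    Exact matching is a simple stand-in for CD-HIT at 100% identity (assumption).
    """
    seen = set()
    kept = {}
    for name, seq in records.items():
        seq = seq.upper()
        if not passes_quality_filter(seq, min_length) or seq in seen:
            continue
        seen.add(seq)
        kept[name] = seq
    return kept


# Hypothetical toy records; the length cut-off is lowered so the example stays small.
toy = {
    "isolate_1": "ACGTACGTAC",
    "isolate_2": "ACGTACGTAC",  # exact duplicate of isolate_1
    "isolate_3": "ACGTNCGTAC",  # contains an ambiguous base
    "isolate_4": "ACGT",        # too short
    "isolate_5": "ACGTACGTAA",
}
filtered = filter_and_deduplicate(toy, min_length=8)
print(sorted(filtered))
```

With real data, the default cut‐off of 29,000 bp would be used and the records would be read from the downloaded FASTA files.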

Sequence alignment

A codon‐based ClustalW method was used for the multiple sequence alignment and identification of the ribosomal frameshift motifs among the three CoV bat–human pairs. The genomic organization of annotated CoVs was visualized by SnapGene (Version 4.2.4).

Phylogenetic analysis

For the accessory and necessary protein‐coding genes, phylogenetic trees were constructed by the maximum likelihood method. The evolutionary distances were calculated by the differences between amino acid substitutions or nucleotide substitutions per site. Confidence probability was estimated by the bootstrap test (100 replicates). The 2,599 full‐genome sequences were aligned by MAFFT software (version 7.407, algorithm: FFT‐NS‐2) (Nakamura et al., 2018). The phylogenetic tree was constructed by IQ‐TREE 2 (version 2.1.2, parameter: ‐nt ‐gtr ‐gamma) using the GTR+Γ model of nucleotide substitution (Minh et al., 2020). The phylogenetic tree was visualized by FigTree (version 1.4.4, http://tree.bio.ed.ac.uk/software/figtree/). Each descendant lineage was annotated according to criteria from recent publications (Rambaut et al., 2020).

Mutation analysis

Synonymous and non‐synonymous differences per sequence between human and bat CoVs were estimated using the Nei‐Gojobori model in MEGA‐X software (Table S2–11). The dN/dS ratio is an indicator of directional selection: a ratio above 1 implies positive (adaptive) selection, a ratio less than 1 implies negative (purifying) selection, and a ratio equal to 1 indicates no selection (neutral). The dN/dS ratio (Table S12–16) was calculated from these estimates.
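
For orientation, the Nei‐Gojobori estimators are conventionally written as shown below. This is the textbook form, given as a sketch rather than as the exact expressions used in the analysis. The symbols S_d and N_d (observed synonymous and non‐synonymous differences), S and N (numbers of synonymous and non‐synonymous sites) and ω are notation introduced here, and whether the Jukes–Cantor correction was applied is an assumption.

```latex
p_S = \frac{S_d}{S}, \qquad p_N = \frac{N_d}{N}

d_S = -\frac{3}{4}\,\ln\!\left(1 - \frac{4}{3}\,p_S\right), \qquad
d_N = -\frac{3}{4}\,\ln\!\left(1 - \frac{4}{3}\,p_N\right)

\omega = \frac{d_N}{d_S}
```

Because a coding sequence contains roughly three times as many non‐synonymous as synonymous sites, ω can be much smaller than the raw ratio of difference counts N_d/S_d.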

Codon usage bias analysis

Relative synonymous codon usage (RSCU) of CoV necessary protein‐coding genes (i.e. ORF1ab‐S‐E‐M‐N) was analysed by CODONW software (http://www.molbiol.ox.ac.uk/cu, version 1.4.2) using standard genetic codes. The linear regression of RSCU between bat–human CoV pairs was analysed by GraphPad Prism 8.0.
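
RSCU is the observed count of a codon divided by the mean count of all synonymous codons for the same amino acid. The Python sketch below illustrates that definition with the standard genetic code; it is not CODONW itself, the function name and toy sequence are assumptions, and single‐codon amino acids (Met, Trp) are simply reported as 1.0.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds: str) -> dict:
    """Return RSCU per codon for every amino acid observed in the coding sequence."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group sense codons into synonymous families.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        mean = total / len(family)
        for c in family:
            values[c] = counts[c] / mean
    return values


# Hypothetical toy sequence: Ala (GCT x3, GCC x1) and Lys (AAA x1, AAG x1).
example = rscu("GCTGCTGCTGCCAAAAAG")
print(example["GCT"], example["GCC"], example["GCA"], example["AAA"])
```

Paired RSCU vectors from a bat CoV gene and its human counterpart could then be passed to any linear regression routine to obtain the slope discussed in the Results.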

SNP calling

SNPs and INDEL polymorphisms were detected by MUMmer software (version 3.0, nucmer, show‐snps) (Kurtz et al., 2004) using the Wuhan‐Hu‐1 strain (GISAID: EPI_ISL_402125, GenBank: NC_045512.2) as a reference genome. To validate the identity of the resulting polymorphisms, raw reads (40 out of 2,599 strains, NCBI SRA database) were analysed by the bwa program (version 0.7.16a) (H. Li & Durbin, 2010) and the mpileup program of the SAMtools software (version 1.10) (H. Li, 2011). The validation was consistent with the polymorphisms detected by MUMmer software.
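
To make the SNP notation used in this paper concrete (for example, 1,059.C > T), the sketch below lists substitutions between a reference and a sample that are already aligned to the same length. It is a deliberately simplified stand‐in for the MUMmer workflow, handles substitutions only (no INDELs), and uses hypothetical toy sequences.

```python
def call_snps(reference: str, sample: str) -> list:
    """List substitutions between two pre-aligned, equal-length sequences.

    Positions are 1-based and formatted as 'position.REF > ALT'.
    Gaps and ambiguous bases are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    snps = []
    for pos, (ref, alt) in enumerate(zip(reference.upper(), sample.upper()), start=1):
        if ref != alt and ref in "ACGT" and alt in "ACGT":
            snps.append(f"{pos:,}.{ref} > {alt}")
    return snps


# Hypothetical toy alignment, not real viral sequence.
toy_reference = "ACGTACGTAC"
toy_sample = "ACTTACGTAT"
toy_snps = call_snps(toy_reference, toy_sample)
print(toy_snps)
```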

GWAS and linkage disequilibrium (LD) analysis

To identify causative SNPs in the population of North American SARS‐CoV‐2 (cases = 1,063, controls = 1,536), a geographically stratified genome‐wide association study of 5,312 mutations was performed using PLINK software (version 1.90) (Purcell et al., 2007). The empirical p‐value threshold was set to 9.41 × 10^−6 (0.05/5,312 = 9.41 × 10^−6, i.e. the significance level divided by the number of tested mutations) following multiple‐testing correction (Benjamini & Hochberg, 1995), but we further tightened the threshold to 1.00 × 10^−15 to detect the most causative SNPs. The top 21 significant SNPs are listed (Table 1), and the LD between SNP pairs was estimated and visualized by Haploview software (version 4.1) (Barrett et al., 2005).
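
The threshold arithmetic and the kind of case–control test behind each GWAS hit can be sketched in a few lines of Python. Dividing 0.05 by the number of tests is a Bonferroni‐type correction. The chi‐square statistic below uses only the North American, European and Asian counts for 25,563.G > T from Table 1, so it is illustrative and does not reproduce the PLINK p‐value, which was computed against all 1,536 control isolates. The function names are assumptions.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold: alpha divided by the number of tests."""
    return alpha / n_tests


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (1 d.f., no continuity correction).

    Table layout: a = cases with SNP, b = cases without,
                  c = controls with SNP, d = controls without.
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


threshold = bonferroni_threshold(0.05, 5312)

# 25,563.G > T counts from Table 1: North America 574/1063,
# Europe 119/951 and Asia 19/448 (a subset of the control group).
cases_with, cases_total = 574, 1063
controls_with, controls_total = 119 + 19, 951 + 448
chi2 = chi_square_2x2(
    cases_with,
    cases_total - cases_with,
    controls_with,
    controls_total - controls_with,
)
print(f"threshold = {threshold:.2e}, chi-square = {chi2:.1f}")
```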

TABLE 1

List of top 21 hits of causative SNPs

#	SNP	Gene	Effects	p‐value	North America (%)	Lineage B* (%)	Lineage B.1* (%)	Europe (%)	Asia (%)
1	25,563.G > T	ORF3a	Missense(Gln57His)	2.98E−261	574/1063(54%)	574/818(70%)	573/691(83%)	119/951(13%)	19/448(4%)
2	1,059.C > T	ORF1ab	Missense(Thr265Ile)	2.44E−212	479/1063(45%)	479/818(59%)	479/691(69%)	94/951(10%)	0/448(0%)
3	17,858.A > G	ORF1ab	Missense(Met5865Val)	3.57E−117	194/1063(18%)	0/818(0%)	0/691(0%)	0/951(0%)	0/448(0%)
4	17,747.C > T	ORF1ab	Synonymous	1.43E−116	193/1063(18%)	0/818(0%)	0/691(0%)	0/951(0%)	4/448(1%)
5	18,060.C > T	ORF1ab	Missense(Ser5932Phe)	2.42E−113	196/1063(18%)	1/818(0%)	0/691(0%)	0/951(0%)	0/448(0%)
6	29,553.G > A	ORF10	Upstream	7.26E−51	80/1063(8%)	80/818(10%)	80/691(12%)	0/951(0%)	30/448(7%)
7	28,882.G > A	N	Synonymous	6.15E−39	53/1063(5%)	53/818(6%)	53/691(8%)	207/951(22%)	30/448(7%)
8	28,883.G > C	N	Missense(Gly204Arg)	6.15E−39	53/1063(5%)	53/818(6%)	53/691(8%)	207/951(22%)	30/448(7%)
9	28,881.G > A	N	Missense(Arg203Lys)	2.77E−38	54/1063(5%)	54/818(7%)	54/691(8%)	207/951(22%)	87/448(19%)
10	28,144.T > C	ORF8	Missense(Leu84Ser)	9.73E−33	245/1063(23%)	0/818(0%)	0/691(0%)	49/951(5%)	0/448(0%)
11	27,964.C > T	ORF8	Missense(Ser24Leu)	2.74E−29	47/1063(4%)	47/818(6%)	47/691(7%)	0/951(0%)	2/448(0%)
12	11,916.C > T	ORF1ab	Missense(Ser3884Leu)	4.87E−29	51/1063(5%)	51/818(6%)	51/691(7%)	0/951(0%)	96/448(21%)
13	8,782.C > T	ORF1ab	Synonymous	4.03E−28	245/1063(23%)	0/818(0%)	0/691(0%)	53/951(6%)	9/448(2%)
14	15,324.C > T	ORF1ab	Missense(Thr5020Ile)	5.83E−27	2/1063(0.2%)	2/818(0%)	2/691(0%)	80/951(8%)	130/448(29%)
15	11,083.G > T	ORF1ab	Missense(Leu3606Phe)	3.99E−25	71/1063(7%)	65/818(8%)	5/691(1%)	92/951(10%)	2/448(0%)
16	18,998.C > T	ORF1ab	Missense(His6245Tyr)	1.10E−22	41/1063(4%)	41/818(5%)	41/691(6%)	0/951(0%)	2/448(0%)
17	29,540.G > A	ORF10	Upstream	1.10E−22	41/1063(4%)	41/818(5%)	41/691(6%)	0/951(0%)	8/448(2%)
18	18,877.C > T	ORF1ab	Synonymous(His6245Tyr)	2.31E−19	65/1063(6%)	63/818(8%)	62/691(9%)	10/951(1%)	0/448(0%)
19	29,711.G > T	3'UTR	Downstream	4.30E−19	31/1063(3%)	31/818(4%)	1/691(0%)	0/951(0%)	1/448(0%)
20	1604.AATG>A	ORF1ab	Deletion(delTGA)	2.87E−17	4/1063(0.4%)	4/818(0%)	0/691(0%)	69/951(7%)	2/448(0%)
21	27,046.C > T	M	Missense(Thr175Met)	3.48E−17	2/1063(0.2%)	2/818(0%)	2/691(0%)	60/951(6%)	60/448(13%)

Columns 6–10 give SNP frequencies. Rows 1 and 2 (bold in the original table) are the two SNPs used for the reconstruction of early transmission history.

*North American isolates within lineage B or B.1.

SNP accumulation analysis

To analyse the trend of SNP accumulation during March 2020, the frequencies of average SNP accumulation per day were counted. This same trend in 1,059.C > T and 25,563.G > T in North American SARS‐CoV‐2 and that of the other continents was traced by the date of occurrence of the two SNPs. The same analysis of these two SNPs in North American lineage B.1 was also conducted. These analyses were performed by Microsoft® Excel 2016 (data file S2).
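
The accumulation curve described above amounts to counting isolates per collection date and taking a running total. A minimal Python sketch follows; the dates are hypothetical placeholders rather than study data, and the function name is an assumption.

```python
from collections import Counter
from datetime import date
from itertools import accumulate


def cumulative_counts(collection_dates):
    """Return (date, running total) pairs for isolates sorted by collection date."""
    per_day = Counter(collection_dates)
    days = sorted(per_day)
    totals = accumulate(per_day[d] for d in days)
    return list(zip(days, totals))


# Hypothetical collection dates of isolates carrying both SNPs (not the study data).
toy_dates = [
    date(2020, 3, 10),
    date(2020, 3, 12),
    date(2020, 3, 12),
    date(2020, 3, 15),
]
curve = cumulative_counts(toy_dates)
first_seen = curve[0][0]  # earliest date on which the SNP pair was observed
print(first_seen, curve[-1][1])
```

Running the same function per continent gives one curve per region, and the first entry of each curve is the date of first occurrence used for tracing.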

Statistical analysis

The probability of rejecting the null hypothesis of strict neutrality (dN = dS) in favour of the alternative hypothesis (dN < dS) was calculated by the codon‐based Z test of purifying selection. Data from the SNP accumulation analysis were plotted by GraphPad (Version 8.2.1). The mean differences of all types of SNP accumulation per day per strain were determined by the Mann–Whitney U test (interval = 10 days) (R version 3.6.2). p values less than .05 were considered significant.

RESULTS

Genomic organization of bat–human CoV pairs

CoVs are positive single‐stranded RNA viruses with a non‐segmented genome. The genome encodes a fixed array of necessary proteins (NPs), in the order ORF1ab, spike (S) protein, envelope (E) protein, membrane glycoprotein (M) protein and nucleocapsid (N), as well as accessory proteins (APs) that differ by number and order among closely related CoVs (Figure 1). We found that the genomic organization of each bat–human CoV pair was similar, as well as that of MERS‐CoV between humans, bats and camels. For the NPs, the genomic organization among all three CoVs followed the same order (i.e. ORF1ab‐S‐E‐M‐N) (Figure 1). The loci of the APs were largely different, which was likely caused by gene recombination (Figure 1). Specifically, for SARS‐CoV‐2 and SARS‐CoV, the ORF6, ORF7 and ORF8 genes are equally inserted between the M gene and the N gene. However, for MERS‐CoV, this location has no gene insertion.

FIGURE 1

Genomic signatures of bat–human SARS‐CoV‐2, SARS‐CoV and MERS‐CoV pairs. For the three CoV pairs, the necessary proteins are encoded in the same order in the genome (i.e. ORF1ab‐S‐E‐M‐N). The accessory protein‐encoding genes vary by locus. SARS‐CoV‐2 and SARS‐CoV have gene insertions between the M protein and N protein‐coding genes, such as ORF6, ORF7a(b) and ORF8, while the same locus has no insertion in MERS‐CoV. Of note, a novel ORF10 (yellow box) of human SARS‐CoV‐2 is less related to any human or CoV gene. Importantly, ORF6 and ORF7a of human SARS‐CoV‐2 are related to the equivalents of SARS‐CoV isolated from both bats and humans (lower panel). Two new hypothetical proteins (i.e. HP1 and HP2) (yellow box) of bat SARS‐CoV‐2 are evolutionarily close to ORF9a and ORF9b of SARS‐CoV. Bat‐SARS‐CoV‐2 #1 and #2 represent two related bat CoVs. The phylogenetic tree was constructed by the maximum likelihood method. The evolutionary distances are calculated by base differences per site. Confidence probability was estimated using the bootstrap test (100 replicates)

Identical ribosomal frameshift motif between bat–human CoV pairs

For all CoVs, a programmed −1 ribosomal frameshift signal is essential, as it controls viral translation. The slippage signal is characterized by an X3Y3Z motif (X3, any three identical nucleotides; Y3, typically UUU or AAA; Z, A, C or U). We found that the slippage signal, U_UUA_AAC, was identical among SARS‐CoV‐2, SARS‐CoV and MERS‐CoV (Baranov et al., 2005) (Fig. S4). Slippage from the U_UUA_AAC frame to the UUU_AAA_C frame does not change the growing peptide, because the two tRNAs already bound at the slippery site can re‐pair with the −1 frame codons at their first two positions, with the mismatch tolerated at codon position 3. For the two flanking motifs, the 5’ attenuator hairpin and 3’ frameshift‐stimulating three‐stemmed pseudoknots are relatively conserved in both SARS‐CoV‐2 and SARS‐CoV. However, inside the second flanking motif of MERS‐CoV, an insertion of the AAT codon (encoding asparagine) was newly identified (Fig. S4). Prior research in which this slippage signal was mutated to C_CUC_AAC showed thorough inhibition of the ribosomal frameshift (Kelly et al., 2020).
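
The X3Y3Z definition given above translates directly into a pattern search. The Python sketch below scans an RNA (or DNA) string for such heptamers with a regular expression; the function name and the toy flanking sequence are assumptions, and a match on sequence alone does not establish that frameshifting occurs at that site.

```python
import re

# X3 = three identical nucleotides, Y3 = AAA or UUU, Z = A, C or U.
# The lookahead allows overlapping candidates to be reported.
SLIPPERY_SITE = re.compile(r"(?=(([ACGU])\2\2(AAA|UUU)[ACU]))")


def find_slippery_sites(rna: str) -> list:
    """Return (1-based position, heptamer) for every X3Y3Z match."""
    rna = rna.upper().replace("T", "U")
    return [(m.start() + 1, m.group(1)) for m in SLIPPERY_SITE.finditer(rna)]


# Hypothetical toy sequence containing the UUUAAAC signal discussed in the text.
toy_hits = find_slippery_sites("GCGGUUUAAACGGGUUUGCG")
print(toy_hits)
```

The second candidate in the toy sequence (GGGUUUG) is rejected because its final base is G, which the motif definition excludes.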

Strong purifying selection of CoVs

Zoonotic transfer of CoVs involves mutagenesis and directional selection (Forni et al., 2017). During the 2002–2004 epidemic, SARS‐CoV mutated extensively, which enhanced its virulence (Consortium, 2004). To measure which type of mutation is linked to CoV virulence, synonymous differences (dSs) and non‐synonymous differences (dNs) were analysed between intra‐ and extra‐branches (Figure 2). The number of dSs between CoVs of intra‐branches is much higher than that of dNs, and this pattern is similar among all NPs (i.e. ORF1ab‐S‐E‐M‐N) (Table S2–11). For instance, for ORF1ab, the number of dSs between bat and human SARS‐CoV‐2 (dS = 1875.75) is approximately four times higher than the number of dNs (dN = 441.25). The trend is the same for SARS‐CoV and MERS‐CoV. The number of dSs and dNs within intra‐branches of CoVs is smaller than the number within extra‐branches. This is consistent with the evolutionary distances of phylogenetic trees (Figure 2).

FIGURE 2

Evolutionary signatures of necessary protein‐encoding genes of SARS‐CoV‐2. The differences between amino acid substitutions were used to construct phylogenetic trees (left panel). The same analysis was also performed by the differences in nucleotide substitutions (middle panel). dN/dS ratio matrixes are displayed (Tables S12–16) (right panel). The sequential analysis of necessary proteins was performed from top to bottom (i.e. ORF1ab‐S‐E‐M‐N). The phylogenetic tree was constructed by the maximum likelihood method. Bat‐SARS‐CoV‐2 #1 and #2 represent two related bat CoVs. The evolutionary distances are calculated by the differences between amino acid substitutions or nucleotide substitutions per site. Confidence probability was estimated using the bootstrap test (100 replicates).

Because the abundant synonymous mutations do not modify the protein sequence but change the overall codon usage bias, we performed linear regression of relative synonymous codon usage (RSCU) to see whether the codon usage of bat CoVs is shifted relative to that of human CoVs. However, only a slight RSCU shift from bat to human CoVs was observed, as the slope of the linear regression was slightly smaller than 1 (Fig. S5). Collectively, synonymous mutations predominated, which may enhance SARS‐CoV‐2 virulence, as was the case for SARS‐CoV in 2003.

The dN/dS ratio is a classic indicator of directional selection: a ratio above 1 implies positive (adaptive) selection, a ratio less than 1 implies negative (purifying) selection, and a ratio equal to 1 indicates no selection (neutral) (Kryazhimskiy & Plotkin, 2008). Purifying selection therefore involves more synonymous mutations than non‐synonymous mutations. As discussed, this is true for the three bat–human CoV pairs (Table S2–11). Purifying selection primarily changes viral codon usage bias and thus can regulate viral virulence via optimization of a specific codon context (Coleman et al., 2008; Hanson & Coller, 2017).

For ORF1ab, the dN/dS ratio between human and bat SARS‐CoV‐2 is 0.05, which is much lower than that of SARS‐CoV and MERS‐CoV (Table S12–16). A similar trend was confirmed in the other NPs. This is supported by a viral culture experiment, as SARS‐CoV‐2 grows better than SARS‐CoV and MERS‐CoV in human cells (Perlman, 2020). The codon‐based Z test of selection shows that human SARS‐CoV‐2 has undergone significantly strong purifying selection (Table S12–16).

Phylogenetics of global SARS‐CoV‐2 reveals that early North American isolates dominate lineage B.1

To understand the early human‐to‐human transmission of SARS‐CoV‐2 in North America, a phylogenetic analysis of the global SARS‐CoV‐2 population (2,599 strains with high confidence) was conducted. We found that global SARS‐CoV‐2 was rooted in two lineages, lineages A (n = 413) and B (n = 2,186), in which North American isolates dominated lineage B (n = 818) and lineage B.1 (n = 691) (Figure 3 and Table 1) (Rambaut et al., 2020). Importantly, because the phylogenetic tree is inferred from accumulated mutations, the identification of key mutations can provide clues for tracing the transmission route of SARS‐CoV‐2.

FIGURE 3

Phylogenetic tree of early global SARS‐CoV‐2. The 2,599 full‐genome sequences were used to construct the phylogenetic tree via the maximum likelihood (ML) method using IQ‐TREE 2 software (version 2.1.2, model: GTR+Γ). Accordingly, the early SARS‐CoV‐2 isolates were rooted in two lineages, lineages A (n = 413) and B (n = 2,186), in which North American isolates dominated lineage B (n = 818) and sub‐lineage B.1 (n = 691). The constituents of the main lineages A and B as well as lineage B.1 are displayed by three pie charts. Specifically, North American and European isolates dominate lineage B.1

Geographic GWAS reveals SNPs associated with North American isolates

Calling key SNPs from the large number of mutations in the SARS‐CoV‐2 population requires a GWAS approach adapted from human GWASs (Power et al., 2017). Because of the complexity of the phylogenetic tree, a phylogenetically stratified GWAS may not be feasible. Therefore, a geographically stratified GWAS was carried out, as the geographic location of individual isolates was reliable. The mutation features of SARS‐CoV‐2 between continents may reflect the incidence of emergence of a given viral population in different human hosts (Rambaut et al., 2020). By using a geographically stratified GWAS comparing North American isolates (n = 1,063) with the remaining isolates (n = 1,536), we found 21 significant SNPs or small insertions/deletions (INDELs) out of 5,312 (threshold p‐value = 1.00 × 10^−15) (Figure 4a). Specifically, the top two SNPs (i.e. 1,059.C > T and 25,563.G > T) were present in approximately half of North American SARS‐CoV‐2 isolates (479/1063 = 45% and 574/1063 = 54%), particularly in North American lineage B.1 (479/691 = 69% and 573/691 = 83%) (Table 1). Interestingly, the two SNPs were in complete linkage disequilibrium, suggesting that they occurred concurrently in the North American‐dominated lineage B.1 (479/691, 69%) (Figure 4b). Importantly, the two SNPs resulted in two missense mutations (i.e. Thr265Ile and Gln57His) in ORF1ab and ORF3a, respectively.
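
The distinction between D' and r² can be illustrated with the lineage B.1 counts from Table 1. The Python sketch below is not the Haploview computation. It treats each viral genome as a haplotype and assumes that every isolate carrying 1,059.C > T also carries 25,563.G > T, which is what complete LD (D' = 1) would imply; that assumption and the function name are introduced here for illustration.

```python
def linkage_disequilibrium(n_ab: int, n_a_only: int, n_b_only: int, n_neither: int):
    """Return (D', r^2) from haplotype counts for two biallelic sites.

    n_ab      = genomes with both derived alleles
    n_a_only  = derived allele at site A only
    n_b_only  = derived allele at site B only
    n_neither = ancestral allele at both sites
    """
    n = n_ab + n_a_only + n_b_only + n_neither
    p_a = (n_ab + n_a_only) / n
    p_b = (n_ab + n_b_only) / n
    d = n_ab / n - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2


# North American lineage B.1 (n = 691): 479 carry 1,059.C > T, 573 carry 25,563.G > T.
# Assumed haplotypes: 479 with both, 0 with 1,059.C > T only,
# 573 - 479 = 94 with 25,563.G > T only, 691 - 573 = 118 with neither.
d_prime, r2 = linkage_disequilibrium(479, 0, 94, 118)
print(f"D' = {d_prime:.2f}, r^2 = {r2:.2f}")
```

Under this assumption D' equals 1 while r² stays well below 1, because the two SNPs have different frequencies; this is why an LD plot can show complete LD for a pair whose r² value is moderate.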

FIGURE 4

GWAS and linkage disequilibrium (LD) analysis. (a) Manhattan plot comparing the North American SARS‐CoV‐2 isolates (n = 1,063) to the isolates from the remaining continents (n = 1,536). Genomic coordinates are displayed along the X‐axis, and −log10 of the association p‐value for each SNP is displayed on the Y‐axis (threshold p‐value = 1.00 × 10^−15). Different blocks indicate the different protein‐encoding regions. (b) Linkage disequilibrium between SNPs in SARS‐CoV‐2. LD plot of all SNP pairs among the 21 sites. The numbers near the slashes show the genomic coordinates. The colour of each square follows the standard (D'/LOD) scheme, and the number in the square is the r^2 value.

Among these 21 SNPs, we also identified two previously reported SNPs, 8,782.C > T and 28,144.T > C (p‐value = 4.03 × 10^−28 and 9.73 × 10^−33), resulting in a synonymous mutation and a missense mutation (Leu84Ser), respectively (Tang et al., 2020) (Table 1). Interestingly, three consecutive SNP sites (28881–3.GGG>AAC) were fixed in 22% (207/951) of the European SARS‐CoV‐2 isolates, resulting in a synonymous mutation and two missense mutations (Arg203Lys and Gly204Arg) (Table 1). Tracing these three SNPs showed that the recent reemergence of COVID‐19 in the Xinfadi market in Beijing, China, was associated with European isolates (Wenjie et al., 2020).

SNP tracing reconstructed an early transmission history of North American isolates

In the North American SARS‐CoV‐2 population, 45% of strains carry these two SNPs, rising to 69% in North American lineage B.1 (Table 1). Because of the high occurrence of the two SNPs, tracing them may provide a reliable transmission route of SARS‐CoV‐2 in the major North American human population. We thus performed a retrospective tracing study in our high‐confidence data set (2,599 filtered strains) to identify the time order in which isolates carrying the two SNPs appeared on all continents and in lineage B.1. We found that the first isolate appeared in Europe (26 February 2020), 3 days earlier than the first North American isolate (29 February 2020) and 17 days earlier than the first Asian isolates (predominantly from Taiwan, China) (13 March 2020) (Figure 5a). By further tracing the accumulating frequencies per day of the two SNPs during mid to late March, we found that North American lineage B.1 accumulated these two SNPs at a high rate (Figure 5b). In addition, the mean number of all accumulating SNPs during mid to late March was significantly lower than that before or after the same period (Fig. S6). This evidence indicated that the two SNPs were strongly selected in the North American SARS‐CoV‐2 isolates, in particular lineage B.1, from mid to late March. The accumulation of the two SNPs may explain the sharp increase in confirmed cases in North America before early April, as reported by the WHO (WHO, 2020).

FIGURE 5

Retrospectively tracing the early SARS‐CoV‐2 isolates with SNPs (1,059.C > T & 25,563.G > T) of all continents and North American lineage B.1. (a) The time‐dependent accumulation plot for frequencies of the two SNPs (1,059.C > T & 25,563.G > T) between continents. The continents are labelled by different colours, and the continents of the first occurrence of the two SNPs are indicated. (b) The time‐dependent accumulation plot of North American lineage B.1. Of note, the two SNPs largely and concurrently accumulated during mid to late March and occurred most frequently in isolates of North American lineage B.1 (479/691, 69.31%) (Table 1).

SARS‐CoV‐2 is thought to have been transmitted from wildlife to humans before human‐to‐human transmission occurred (Andersen et al., 2020; Shi et al., 2020; Tang et al., 2020). We carefully checked whether bat or pangolin CoVs had acquired the two mutations before they jumped to the human species. However, we did not find either of these SNPs in the bat‐ or pangolin‐related CoVs (P. Zhou et al., 2020) (Fig. S7). Instead, in bat or pangolin CoVs, the 1,059 site carries either no variant or a C > A variant, and the 25,563 site carries a G > A variant (Fig. S7).

DISCUSSION

Herein, we identified the genetic signatures of bat‐to‐human CoVs and specified an early transmission history of North American SARS‐CoV‐2. Although human CoVs are highly similar to bat CoVs by sequence and genome organization (Perlman, 2020; P. Zhou et al., 2020), several specific genetic signatures were newly identified in this study, such as a unique ORF10 in human SARS‐CoV‐2, an identical ribosomal frameshift motif, and strong purifying selection after zoonotic transfer. In addition, we also found that the two causative SNPs that were present in approximately half of the North American SARS‐CoV‐2 isolates represented 69% of the isolates of North American lineage B.1 (Rambaut et al., 2020). The early transmission history of the major North American SARS‐CoV‐2 isolates was reconstructed by tracing the occurrence date of isolates with these two SNPs, and transmission started in Europe, North America, South America, and later Asia and Oceania. The genetic signature and its extent determine the bat‐to‐human cross‐species transmission of CoVs, which is still largely undocumented. However, distinctive genetic signatures can possibly predict and estimate the risk of zoonotic transmission. The unique ORF10 of human SARS‐CoV‐2 and the insertion of the AAT codon in the slippage signal of MERS‐CoV could serve as novel targets for differential diagnosis. Importantly, human SARS‐CoV‐2 as well as SARS‐CoV and MERS‐CoV undergo strong purifying selection. Strong purifying selection involving large synonymous mutations may promote fitness in the human system by regulating viral translation efficiency (Ou et al., 2018). Monitoring the mutation rate in particular synonymous mutations and its possible impact would help to predict the risk of zoonotic transfer of bat CoVs. Following zoonotic transfer, understanding the trend of SARS‐CoV‐2 geographic transmission is important for public measures (Centers for Disease Control and Prevention‐USA, 2020). 
This study was published as a bioRxiv preprint and gained the attention of certain medical communities, such as the Centers for Disease Control and Prevention of the USA and Washington State Department of Health (Centers for Disease Control and Prevention‐USA, 2020; Ou et al., 2020; Washington State Department of Health, 2020). Regardless of the early transmission history, the genetic signatures identified, such as the two causative SNPs, may help the development of methods for precisely tracing viral transmission in real time. These two SNPs hold great promise for epidemiological surveillance in the North American population due to their high prevalence in that population. In clinical diagnosis, these SNPs may support the development of methods that specifically detect North American isolates, such as SNP‐based allele‐specific polymerase chain reaction (ASPCR) (Corman et al., 2020; Ugozzoli & Wallace, 1991). It has been reported that SARS‐CoV‐2 with the D614G mutation in the S protein shows increased infectivity in human lung cells (Yurkovetskiy et al., 2020). The two SNPs we identified are responsible for two missense mutations that probably change the protein structure and function to some extent. Further mechanistic investigation of the functional impact of these two SNPs is warranted, as they are possibly druggable targets. The hard lesson of the ongoing SARS‐CoV‐2 pandemic is the strain it places on the global public health system (Ji et al., 2020). Before the pandemic, researchers detected the proximal origin of SARS‐CoV‐2 in bats from 2015 to 2017 (Hu et al., 2018). However, its risks to public health were largely ignored. A platform for early surveillance and risk estimation of bat CoVs is very much needed. In the future, it is hoped that these distinctive genetic signatures may help public health surveillance and measures.

ETHICAL APPROVAL

This study was approved by Sichuan Agricultural University Ethical Committee. No clinical data, animal and human material has been disclosed or used in this study.

CONFLICT OF INTEREST

The authors declare no conflict of interest.

AUTHOR CONTRIBUTIONS

X.O., Q.P. and A.C. devised the project and the main conceptual ideas; X.O., Z.Y., D.Z., S.M. and M.W. acquired the data; X.O. and Z.Y. designed and performed the computations; X.O., Z.Y., D.Z., S.M., R.J., S.C., M.L. and Q.Y. analysed and interpreted the data; X.O. and Z.Y. drafted the manuscript; Y.W., X.Z., S.Z., J.H., Q.G., Y.L., L.Z., M.P. and Q.P. proofread the draft; and all authors approved the final version of the manuscript.

REFERENCES

1. Molecular evolution of the SARS coronavirus during the course of the SARS epidemic in China.

Authors:
Journal: Science Date: 2004-01-29 Impact factor: 47.728

2. Virus attenuation by genome-scale changes in codon pair bias.

Authors: J Robert Coleman; Dimitris Papamichail; Steven Skiena; Bruce Futcher; Eckard Wimmer; Steffen Mueller
Journal: Science Date: 2008-06-27 Impact factor: 47.728

3. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data.

Authors: Heng Li
Journal: Bioinformatics Date: 2011-09-08 Impact factor: 6.937

Review 4. Codon optimality, bias and usage in translation and mRNA decay.

Authors: Gavin Hanson; Jeff Coller
Journal: Nat Rev Mol Cell Biol Date: 2017-10-11 Impact factor: 94.444

5. Programmed ribosomal frameshifting in decoding the SARS-CoV genome.

Authors: Pavel V Baranov; Clark M Henderson; Christine B Anderson; Raymond F Gesteland; John F Atkins; Michael T Howard
Journal: Virology Date: 2005-02-20 Impact factor: 3.616

6. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR.

Authors: Victor M Corman; Olfert Landt; Marco Kaiser; Richard Molenkamp; Adam Meijer; Daniel Kw Chu; Tobias Bleicker; Sebastian Brünink; Julia Schneider; Marie Luisa Schmidt; Daphne Gjc Mulders; Bart L Haagmans; Bas van der Veer; Sharon van den Brink; Lisa Wijsman; Gabriel Goderski; Jean-Louis Romette; Joanna Ellis; Maria Zambon; Malik Peiris; Herman Goossens; Chantal Reusken; Marion Pg Koopmans; Christian Drosten
Journal: Euro Surveill Date: 2020-01

7. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology.

Authors: Andrew Rambaut; Edward C Holmes; Áine O'Toole; Verity Hill; John T McCrone; Christopher Ruis; Louis du Plessis; Oliver G Pybus
Journal: Nat Microbiol Date: 2020-07-15 Impact factor: 17.745

8. CD-HIT Suite: a web server for clustering and comparing biological sequences.

Authors: Ying Huang; Beifang Niu; Ying Gao; Limin Fu; Weizhong Li
Journal: Bioinformatics Date: 2010-01-06 Impact factor: 6.937

9. Another Decade, Another Coronavirus.

Authors: Stanley Perlman
Journal: N Engl J Med Date: 2020-01-24 Impact factor: 91.245

10. Structural and Functional Analysis of the D614G SARS-CoV-2 Spike Protein Variant.

Authors: Leonid Yurkovetskiy; Xue Wang; Kristen E Pascal; Christopher Tomkins-Tinch; Thomas P Nyalile; Yetao Wang; Alina Baum; William E Diehl; Ann Dauphin; Claudia Carbone; Kristen Veinotte; Shawn B Egri; Stephen F Schaffner; Jacob E Lemieux; James B Munro; Ashique Rafique; Abhi Barve; Pardis C Sabeti; Christos A Kyratsous; Natalya V Dudkina; Kuang Shen; Jeremy Luban
Journal: Cell Date: 2020-09-15 Impact factor: 66.850
