Literature DB >> 34292617

Evolution trace of SARS-CoV-2 from January 19 to March 12, 2020, in the United States.

Ziying Lin¹, Hua Qing¹, Rui Li¹, Lei Zheng², Huipeng Yao¹.

Abstract

As a kind of human betacoronavirus, SARS-CoV-2 has endangered globally public health. As of January 2021, the virus had resulted in 2,209,195 deaths. By studying the evolution trend and characteristics of 265 SARS-CoV-2 strains in the United States from January to March, it is found that the strains can be divided into six clades, USA clade-1, USA clade-2, USA clade-3, USA clade-4, USA clade-5, and USA clade-6, in which US clade-1 may be the most ancestral clade, USA clade-2 is an interim clade of USA clade-1 and USA clade-3, the other three clades have similar codon usage pattern, while USA clade-6 is the newest and most adaptable clade. Mismatch analysis and protein alignment showed that the evolution of the clades arises from some special mutations in viral proteins, which may help the strain to invade, replicate, transcribe and so on. Compared with previous research and classifications, we suggest that clade O in GISAID should not be an independent clade and Wuhan-Hu-1 (EPI_ISL_402125) should not be an ancestral reference sequence. Our study decoded the evolutionary dynamic of SARS-CoV-2 in the early stage from the United States, which give some clues to infer the current evolution trend of SARS-CoV-2 and study the function of viral mutational protein.

Entities: Chemical

Keywords: SARS-CoV-2; clade; coronavirus; evolution

Mesh：

Year: 2021 PMID： 34292617 PMCID： PMC8426869 DOI： 10.1002/jmv.27225

Source DB: PubMed Journal: J Med Virol ISSN： 0146-6615 Impact factor: 20.693

INTRODUCTION

SARS‐CoV‐2, a member of the genus betacoronavirus that infects humans, which can not only cause a series of mild symptom such as high inflammation and microangiopathy, but also develop into severe symptoms like extensive thrombosis and severe acute respiratory syndrome, endangering patient's life. Since it first identified in December 2019 in Wuhan, China, the virus has spread globally, resulting in the currently ongoing COVID‐19 pandemic. As of January 31, 2021, there have been 102,083,344 confirmed cases of COVID‐19 with 2,209,195 deaths globally, and 25,676,612 confirmed cases with 433,173 deaths in the United States, reported to WHO (https://covid19.who.int/). SARS‐CoV‐2 genome is an about 30 kb positive‐sense single‐stranded RNA and firstly published in January 2020, which is made up of 14 open reading frames (ORFs), ORF1a, ORF1b, spike (S), ORF3a, envelope (E), membrane (M), ORF6, ORF7a, ORF7b, ORF8, nucleocapsid (N), ORF10, ORF9, and ORF14, encoding corresponding proteins. The 14 proteins are classified into three kinds, polyproteins, main structural proteins and accessory proteins. SARS‐CoV‐2 has two kinds of polyproteins, polyprotein 1a and polyprotein 1b which are encoded by two overlapping ORFs, ORF1a and ORF1b, occupying 2/3 of its genome. , Polyprotein 1a is proceed into 10 nonstructural proteins (Nsps), Nsp1 to Nsp10 by viral protease Nsp5 and the papain‐like protease domain from Nsp3. Nsp1 can interfere with the host innate immune response, by lowing the expression of some immune protein factors after binding to the 40S ribosomal subunit and destroying its messenger RNA (mRNA). Nsp3 hydrolyze the sequence LXGG↓X, resulting in producing Nsp1, Nsp2, Nsp3, from ORF1a and blocking the activation of innate immune by deubiquitinating and deISG15ylating. , , , Through the capacity, after binding to ADP‐ribose, Nsp3 removes the chemical group from the ADP‐ribosylated proteins in immune response. , , In addition, Nsp3 can induce double‐membrane vesicles outside viral replication‐transcriptional complexes (RTC) with Nsp4 and Nsp6. Nsp5, as the main protease, can hydrolyze the sequence (L/V/F)Q↓(S/A/G), catalyzing polyprotein 1ab to produce Nsp4 to Nsp10 and Nsp12 to Nsp16, 12 proteins. Nsp7 and Nsp8 form a heterodimer which binds to RNA dependent RNA polymerase, stabilizing the enzyme polymerase domain, increasing the affinity of RNA binding to the enzyme significantly and enhancing the catalytic activity. , , Nsp9 is a kind of single stranded RNA‐binding protein, which may influence the virulence of the virus. Nsp10 assists methyltransferase to form a methylation complex which caps the virus mRNA and enhances the enzyme activity. Nsp12 is a kind of RNA dependent RNA polymerase (RdRp), which catalyze the replication and transcription of virus genome after forming RTC with other Nsps. Beside, with Nsp8, it can regulate the activity of helicase. Nsp13 is a kind of helicase, participating in virus replication and preservation in the life activity of virus. Nsp14 has a N‐terminal exonuclease domain and a C terminal guanine‐N7 methyl transferase domain, in which, the former is responsible for proofreading new error nucleotides and the latter is involved in capping the virus RNA from the degradation of the host immune. , , Nsp15 is a kind of ribonucleic acid endonuclease, which can prevent the host from detecting the virus double stranded RNA to escape the attack of the immune system. Nsp16 is a specific methyl transferase, which catalyzes 2'‐O‐methylation of the first nucleotide in the viral capped RNA to be protected from the degradation of host innate immune response. , SARS‐CoV‐2 has spike (S), membrane (M), envelope (E), nucleocapsid (N), four kinds of Nsps. S protein helps the virus enter host cells by binding to human ACE2 receptors, , , and induces inflammatory response after recognized by TLR4. N protein is responsible for packaging viral RNA into helical ribonucleocapsid, forming viral nucleocapsid structure with M protein. , , E may play an important role in virus maturation, transmission and reproduction. Besides, the virus has several accessory proteins, such as: ORF3a, ORF6, ORF7a, ORF7b, ORF8, ORF9, and ORF10. As a transmembrane protein, ORF3a is related to virion release and viral pathogenicity. , ORF7a may play a role in protein transport mediated by endoplasmic reticulum and Golgi complex. The exogenous overexpression of ORF8 in cells can destroy IFN‐I signal from the host. ORF9b, as a part of N protein, inhibits host immune response. SARS‐CoV‐2 mutates rapidly, producing a large number of lineages or clades by different methods. As of February 12, 2020, Yu et al. found that SARS‐CoV‐2 is classified into five groups including 58 haplotypes, in which, H13 and H35 were ancestral haplotypes and H1 (which from the Hua Nan market) was derived from the H3 haplotype. Two months later, Peter et al. proposed an interesting viewpoint that the virus should be divided into three major variants, Types A, B, and C, in which Type B appear in East Asia while other types are mainly distributed outside of East Asia but Type A was the ancestor type. By using 64 whole genome sequences from December 30, 2019, to March 9, 2020, in Europe, North America, South America, and Southeast Asia, Maías et al. suggests that SARS‐CoV‐2 is divided into three genetic clades cocirculating in all over the world and the most recent common ancestor may appear in around November 1, 2019. On March 31, 2020, Jennifer et al. firstly classified 10,959 viral isolates into lineages A and B, in which the genomes of lineage A is characterized by 8782T and 28144C, while lineage B by 8782C and 28144T. Each sublineages is designated or defined by an additional unique mutations and can further diversify into sublineages. On May 2, Priyanka et al. found that SARS‐CoV‐2 can be classified into two groups, G1 and G2, in which the former is restricted to moderate warm climate and the later start to spread neighboring cold climate and hot climate of the tropics. Houriiyah et al. had identified 16 new lineages, including B.1.1.54, B.1.1.56, and C.1 spreading widely in South Africa after analyzing 1365 whole genomes of SARS‐CoV‐2 isolate between March 6 and August 26, 2020. BII/GIS from A*STAR Singapore analyzed all SARS‐CoV‐2 sequences as of January 22, 2021 and found that 373,805 full genomes excluding 21,358 low coverage entries can be classified into S, L, V, G, GH, GR, GV, 7 clades, in which clade S is ancestral and clade O is lacking. Changtai et al. identified 13 variation sites in ORF1a, ORF1b, S, ORF3a, M, ORF8, and N regions, on February 14, 2020. Maria et al. found eight new mutations at positions 1397, 2891, 14,408, 17,746, 17,857, 18,060, 23,403, and 28,881 in the viral genomic sequences worldwide from December 2019 to mid‐March 2020. As of April 19, 2020, Van Dorpan Dorp et al. found that many recurrent mutations occurred in S protein, Nsp6, Nsp11, and Nsp3 and nonsynonymous mutations account for nearly 80%. Takahiko et al. found that C3037T synonymous sites in genome, P4715L in ORF1ab and D614G in S protein were the most frequent mutations, after aligning 10,022 SARS CoV‐2 genomes between February 1, and May 1, 2020. On May 7, 2020, Yujiro et al. found that P4715L of ORF1ab and D614G of S protein are linkage and lethality. In July 2020, Domenico et al. found two mutations at Nsp6 position 37 and ORF 10 position 3 or 4, reducing the two proteins structure stability. In July 2020, Phan found three mutations (N354D, D364Y and V367F) on the surface of S protein, which may change its conformation, resulting in changes in antigenicity of the virus. On October 2, 2020, Sarmilah et al. found that for S protein, the N501Y mutation was more infectious than the D614G mutation. On December 18, 2020, Yixuan et al. found that after the mutation D614G in S protein, the variant has more effective infection, replication and competitive adaptability in human primary airway epithelial cells. In this study, we firstly try to use all ORFs combined sequence to analyze the phylogenetical relation of the strains in the United States from January 19 to March 12, 2020. Then, to test our phylogenetic tree we ran a series of follow‐up analysis. We calculate each ORF nucleotide substitution rates, analyze its codon usage pattern and population expansion and align each corresponding protein sequences. Finally, we compare our phylogenetical result with others, to reveal the evolution dynamics of SARS‐CoV‐2 in early stage of US epidemic outbreak, assess the characteristics of its classification and understand the trend of the epidemic.

MATERIALS AND METHODS

Data acquisition

On May 12, 2020, flagged as “complete (>29,000 bp)” and “high coverage,” 683 SARS‐CoV‐2 genomes from January 19 to March 12 in the United States were downloaded from the GISAID Initiative EpiCoV platform. Filtering any sequences with N, W, and other missing site produces a final data set of 265 genomes. In addition, two reference ancestral sequences, hCoV‐19/bat/Yunnan/RaTG13/2013 and hCoV‐19/Wuhan‐Hu‐1/2019, are also downloaded from the GISAID. All voucher information of 267 strains can be found in Table S1.

ORFs finding and combining

With the aid of the reference sequence, hCoV‐19/Wuhan‐Hu‐1/2019, the tool of ORFfinder in linux x64 is used to look for the meaning ORFs from each genome. Because ORF9 gene is a part of the nucleocapsid gene, and the sequence and function of ORF14 gene can not been found in NCBI and GISAID, 12 ORFs or genes from all genomes are screened. According to the order of the genes on the genome, 12 ORFs sequences were spliced into a large assembly sequence with end to end mode, producing 267 assemblies.

Maximum likelihood (ML) tree and Bayesian phylogenetic tree

Taking the assembly sequence of hCoV‐19/bat/Yunnan/RaTG13/2013 as the reference sequence, the other 266 assemblies is used to build a phylogenetic tree by the software of MEGA‐X with the ML statistical method (Model, Tamura‐Nei; Bootstrap, 500). Bayesian phylogenetic analysis can be used to infer the divergency time of each clade, further determining the virus evolutionary trend. The 266 assemblies are used to build a Bayesian phylogenetic tree by using BEAST 2 software with GTR model. The clock model chooses Relaxed Clock Log Normal and the chain length of MCMC is 100,000,000. The priors model use the Coalescent Bayesian Skyline, and statistical uncertainty in the data was reflected by the 95% highest posterior density values. Using the TRACER program v1.7.1, it is found that one hundred million generations were produced after a burn‐in of 10 million steps, assessed by effective sample sizes (ESS) over 200. The tree was processed by TreeAnnotator v1.10.4 and viewed in program FigTree v1.4.4.

Mismatch distribution

To characterize of the virus gene evolution, DNAsp is used to analyze mismatch distribution of each gene and assemblies of all SARS‐CoV‐2 strains in the United States. Mismatch distribution is a way to visually reflect the historical dynamics of the population. If the mismatch curve shows a unimodal Poisson distribution, it is generally accepted that the population size has experienced expansion or continuous growth. On the other side, if the curve coincides with the expectation curve, the population size remains stable in the past.

Protein alignment and nucleotide substitution rates calculation

Twelve viral proteins of all strains are aligned to find out the common and different characteristics of each clade produced by the phylogenetic tree. For the same purpose, we calculate Ka (synonymous substitution rate), Ks (nonsynonymous substitution rate) and ω (the ratio of nonsynonymous substitution rate to synonymous substitution rate) of different genes in each clade of 265 virus strains in United States by MEGA‐X (model, Nei‐Gojobori).

Principal component analysis (PCA)

To determine the codon usage pattern of SARS‐CoV‐2 in different clades, CodonW 1.4.2 was used to calculate the values of relative synonymous codon usage (RSCU) of 265 assemblies. Basing on the RSCU, by the software of SPSSv21.0, we calculate axis 1 and axis 2 infecting the codon usage pattern which are plotted by Excel 2010 according to axis 1 as horizontal axis, the second axis as vertical axis.

RESULTS

Establishment of clades and estimation of evolutionary time

ML tree and the Bayesian phylogenetic tree were used to classify 265 SARS‐CoV‐2 strains in the United States and estimate the evolutionary time of each clade. According to Figure 1, it is seen that the strains are divided into six main independent clades (USA clade‐1 to USA clade‐6), among which Wuhan‐Hu‐1 belongs to USA clade‐3, indicating it is not appropriate to be as the reference ancestor sequence in many papers. , , , , In the phylogenetic tree, it is found that USA clade‐1 is the closest to the ancestors (RaTG13) in kinship, followed by USA clade‐2, USA clade‐3, USA clade‐4, USA clade‐5, and USA clade‐6. In addition, as the largest clade, USA clade‐6 includes 76 virus strains, followed by USA clade‐1 (53), USA clade‐3 (46), USA clade‐5 (40), USA clade‐4 (30), and USA clade‐2 (20). It seem to indicate that compared with USA clade‐2, the youngest clade (USA clade‐6) and the oldest clade (USA clade‐1) are more suitable to United States at the time. The virus large population expansion may occur in the initial and later stage of the its evolution and the virus mutated and evolved very rapidly.

Figure 1

Phylogenetic trees of 265 SARS‐CoV‐2 strains in the United States from January 19 to March 12, 2020, which is rooted by RaTG13 and compared with Wuhan‐Hu‐1. The yellow colour represent ancestor RaTG13, red represent USA clade‐1, green represent USA clade‐2, orange represent USA clade‐3, blue represent USA clade‐4, pink represent USA clade‐5, purple represent USA clade‐6, and grey represent Wuhan‐Hu‐1. The details of each clade are shown in Figure S1 Bayesian phylogenetic tree is showed as Figure 2, in which it is found that about 80 days ago, the common ancestor of all virus clades may begin to appear on December 24, 2019. About 11 days later, the ancestors of both USA clade‐1 and USA clade‐2 start to appear, but USA clade‐1 seem to diverge from USA clade‐2 on January 16, 2020. In the other aspect, USA clade‐3 emerges around January 13, 2020 and it is also the appearance date of the common ancestor of USA clade‐4 to USA clade‐6, which is diverged into USA clade‐4 on February 22, USA clade‐5 on January 25 and USA clade‐6 on February 26. The details of each clade are in Figure S2.

Figure 2

Bayesian phylogenetic tree of 265 SARS‐CoV‐2 strains in the United States from January 19 to March 12, 2020, showing the divergency time of six clades. The red colour represent USA clade‐1, green represent USA clade‐2, orange represent USA clade‐3, blue represent USA clade‐4, pale pink represent USA clade‐5 and USA clade‐6 use purple to represent. The details of each clade are shown in Figure S2

Population expansion analysis

Mismatch distribution can be used to seek for the trace of virus population expansion. The mismatch distribution of the assemblies and different genes are shown as Figures 3 and S4. From Figure 3, it is known the mismatch distribution curve shows a multipeak state, suggesting that the assemblies used to have experienced several dominant mutations in SARS‐CoV‐2 evolution, which may help itself better adapt to the environment at that time. Combining Figures S4A, S4B, S4C, and S4D in Figure S4, it is seen that the mismatch distribution curves of ORF1a, ORF1b, S and ORF3a show a pattern of unimodal Poisson distribution, which means that the four genes had experienced a rapid variation expansion in the past and the genes mutants have been significantly increasing. From protein analysis in Tables 1 and S2, it is known that the ORF1a mutation T265I may play an important role in the evolution from USA clade‐5 to USA clade‐6, the same as the ORF1b mutation P4715L from USA clade‐3 to USA clade‐4, S mutation D614G from USA clade‐3 to USA clade‐4 and the ORF3a mutation Q57H in from USA clade‐4 to USA clade‐5. Linked to the function of the proteins, it is speculated that the four mutations proteins may help SARS‐CoV‐2 easily bind to the host and beneficially proliferate themselves. On the contrary, from the other figures, E to K in Figure S4, it is known that mismatch distribution curves of other genes are similar to their expectation curves, indicating that these genes are relatively conservative in the evolution of SARS‐CoV‐2.

Figure 3

Mismatch distribution analyses of assemblies sequences alignment of 265 SARS‐CoV‐2 strains

Table 1

Characteristics of amino acids mutations of different genes among six clades of 265 SARS‐CoV‐2 strains in the United States from January 19 to March 12, 2020

	Mutation sites
Gene	ORF1a	ORF1a	ORF1a	ORF1b	S	ORF3a	ORF3a	ORF8	ORF8	ORF8	N	N
Position	75	265	971	4715	614	57	251	24	62	84	203	204
USA clade‐1(53)	D53	T53	P53	P53	D53	Q53	G53	S53	V53	S53	R53	G53
USA clade‐2(20)	E14D6	T20	L14P6	P20	D20	Q20	G20	S20	L15V5	S20	R20	G20
USA clade‐3(46)	D46	T46	P46	P46	D46	Q46	G31V15	S46	V46	L46	R46	G46
USA clade‐4(30)	D30	T30	P30	L30	G30	Q30	G30	S30	V30	L30	R21K9	G21R9
USA clade‐5(40)	D40	T40	P40	L40	G40	H40	G40	S40	V40	L40	R40	G40
USA clade‐6(76)	D76	I76	P76	L76	G76	H76	G76	S69L7	V76	L76	R76	G76
RATG13	D	T	P	P	D	Q	G	S	V	S	R	G
Wuhan‐Hu‐1	D	T	P	P	D	Q	G	S	V	L	R	G

Note: Bold font represent mutation sites of six USA clades. The capital letters represents an amino acids. The numbers after clades and amino acids represent the number of virus strains. “Position” means the site of an amino acid in gene shown in the preceding row. The details can be found in Table S2.

Mismatch distribution analyses of assemblies sequences alignment of 265 SARS‐CoV‐2 strains Characteristics of amino acids mutations of different genes among six clades of 265 SARS‐CoV‐2 strains in the United States from January 19 to March 12, 2020 Note: Bold font represent mutation sites of six USA clades. The capital letters represents an amino acids. The numbers after clades and amino acids represent the number of virus strains. “Position” means the site of an amino acid in gene shown in the preceding row. The details can be found in Table S2.

Viral proteins sequences alignment

Proteins sequences are aligned to find out the common and different characteristics of each clade. Twelve viral proteins alignments of 267 strains are seen as Tables S2 and 1, in which it is found that compared with RaTG13, both E and ORF6 of 265 strain in United States have no mutation site, but the second amino acid, S mutates into I in ORF7b and the 35th amino acid, L mutates into F in ORF10, which may be related to the choice of the host. Each clade from phylogenetic tree has special mutation sites characteristics. Compared with the other clades and RaTG13, USA clade‐2 have some unique mutation sites (most of members), such as the 75th amino acid (D75E), the 265 amino acid (P971L) in ORF1a and the 62 amino acid (V62L) in ORF8. However, from Tables 1 and S2, it is easily seen that USA clade‐1 is same as the ancestral protein sequence in these mutation sites, seeming to indicate USA clade‐1 is the oldest clade, just as Figure 1. Except for both USA clade‐1 and USA clade‐2, it is found that the number of mutation sites gradually increases with the appearance of USA clade‐3 to USA clade‐6 and the 84th amino acid, S mutates into L in ORF8 in the clades. In addition, the mutation site (G215V, G31V15) in ORF3a is the unique clade characteristics of USA clade‐3. Compared with ancestral clades, in USA clade‐4 to USA clade‐6, the 4,715th amino acid, P mutates into L in ORF1b and the 614th amino acid, D mutates into G in protein S, which form a pair of coupling mutation sites. USA clade‐4 diverges into USA clade‐5 and USA clade‐6, represented by the mutation site, Q57H in ORF3a. USA clade‐6 is diverged from USA clade‐5, with both the sites of T265I site in ORF1a and S24L(S69L7) in ORF8 as a symbol. Moreover, a pair of coupling mutations, R203K and G204R in protein N are found in USA clade‐4.

Nucleotide substitution rates analysis

The synonymous substitution rate (Ks), nonsynonymous substitution rate (Ka) and the ω ratio (ω = Ka/Ks) can be used to infer the pressure of evolution in genes of SARS‐CoV‐2. According to the theory of Michael et al., Ks < 0.05 may indicate that genes experience a phase of accelerated proteins evolution in their early evolution history, followed by the period of a gradual increase in selective constraint, with the progressive decline of ω. Ks, Ka, and ω of viral gene from each clade can be seen as Figure 4 and Table 2, from which it is found that Ks of each gene is far <0.05 in all clades, indicating the viral genes are in the early stage of the evolution. For example, Ks of ORF8 is zero in all clades and Ka is also zero in five clades, suggesting ORF8 gene may be in the earliest period of evolutionary process. It is the same as other genes, such as ORF3a, ORF7a, M in most of clades, and S protein in clade‐4 to clade‐6. In the period, SARS‐CoV‐2 is experiencing accelerated evolution of some proteins in several clades, such as S in clade‐4 to clade‐6 (Ks = 0, Ka > 0). In addition, for each viral gene, the value of Ks is steady among different clades except for ORF1b in clade‐2 and clade‐3, partly verifying the assumption that synonymous substitutions are largely immune from selection and accumulate at a stochastic rate that is proportional to time. On the other aspect, similar to Ks, the value of Ka is also steady among all clades except for ORF1a in clade‐2 and clade‐3. Except for the genes of Ks = 0 or Ka = 0, the value of ω is smaller than 1, meaning that most of genes from SARS‐CoV‐2 were mainly affected by purifying selection. From Figure 4, it is observed that for ORF1b in USA clade‐2, USA clade‐3 and N in USA clade‐4, the values of ω are higher, indicating both of genes are under greater purifying pressure. For ORF1a, ω progressively decline from USA clade‐2 to USA clade‐6, which may mean that the strain gradually adapts to the environment with evolutionary process.

Figure 4

Table 2

A
	Synonymous substitution rate
Clade	ORF1a	ORF1b	S	ORF3a	ORF7a	N	M	ORF8
USA clade‐1	0.000393400	0.000587098	0.001191506	0	0.011363854	0.003365121	0	0
USA clade‐2	0.000334869	0.000863252	0.001528091	0	0	0.003492679	0.005893940	0
USA clade‐3	0.000356284	0.000981839	0.001108461	0	0	0.003403186	0	0
USA clade‐4	0.000411639	0.000560392	0	0.005102210	0.011363854	0.003724469	0	0
USA clade‐5	0.000406571	0.000560495	0	0	0	0.003365121	0	0
USA clade‐6	0.000397937	0.000567960	0	0.005110869	0	0	0	0

Ks: Nucleotide synonymous substitution rates (A), Ka: nonsynonymous substitution rates (B); ω: the ratio of nonsynonymous substitution rate to synonymous substitution rate (C) of different clades in each gene and the assemblies Ks: Nucleotide synonymous substitution rates (A), Ka: nonsynonymous substitution rates (B); ω: the ratio of nonsynonymous substitution rate to synonymous substitution rate (C) of different clades in each gene and the assemblies

PCA analysis

PCA analysis is usually used to analyze the potential evolutionary trends of genes codon usage patterns. PCA analysis plot of 265 assemblies is shown in Figure 5, from which it is seen that USA clade‐1 distributes in the first quadrant, USA clade‐2 distributes near the horizontal positive axis, USA clade‐3 in the fourth quadrant, and most points of USA clade‐4 to clade‐6 in the second and third quadrant. The distribution zones of USA clade‐1 and USA clade‐3 are relatively independent, but USA clade‐2 seem to partly overlap with the area of USA clade‐1 or USA clade‐3, suggesting that USA clade‐2 may be a interim clade of USA clade‐1 to USA clade‐3 in the virus evolution. USA clade‐4, USA clade‐5 and USA clade‐6 concentrate together, indicating the clades have many common mutation sites or amino acid sequences. There is no overlap between USA clade‐6 and three ancestral clades (USA clade‐1 to USA clade‐3), implying compared to other clades, USA clade‐6 is obviously different from its ancestors and it was also the newest product of SARS‐CoV‐2 evolution.

Figure 5

PCA in each clade of the assemblies. A plot with Axis 1 against Axis 2 was plotted based on RSCU values of all genes. Different clades are represented by different colors and shapes. PCA, principal component analysis; RSCU, relative synonymous codon usage

DISCUSSION

In the study, we analyzed the evolutionary characteristics of SARS‐CoV‐2 in United States from January 19 to March 12, 2020 and found the strains are classified into six clades (USA clade‐1 to USA clade‐6). The common ancestor of the clades may appear on December 24, 2019. Though Bayesian analysis seems to indicate that USA clade‐1 diverged from USA clade‐2, other more arguments suggest that the former is closer to the ancestors, such as phylogenetic tree, mutation sites similarity and codon usage pattern from PCA analysis. In other words, USA clade‐1 should be the oldest clade and the ancestor of SARS‐CoV‐2 in United States. The number of strains in USA clade‐1 also indicates the clade used to experience a scale of expansion and be more suitable to the environment at the time. USA clade‐2 is the closest to USA clade‐1 and USA clade‐3 in kinship (Figure 1), partly overlaps with both clades in codon usage pattern (Figure 5), and has the fewest members, which suggests that USA clade‐2 is a interim clade of the evolution process from clade‐1 to clade‐3. Compared with traditional classification, it is easily found that USA clade‐1 and USA clade‐2 can merge into S clade in GISAID website (Figure 6), in which the 8782nd nucleotide is T and the 28,144th is C, in accordance with Lineage A in the study of Jennifer et al. Compared with both the clades, the 84th amino acid, S mutates into L in ORF8 and the number of mutation sites gradually increases in USA clade‐3 to USA clade‐6, characterized by lineage B in Jennifer et al. According to the classification in GISAID website, USA clade‐3 is composed of the strains of L clade and V clade. USA clade‐4 is composed of G clade and GR clade. USA clade‐5 and USA clade‐6 are composed of GH clade. However, clade O is scattered among the four clades (Table S3 and Figure 6). The study of BII/GIS from A*STAR Singapore as of January 22, 2021 demonstrates that clade S is the ancestor, from which the clades of L and V are diverged, followed by clade G, clade GH, clade GR, clade GV and so on. Yu et al. also found that H13 and H35 (belonging to S clade) were ancestral haplotypes, which is simlar to our conclusion.

Figure 6

Phylogenetic tree and clades of the global viruses as of January 22, 2021, acquired in GISAID

Phylogenetic tree and clades of the global viruses as of January 22, 2021, acquired in GISAID Wuhan‐Hu‐1 (EPI_ISL_402125) is always regarded as the reference sequence or the ancestor sequence in phylogenetic analysis of SARS‐CoV‐2 strains , , , , but the strain belongs to USA clade‐3 in our research, the clade of L in GISAID website, Lineage B in Jennifer et al., which suggest that it is necessary to seek for a new strain as the source of ancestral or reference sequence. Combining 24 genome sequences in December 2019, we think the candidate should be EPI_ISL_529213. In other words, SARS‐CoV‐2 may have spread widely in some unknown manners before outbreak. The characteristic mutation sites in different clades may take part in the viral replication and transmission. For example, during the evolution from the first three clades (USA clade‐1, USA clade‐2, and USA clade‐3) to the last three clades (USA clade‐4, USA clade‐5, and USA clade‐6), the mutation D614G in S protein has been proved to strengthen viral infection and viral replication in human epithelial cells and the strong linkage with P4715L of ORF1ab showed significant positive correlations with fatality rates. Similarly, during the evolution from the first four clades to the last two clades, the mutation Q57H in ORF3a protein may enhance the capacity of virion release and viral pathogenicity. , During the evolution from the first two clades to the last four clades, the mutation S84L in ORF8 protein may help virus escape from immune response. During the evolution from other clades to the USA clade‐6, the mutation T265I in ORF1a protein may help SARS‐CoV‐2 to form an advantage group. Moreover, for USA clade‐4, newly strong linkage of R203K and G204R in N protein may help virus to for helical ribonucleocapsid All in all, we do not only describe the evolutionary process in six clades of SARS‐CoV‐2 in the early stage from the United States, but also identify their evolutionary direction and characteristics, which supplements the strain classification and helps to infer the current evolution trend of SARS‐CoV‐2. In addition, we newly find many special mutations in viral proteins in different clades, which lay foundation to study the function of viral mutational protein.

AUTHOR CONTRIBUTION STATEMENT

Conceptualization, methodology, software, formal analysis, investigation, writing—original draft, visualization: Ziying Lin. Data curation, formal analysis, and visualization: Hua Qing. Software and data curation: Rui Li. Software and data curation: Lei Zheng. Validation, supervision, and writing—original draft: Huipeng Yao. Supporting information. Click here for additional data file. Supporting information. Click here for additional data file. Supporting information. Click here for additional data file. Supporting information. Click here for additional data file. Supporting information. Click here for additional data file. Supporting information. Click here for additional data file.

55 in total

1. The evolutionary fate and consequences of duplicate genes.

Authors: M Lynch; J S Conery
Journal: Science Date: 2000-11-10 Impact factor: 47.728

Review 2. Properties of Coronavirus and SARS-CoV-2.

Authors: Y A Malik
Journal: Malays J Pathol Date: 2020-04 Impact factor: 0.656

3. Identification of severe acute respiratory syndrome coronavirus replicase products and characterization of papain-like protease activity.

Authors: Brian H Harcourt; Dalia Jukneliene; Amornrat Kanjanahaluethai; John Bechill; Kari M Severson; Catherine M Smith; Paul A Rota; Susan C Baker
Journal: J Virol Date: 2004-12 Impact factor: 5.103

4. Sixteen novel lineages of SARS-CoV-2 in South Africa.

Authors: Houriiyah Tegally; Eduan Wilkinson; Richard J Lessells; Jennifer Giandhari; Sureshnee Pillay; Nokukhanya Msomi; Koleka Mlisana; Jinal N Bhiman; Anne von Gottberg; Sibongile Walaza; Vagner Fonseca; Mushal Allam; Arshad Ismail; Allison J Glass; Susan Engelbrecht; Gert Van Zyl; Wolfgang Preiser; Carolyn Williamson; Francesco Petruccione; Alex Sigal; Inbal Gazy; Diana Hardie; Nei-Yuan Hsiao; Darren Martin; Denis York; Dominique Goedhals; Emmanuel James San; Marta Giovanetti; José Lourenço; Luiz Carlos Junior Alcantara; Tulio de Oliveira
Journal: Nat Med Date: 2021-02-02 Impact factor: 53.440

5. Severe acute respiratory syndrome coronavirus nsp9 dimerization is essential for efficient viral growth.

Authors: Zachary J Miknis; Eric F Donaldson; Timothy C Umland; Ryan A Rimmer; Ralph S Baric; L Wayne Schultz
Journal: J Virol Date: 2009-01-19 Impact factor: 5.103

6. Delicate structural coordination of the Severe Acute Respiratory Syndrome coronavirus Nsp13 upon ATP hydrolysis.

Authors: Zhihui Jia; Liming Yan; Zhilin Ren; Lijie Wu; Jin Wang; Jing Guo; Litao Zheng; Zhenhua Ming; Lianqi Zhang; Zhiyong Lou; Zihe Rao
Journal: Nucleic Acids Res Date: 2019-07-09 Impact factor: 16.971

Review 7. A Structural View of SARS-CoV-2 RNA Replication Machinery: RNA Synthesis, Proofreading and Final Capping.

Authors: Maria Romano; Alessia Ruggiero; Flavia Squeglia; Giovanni Maga; Rita Berisio
Journal: Cells Date: 2020-05-20 Impact factor: 6.600

8. Structural basis for inhibition of the RNA-dependent RNA polymerase from SARS-CoV-2 by remdesivir.

Authors: Wanchao Yin; Chunyou Mao; Xiaodong Luan; Dan-Dan Shen; Qingya Shen; Haixia Su; Xiaoxi Wang; Fulai Zhou; Wenfeng Zhao; Minqi Gao; Shenghai Chang; Yuan-Chao Xie; Guanghui Tian; He-Wei Jiang; Sheng-Ce Tao; Jingshan Shen; Yi Jiang; Hualiang Jiang; Yechun Xu; Shuyang Zhang; Yan Zhang; H Eric Xu
Journal: Science Date: 2020-05-01 Impact factor: 47.728