Phylogenetic analysis of SARS-CoV-2 in the first few months since its emergence.

Matías J Pereson^1,2, Laura Mojsiejczuk^1,2, Alfredo P Martínez^3, Diego M Flichman^2,4, Gabriel H Garcia^1, Federico A Di Lello^1,2.

Abstract

During the first few months of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) evolution in a new host, contrasting hypotheses have been proposed about the way the virus has evolved and diversified worldwide. The aim of this study was to perform a comprehensive evolutionary analysis to describe the human outbreak and the evolutionary rate of different genomic regions of SARS-CoV-2. The molecular evolution in nine genomic regions of SARS-CoV-2 was analyzed using three different approaches: phylogenetic signal assessment, emergence of amino acid substitutions, and Bayesian evolutionary rate estimation in eight successive fortnights since the virus emergence. All observed phylogenetic signals were very low and tree topologies were in agreement with those signals. However, after 4 months of evolution, it was possible to identify regions revealing an incipient viral lineage formation, despite the low phylogenetic signal since fortnight 3. Finally, the SARS-CoV-2 evolutionary rate for regions nsp3 and S, the ones presenting greater variability, was estimated as 1.37 × 10−3 and 2.19 × 10−3 substitutions/site/year, respectively. In conclusion, the results of this study on the variable diversity of crucial viral regions and the determination of the evolutionary rate are decisive for understanding essential features of viral emergence. In turn, the findings may allow the first-time characterization of the evolutionary rate of the S protein, which is crucial for vaccine development.

Keywords: SARS-CoV-2; evolution; evolutionary rate; phylogeny

Year: 2020 PMID: 32966646 PMCID: PMC7537150 DOI: 10.1002/jmv.26545

Source DB: PubMed Journal: J Med Virol ISSN: 0146-6615 Impact factor: 20.693

INTRODUCTION

Coronaviruses belong to the Coronaviridae family and have a single‐stranded, positive‐sense RNA genome of 26–32 kb in length. They have been identified in different avian hosts as well as in various mammals, including bats, mice, and dogs. Periodically, new mammalian coronaviruses are identified. In late December 2019, Chinese health authorities identified groups of patients with pneumonia of an unknown cause in Wuhan, Hubei Province, China. The pathogen, a new coronavirus called severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2), was identified by local hospitals using a surveillance mechanism for “pneumonia of unknown etiology.” The pandemic spread rapidly, and >28 million confirmed cases and nearly 900,000 deaths were reported in just an 8‐month period. The rapid viral spread raised interesting questions about the way its evolution is driven during the pandemic. The SARS‐CoV‐2 genome encodes 16 nonstructural proteins (nsp1‐16), 4 structural proteins (spike [S], envelope [E], membrane [M], and nucleoprotein [N]), and other proteins essential to complete the replication cycle. The large amount of currently available information makes it possible to follow, as never before, the real‐time evolutionary history of a virus since its interspecies jump. Most studies published to date have characterized the viral genome and evolution by analyzing complete genome sequences. Despite this, the viral genomic region providing the most accurate information to characterize SARS‐CoV‐2 has not yet been established. This lack of information hinders the investigation of its molecular evolution and the monitoring of its biological features, affecting the development of antiviral drugs and vaccines. Therefore, the aim of this study was to perform a comprehensive viral evolutionary analysis to describe the human outbreak and the molecular evolution rate of different genomic regions of SARS‐CoV‐2.

MATERIALS AND METHODS

Data sets

To generate a data set representing different geographic regions and time evolution of the SARS‐CoV‐2 pandemic from December 2019 to April 2020, data of all the complete genome sequences available at Global Initiative on Sharing All Influenza Data (GISAID) (https://www.gisaid.org/) on April 18, 2020 were collected. Data inclusion criteria were as follows: (a) complete genomes, (b) high coverage level, and (c) human hosts only (no other animals, cell culture, or environmental samples). Complete genomes were aligned using MAFFT against the Wuhan‐Hu‐1 reference genome (NC_045512.2, EPI_ISL_402125). The resulting multiple sequence alignment (Data set 1) was split into nine data sets corresponding to nine coding regions: (a) four structural proteins (envelope [E], nucleocapsid [N], spike [S], and Orf3a), (b) four nonstructural proteins (nsp1, nsp3, Orf6, and nsp14), and (c) an unknown function protein (Orf8). More than 6000 SARS‐CoV‐2 publicly available nucleotide sequences were downloaded. After selection of data according to the inclusion criteria, 1616 SARS‐CoV‐2 complete genomes were included in Data set 1. Sequences of Data set 1 came from 55 countries belonging to the five continents as follows: Africa: 39 sequences, Americas: 383 sequences, Asia: 387 sequences, Europe: 686 sequences, and Oceania: 121 sequences. After elimination of sequences with indeterminate or ambiguous positions, the number of analyzed sequences for each region was as follows: nsp1, 1608; nsp3, 1511; nsp14, 1550; S, 1488; Orf3a, 1600; E, 1615; Orf6, 1616; Orf8, 1612; and N, 1610. Finally, nucleotide sequences were grouped by fortnight (FN) according to their collection date. Table 1 summarizes the number of sequences per fortnight since the beginning of the pandemic up to FN8. Data set 2, in turn, was created using only the variable sequences of each region analyzed in Data set 1. Thus, Data set 1 was used for the analysis of amino acid substitutions and Data set 2 was used for the phylogenetic signal analysis and the construction of the Bayesian coalescent trees.
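The two filtering steps described above (dropping sequences with indeterminate positions, then keeping only non‐identical sequences) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names and the toy records are hypothetical, and it assumes that any character other than A, C, G, or T (including gaps) counts as indeterminate.

```python
from typing import Dict

UNAMBIGUOUS = set("ACGT")


def is_unambiguous(seq: str) -> bool:
    """True when the sequence holds only A, C, G, or T (assumption: N, gaps,
    and IUPAC ambiguity codes all count as indeterminate positions)."""
    return bool(seq) and set(seq.upper()) <= UNAMBIGUOUS


def filter_region(records: Dict[str, str]) -> Dict[str, str]:
    """Keep only records whose region sequence has no indeterminate positions."""
    return {name: seq for name, seq in records.items() if is_unambiguous(seq)}


def drop_identical(records: Dict[str, str]) -> Dict[str, str]:
    """Keep the first record of each distinct sequence.

    This is a simple stand-in for a duplicate-removal tool, not a
    reimplementation of any specific program."""
    seen = set()
    unique = {}
    for name, seq in records.items():
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique[name] = seq
    return unique


# Hypothetical toy records, not real GISAID data.
toy = {"seq1": "ATGGCT", "seq2": "ATGNCT", "seq3": "ATGGCT", "seq4": "ATGGTT"}
clean = filter_region(toy)      # drops seq2 (contains N)
unique = drop_identical(clean)  # drops seq3 (identical to seq1)
print(sorted(clean), sorted(unique))
```

Because the filter is applied per region, the number of retained sequences can differ between regions, which is consistent with the region‐specific totals reported above.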

Table 1

The number of SARS‐CoV‐2 sequences by fortnight (temporal structure)

Fortnight	Date	Median of analyzed sequences (Q1–Q3)
FN1	12/24/2019–12/31/2019	15
FN2	01/01/2020–01/15/2020	19
FN3	01/16/2020–01/31/2020	145 (136–145.5)
FN4	02/01/2020–02/15/2020	119 (113–120)
FN5	02/16/2020–03/02/2020	258 (247–259)
FN6	03/03/2020–03/17/2020	403 (390–406)
FN7	03/18/2020–04/01/2020	447 (416–450)
FN8	04/02/2020–04/17/2020	199 (197–201)
Total		1488–1616

Note: The total number of sequences is variable, depending on the analyzed region (nsp1, 1608; nsp3, 1511; nsp14, 1550; S, 1488; Orf3a, 1600; E, 1615; Orf6, 1616; Orf8, 1612; and N, 1610).

Abbreviations: FN, fortnight; Q1, quartile 1; Q3, quartile 3.
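As an illustration of the temporal grouping, the following sketch assigns a collection date to a fortnight using the date ranges listed in Table 1. The function name is hypothetical, and dates outside the study window are assumed to be left unassigned.

```python
from datetime import date
from typing import Optional

# Fortnight boundaries as listed in Table 1 (inclusive start and end dates).
FORTNIGHTS = [
    ("FN1", date(2019, 12, 24), date(2019, 12, 31)),
    ("FN2", date(2020, 1, 1), date(2020, 1, 15)),
    ("FN3", date(2020, 1, 16), date(2020, 1, 31)),
    ("FN4", date(2020, 2, 1), date(2020, 2, 15)),
    ("FN5", date(2020, 2, 16), date(2020, 3, 2)),
    ("FN6", date(2020, 3, 3), date(2020, 3, 17)),
    ("FN7", date(2020, 3, 18), date(2020, 4, 1)),
    ("FN8", date(2020, 4, 2), date(2020, 4, 17)),
]


def assign_fortnight(collected: date) -> Optional[str]:
    """Return the fortnight label for a collection date, or None if the date
    falls outside the study window (an assumption of this sketch)."""
    for label, start, end in FORTNIGHTS:
        if start <= collected <= end:
            return label
    return None


print(assign_fortnight(date(2020, 3, 2)))   # last day of FN5
print(assign_fortnight(date(2020, 3, 3)))   # first day of FN6
```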

The number of SARS‐CoV‐2 sequences by fortnight (temporal structure) Note: The total number of sequences is variable, depending on the analyzed region (nsp1, 1608; nsp3, 1511; nsp14, 1550; S, 1488; Orf3a, 1600; E, 1615; Orf6, 1616; Orf8, 1612; and N, 1610). Abbreviations: FN, fortnight; Q1, quartile 1; Q3, quartile 3.

Phylogenetic signal

To determine the phylogenetic signal of each of the nine generated alignments, likelihood mapping analyses were carried out using the Tree Puzzle v5.3 program and the quartet puzzling algorithm. This algorithm analyzes the tree topologies that can be completely resolved from all possible quartets of the n alignment sequences using maximum likelihood. An alignment in which >70%–80% of the quartets are fully resolved is considered to have strong statistical support. Identical sequences were removed with ElimDupes (available at https://www.hiv.lanl.gov/content/sequence/elimdupesv2/elimdupes.html), as they increase computation time and provide no additional information about dated phylogeny. The best‐fit evolutionary model for each data set was selected on the basis of the Bayesian Information Criterion obtained with the jModelTest v2.1.10 software.
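To make the quartet‐based logic concrete, the sketch below classifies a quartet from the posterior weights of its three possible topologies. It assumes the seven‐region partition of the likelihood‐mapping triangle, in which each point is assigned to the nearest attractor (three corners for resolved quartets, three edge midpoints for partly resolved quartets, and the center for unresolved quartets). The function names and example weights are hypothetical; this is not the Tree Puzzle implementation.

```python
from math import comb, dist
from typing import Iterable, Tuple

Weights = Tuple[float, float, float]

# Attractors of the seven regions of the likelihood-mapping triangle.
ATTRACTORS = {
    "resolved": [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    "partly_resolved": [(0.5, 0.5, 0.0), (0.5, 0.0, 0.5), (0.0, 0.5, 0.5)],
    "unresolved": [(1 / 3, 1 / 3, 1 / 3)],
}


def classify_quartet(weights: Weights) -> str:
    """Assign a quartet to the class of its nearest attractor.

    `weights` are the posterior weights of the three quartet topologies
    and are expected to sum to 1."""
    best_label, best_distance = "", float("inf")
    for label, points in ATTRACTORS.items():
        for point in points:
            d = dist(weights, point)
            if d < best_distance:
                best_label, best_distance = label, d
    return best_label


def unresolved_fraction(quartets: Iterable[Weights]) -> float:
    """Fraction of quartets that fall in the central (unresolved) region."""
    labels = [classify_quartet(w) for w in quartets]
    return labels.count("unresolved") / len(labels)


# Number of possible quartets for an alignment of n sequences is C(n, 4).
print(comb(11, 4))

# Hypothetical weights for three quartets.
examples = [(0.98, 0.01, 0.01), (0.49, 0.49, 0.02), (0.34, 0.33, 0.33)]
print([classify_quartet(w) for w in examples])
```

The number of quartets grows quickly with alignment size, which is one reason why removing identical sequences before the analysis reduces computation time.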

Analysis of amino acid substitutions

Entropy‐one (available at https://www.hiv.lanl.gov/content/sequence/ENTROPY/entropy_one.html) was used to determine the frequency of amino acids at each position for the nine genomic regions analyzed and to evaluate their permanence across the eight investigated fortnights in Data set 1.
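The per‐position frequency calculation can be sketched as follows. This is an illustrative stand‐in, not the Entropy‐one tool itself; the function names and the toy peptides are hypothetical, and the sequences are assumed to be aligned and of equal length.

```python
from collections import Counter
from typing import Dict, List


def residue_frequencies(sequences: List[str], position: int) -> Dict[str, float]:
    """Percentage of each residue at a 1-based position of aligned proteins."""
    if not sequences:
        return {}
    counts = Counter(seq[position - 1] for seq in sequences)
    total = sum(counts.values())
    return {residue: 100.0 * n / total for residue, n in counts.items()}


def substitution_by_fortnight(
    by_fortnight: Dict[str, List[str]], position: int, variant: str
) -> Dict[str, float]:
    """Percentage of sequences carrying `variant` at `position`, per fortnight."""
    return {
        fortnight: residue_frequencies(seqs, position).get(variant, 0.0)
        for fortnight, seqs in by_fortnight.items()
    }


# Hypothetical toy peptides grouped by fortnight (not real data).
toy = {
    "FN1": ["MDA", "MDA"],
    "FN2": ["MDA", "MGA", "MGA", "MDA"],
}
trend = substitution_by_fortnight(toy, position=2, variant="G")
print(trend)
```

Tracking the same position across consecutive fortnights is what allows a substitution that persists to be distinguished from one that appears only sporadically.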

Bayesian coalescence and phylogenetic analysis

To study the relationship among SARS‐CoV‐2 sequences, nine regions of the viral genome were investigated by Bayesian analyses. Phylogenetic trees were constructed using Bayesian inference with MrBayes v3.2.7a. Each gene was analyzed independently with the same data set used for the phylogenetic signal analysis, so that non‐identical sequences were included in the analysis. Analyses were run for five million generations and sampled every 5000 generations. Convergence of parameters (effective sample size [ESS] ≥ 200, with a 10% burn‐in) was verified with Tracer v1.7.1. Phylogenetic trees were visualized with FigTree v1.4.4.
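The burn‐in and convergence criteria can be illustrated with a short sketch. It discards the first 10% of a sampled trace and computes a simple autocorrelation‐based effective sample size, truncating the sum at the first non‐positive autocorrelation. This estimator is an assumption chosen for illustration and is not claimed to match the one implemented in Tracer; the traces are simulated, and the sketch ignores the initial state when counting samples (5,000,000 generations sampled every 5000 gives 1000 samples).

```python
import random
from typing import List, Sequence


def discard_burn_in(trace: Sequence[float], fraction: float) -> List[float]:
    """Drop the first `fraction` of the samples as burn-in."""
    start = round(len(trace) * fraction)
    return list(trace[start:])


def effective_sample_size(trace: Sequence[float]) -> float:
    """Simple ESS estimate: n / (1 + 2 * sum of leading positive autocorrelations)."""
    n = len(trace)
    mean = sum(trace) / n
    centered = [x - mean for x in trace]
    variance = sum(c * c for c in centered) / n
    if variance == 0:
        raise ValueError("ESS is undefined for a constant trace")
    tau = 1.0
    for lag in range(1, n):
        rho = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n * variance)
        if rho <= 0:
            break
        tau += 2.0 * rho
    return n / tau


# Simulated traces with 1000 samples each (5,000,000 // 5000).
n_samples = 5_000_000 // 5_000
rng = random.Random(1)
well_mixed = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
sticky = [0.0]
for _ in range(n_samples - 1):
    sticky.append(0.9 * sticky[-1] + rng.gauss(0.0, 1.0))  # strongly autocorrelated

kept_mixed = discard_burn_in(well_mixed, 0.10)
kept_sticky = discard_burn_in(sticky, 0.10)
ess_mixed = effective_sample_size(kept_mixed)
ess_sticky = effective_sample_size(kept_sticky)
print(len(kept_mixed), ess_mixed >= 200, ess_sticky >= 200)
```

A strongly autocorrelated chain yields far fewer effectively independent samples than its nominal length, which is why an ESS threshold is checked rather than the raw number of retained samples.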

Evolutionary rate

The nucleotide evolutionary rate was estimated with the BEAST v1.10.4 program package. Analyses were run at the CIPRES Science Gateway server. In total, 312 sequences without indeterminations corresponding to the nsp3 (5835 nt) and S (3822 nt) genes were randomly selected from Data set 1. The sequences represent all the fortnights and most of the geographical locations sampled until April 17. Temporal calibration was performed by the date of sampling. The appropriate evolutionary model was selected as described above for phylogenetic signal analysis. The TIM nucleotide substitution model was used for nsp3 and the HKY model for S. Analyses were carried out under a relaxed (uncorrelated lognormal) molecular clock model, as suggested by Duchene et al., and with an exponential demographic model, appropriate for early viral samples from an outbreak. Independent runs were performed for each data set, using a Markov chain Monte Carlo technique with a length of 1.3 × 10^9 steps, sampling every 1.3 × 10^6 steps. The convergence of the “mean rate” parameter (effective sample size [ESS] ≥ 200, burn‐in 10%) was verified with Tracer v1.7.1. Additionally, to verify the obtained results, 15 independent replicates of the analysis were performed with the time calibration information (date of sampling) randomized, as described by Rieux and Khatchikian. Finally, the obtained parameters for real data and the randomized replicates were compared.
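The idea behind the date‐randomization test can be sketched in a few lines: sampling dates are shuffled among the tips so that each replicate keeps the same set of dates but breaks their association with the sequences. The sketch below is not the procedure of Rieux and Khatchikian as implemented in their software; the function name, the seed, and the toy dates are hypothetical.

```python
import random
from typing import Dict, List


def randomize_dates(
    tip_dates: Dict[str, float], n_replicates: int, seed: int = 0
) -> List[Dict[str, float]]:
    """Return `n_replicates` copies of `tip_dates` with the dates permuted
    among the tips (same set of dates, shuffled assignment)."""
    rng = random.Random(seed)
    names = list(tip_dates)
    dates = [tip_dates[name] for name in names]
    replicates = []
    for _ in range(n_replicates):
        shuffled = dates[:]
        rng.shuffle(shuffled)
        replicates.append(dict(zip(names, shuffled)))
    return replicates


# Hypothetical tips with decimal-year sampling dates (not real data).
toy_dates = {"tipA": 2019.98, "tipB": 2020.05, "tipC": 2020.12, "tipD": 2020.29}
replicates = randomize_dates(toy_dates, n_replicates=15)
print(len(replicates), replicates[0])
```

Each randomized replicate would then be analyzed with the same model settings as the real data, and the resulting rate estimates compared with the estimate obtained from the true sampling dates.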

RESULTS

Using bioinformatics tools, a phylogenetic signal study was carried out to identify the most informative SARS‐CoV‐2 genomic regions. The likelihood mapping analysis showed that most genes have a very poor phylogenetic signal, with high values in the central region that represents the area of unresolved quartets (Figure 1). Accordingly, genes could be separated into three groups: the first group with little or no phylogenetic signal (E, Orf6, Orf8, nsp1, and nsp14), the second group with a low phylogenetic signal (Orf3a and N), and the last group with a relatively higher phylogenetic signal (S and nsp3), although still too low to be considered robust (unresolved quartets >40%).

Figure 1

The phylogenetic signal for SARS‐CoV‐2 data sets. The presence of the phylogenetic signal was evaluated by likelihood mapping, showing unresolved quartets (center) and partly resolved quartets (edges) for genomes available on April 17 for the nine analyzed regions: nsp1 (29 sequences), nsp3 (225 sequences), nsp14 (65 sequences), S (183 sequences), Orf3a (74 sequences), E (11 sequences), Orf6 (12 sequences), Orf8 (23 sequences), and N (113 sequences). A strong phylogenetic signal (<40% unresolved quartets) was not observed for any region.

The analysis of amino acid substitutions by fortnight was useful to study the viral evolutionary dynamics at the beginning of the pandemic. When analyzing amino acid sequences from different time periods, changes were observed in 5 of 9 genomic regions and in only 14 of 4875 (0.29%) evaluated residues. In most of the regions, except nsp1, nsp14, E, and Orf6, 2–6 amino acid substitutions emerged from FN3 onward and remained until the end of the follow‐up period (Table 2). Particularly, in the Orf8 region, early selection of two amino acid substitutions (V62L and L84S) was observed in FN2. In the S region, by contrast, the D614G substitution started with <2% in FN3 and FN4 and reached 88% in the last fortnight. In a similar way, the Q57H (Orf3a) substitution increased from 6% to 34%, whereas selection of the L84S (Orf8) substitution started in FN2 and reached 6% at FN8. The R203K and G204R substitutions from the N region emerged in FN4 and increased their population proportion to values >20% toward the end of the follow‐up period. Moreover, the emergence of a great number of sporadic substitutions that remained in the population for a short period (1–3 fortnights) was observed in the nine analyzed regions. Indeed, 333 (6.83%) positions from the total analyzed presented at least one substitution throughout the eight fortnights. Table 3 summarizes the number of variable positions, number of mutations, and number of sequences with mutations by region.

Table 2

Amino acids selected by region and fortnight. The number indicates the amino acid location in its protein

		Amino acid percentage by FN
Region	Amino acid substitution	FN1	FN2	FN3	FN4	FN5	FN6	FN7	FN8
nsp3	A58T	0	0	0	1.0	6.0	3.0	3.0	2.5
	P135L	0	0	0.8	0	0	1.5	0.5	2.5
S	D614G	0	0	1.5	1.8	37.0	64.0	75.0	88.0
Orf3a	Q57H	0	0	0	0	6.0	22.0	23.0	34.0
	G196V	0	0	0	0	0.8	4.0	0.9	0.5
	G251V	0	0	8.0	24.0	8.0	9.0	10.0	3.0
Orf8	V62L	0	5.0	1.0	3.3	0.0	1.5	1.3	3.0
Orf8	L84S	0	42.0	37.0	21.0	21.0	18.0	7.0	6.0
N	P13L	0	0	0	0	1.0	1.0	2.5	0.5
	S197L	0	0	0	0	1.1	5.0	0.9	0.5
	S202N	0	0	3.5	4.2	0	0.5	2.2	2.5
	R203K	0	0	0	0	17.0	19.0	24.0	23.0
	G204R	0	0	0	0	17.0	19.0	24.0	23.0
	I292T	0	0	0	0	2.0	0.2	0.2	0.5

Note: Only regions where amino acid change was selected and remained until the last analyzed fortnight are shown.

Abbreviation: FN, fortnight.

Table 3

The number of variable positions, number of mutations, and number of sequences with mutation by region

Region	No. of variable aa positions (%)	No. of aa substitutions	No. of sequences with aa substitutions (%)
nsp1 (180aa)	3 (1.7)	37	37 (2.4)
nsp3 (1945aa)	158 (8.1)	322	294 (19.3)
nsp14 (527aa)	6 (1.1)	83	83 (5.5)
S (1273aa)	76 (5.9)	1013	904 (59.4)
Orf3a (275aa)	11 (4)	491	468 (30.7)
E (75aa)	5 (6.7)	6	6 (0.4)
Orf6 (60aa)	7 (11.6)	9	9 (0.6)
Orf8 (121aa)	14 (11.6)	312	288 (18.9)
N (419aa)	53 (12.6)	760	470 (30.9)
Total (4875aa)	333 (6.8)	3033	–

Abbreviation: aa, amino acid.
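The percentages of variable positions in Table 3 follow directly from the region lengths and counts. The sketch below recomputes them and the overall totals; the region lengths and counts are taken from Table 3, while the variable names are hypothetical.

```python
# Region: (protein length in aa, number of variable aa positions), from Table 3.
TABLE3 = {
    "nsp1": (180, 3),
    "nsp3": (1945, 158),
    "nsp14": (527, 6),
    "S": (1273, 76),
    "Orf3a": (275, 11),
    "E": (75, 5),
    "Orf6": (60, 7),
    "Orf8": (121, 14),
    "N": (419, 53),
}

# Percentage of variable positions per region.
percent_variable = {
    region: 100.0 * variable / length
    for region, (length, variable) in TABLE3.items()
}

total_length = sum(length for length, _ in TABLE3.values())
total_variable = sum(variable for _, variable in TABLE3.values())
overall_percent = 100.0 * total_variable / total_length

for region, pct in percent_variable.items():
    print(f"{region}: {pct:.2f}%")
print(f"Total: {total_variable} of {total_length} positions ({overall_percent:.2f}%)")
```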

Bayesian coalescence analysis

In this study, trees were analyzed by Bayesian analysis instead of distance, likelihood, or parsimony methods. Consistent with the phylogenetic signal analysis, trees for nsp1, E, and Orf6 showed a star‐like topology. Nevertheless, different degrees of clade formation could be observed in trees of the Orf8, nsp14, Orf3a, N, S, and nsp3 regions (Figure 2). Among these regions, nsp3 and S showed the best clade constitution. This analysis made it possible to differentiate regions displaying a diversification process (nsp3, nsp14, Orf3a, S, Orf8, and N) from those that, even after 4 months, showed only an incipient one (nsp1, E, and Orf6). Furthermore, this nucleotide analysis is complemented by the previous study of amino acid variations in each region. However, it is important to note that, due to the low phylogenetic signal observed for each region, results can only be considered preliminary.

Figure 2

Bayesian trees of 29 sequences of nsp1 (540 nt), 225 sequences of nsp3 (5835 nt), 65 sequences of nsp14 (1581 nt), 183 sequences of S (3822 nt), 74 sequences of Orf3a (828 nt), 11 sequences of E (228 nt), 12 sequences of Orf6 (186 nt), 23 sequences of Orf8 (366 nt), and 113 sequences of N (1260 nt). Scale bar represents substitutions per site.

Nsp3 and S sequences were selected to perform the evolutionary rate analysis, as both regions provided the best phylogenetic information among the studied regions. The evolutionary rate for the SARS‐CoV‐2 nsp3 region was estimated as 1.37 × 10−3 nucleotide substitutions per site per year (s/s/y) (95% HPD interval 9.16 × 10−4 to 1.91 × 10−3). The corresponding rate for S was estimated as 2.19 × 10−3 nucleotide s/s/y (95% HPD interval 1.29 × 10−3 to 3.19 × 10−3). In both genomic regions, date randomization analyses showed no overlap between the 95% HPD substitution rate intervals obtained from real data and those obtained from date‐randomized data sets. This suggests that the original data set has enough temporal signal to perform analyses with temporal calibration based on tip dates (Figure 3).

Figure 3

A comparison of the evolutionary rates estimated using BEAST for the original data set and the date‐randomized data sets (312 sequences). This analysis was performed for regions nsp3 (5835 nt) and S (3822 nt). s.s.y. = substitutions/site/year
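The comparison of 95% HPD intervals and the scale of the estimated rates can be illustrated with a short sketch. The nsp3 and S intervals, rates, and region lengths are the values reported above; the randomized interval is a hypothetical placeholder (the randomized estimates are not listed in the text), and the 4‐month period expressed as 4/12 of a year is an assumption of this example.

```python
from typing import Tuple

Interval = Tuple[float, float]


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True when two closed intervals share at least one point."""
    low_a, high_a = sorted(a)
    low_b, high_b = sorted(b)
    return low_a <= high_b and low_b <= high_a


def expected_substitutions(rate: float, length_nt: int, years: float) -> float:
    """Expected substitutions per lineage: rate (s/s/y) x sites x time."""
    return rate * length_nt * years


# Reported 95% HPD intervals (substitutions/site/year).
NSP3_HPD = (9.16e-4, 1.91e-3)
S_HPD = (1.29e-3, 3.19e-3)

# Hypothetical placeholder for one date-randomized replicate (not a reported value).
RANDOMIZED_HPD = (1.0e-5, 2.0e-4)

print(intervals_overlap(NSP3_HPD, RANDOMIZED_HPD))  # no overlap with the placeholder
print(intervals_overlap(NSP3_HPD, S_HPD))           # the two real-data intervals overlap

# Expected substitutions per lineage over roughly 4 months (assumed 4/12 year).
nsp3_expected = expected_substitutions(1.37e-3, 5835, 4 / 12)
s_expected = expected_substitutions(2.19e-3, 3822, 4 / 12)
print(round(nsp3_expected, 2), round(s_expected, 2))
```

Under these assumptions, each region is expected to accumulate only a few substitutions per lineage over the study period, which is in line with the low phylogenetic signal described earlier.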

DISCUSSION

The phylogenetic characterization of an emerging virus is crucial to understand the way the virus and the pandemic will evolve. On the one hand, a detailed study of the SARS‐CoV‐2 genome contributes to the knowledge of viral diversity and helps detect the most suitable regions to be used as antiviral or vaccine targets. On the other hand, the large amount of information that has been continuously generated since the emergence of SARS‐CoV‐2 in human beings makes it possible to study its genome and describe the real‐time evolution of a new virus like never before. In the present study, the molecular evolution and viral lineages of SARS‐CoV‐2 in nine genomic regions, during eight successive fortnights, were analyzed using three different approaches: phylogenetic signal assessment, the emergence of amino acid substitutions, and Bayesian evolutionary rate estimation. In this context, the observed phylogenetic signals of the nine coding regions were very low and the obtained trees were consistent with this finding, showing star‐like topologies in some viral regions (nsp1, E, and Orf6). However, after a 4‐month evolution period, it was possible to identify regions (nsp3, S, Orf3a, Orf8, and N) revealing an incipient formation of viral lineages from FN3, both at the nucleotide and amino acid levels, despite the low phylogenetic signal. On the basis of these findings, the SARS‐CoV‐2 evolutionary rate was estimated, for the first time, for the two regions showing higher variability (S and nsp3). With respect to the phylogenetic signal, several simulation studies have shown that for a set of sequences to be considered robust, the central and lateral areas representing the unresolved quartets must not exceed 40%. In this regard, none of the nine analyzed regions met this requirement. Three regions (E, nsp1, and Orf6) presented values of 100% unresolved quartets. Most regions (nsp14, Orf3a, Orf8, and N) reached values higher than 85%. 
Only in regions nsp3 and S did the proportion of unresolved quartets drop to ~60%. Thus, despite SARS‐CoV‐2 being a virus with an RNA genome, the short time elapsed since its emergence and possibly genetic restrictions have led to a constrained evolution in these months. For this reason, trees generated from SARS‐CoV‐2 partial sequences in the first months of the pandemic are expected to be unreliable for defining clades and should therefore be analyzed with caution. As Bayesian analysis makes it possible to infer phylogenetic patterns from tree distributions, it represents a more reliable tool to compare different evolutionary behaviors. Bayesian analysis helps to obtain a tree topology that is closer to reality in the current conditions of the SARS‐CoV‐2 pandemic. The phylogenetic analysis for the nsp1, E, and Orf6 regions confirmed the star‐like topologies, in accordance with a lower diversification of these regions using the sequences available up to FN8 (Figure 2). Trees generated from nsp14 and Orf8 are at an intermediate point, where the formation of small clusters can be observed. In fact, a mutation at position 28,144 (Orf8: L84S) has been proposed as a possible marker for viral classification. Finally, trees obtained from regions Orf3a, N, nsp3, and S showed the best clade formation. Indeed, in the most variable regions, nsp3 and S, it can be clearly seen that sequences are separated into two large groups. Although the clusters observed for nsp3 and S showed high support values, these results should be interpreted with caution, and longer periods should be considered to obtain more accurate phylogenetic data. However, even when data are not the most accurate to study the spread or clade formation, they provide a good representation of the way the virus is evolving. The analysis of amino acid frequencies allowed identifying different degrees of region conservation throughout the viral genome due to positive and negative pressures. 
In particular, nsp3, S, Orf8, and N showed some substitutions in high frequencies. This would indicate, as other authors have previously reported, the frequent circulation of polymorphisms due to a significant positive pressure. Additionally, as S and N are among the candidates to be used in the formulation of vaccines and antibody treatment, it will be important to monitor these substitutions in different geographic regions to improve treatment and vaccination efficacy. In particular, the appearance of the D614G variant in the third fortnight and its rapid increase until reaching an 88% prevalence in the eighth fortnight could reflect an improvement in viral fitness, as has been previously reported. This is supported by studies on SARS‐CoV showing that predicted S protein domains underwent the most extensive amino acid substitutions and the strongest positive selection. In contrast, in regions nsp1, nsp14, E, and Orf6, no substitutions were selected during the first 4 months of the pandemic. This would suggest that these regions present constraints to change due to a great negative selection pressure, as has been recently reported. In the present study, the evolutionary rate for SARS‐CoV‐2 genes was estimated by analyzing a large number of sequences, which were carefully curated and had a good temporal and spatial structure. Additionally, the most phylogenetically informative regions of the genome (nsp3 and S) were used for analysis, reinforcing confidence in the results. Previous studies on SARS‐CoV‐2 have reported similar data, ranging from 1.79 × 10−3 to 6.58 × 10−3 s/s/y, for the complete genome. However, in both articles, small data sets of complete genomes were used (N = 32 and 54, respectively). As these studies were performed early in the outbreak, the temporal structure of their data sets could have led to less precise estimates of the evolutionary rate. 
Alternatively, another study from van Dorp et al., analyzing 7666 sequences, has obtained different results with a remarkably low evolutionary rate (6 × 10−4 nucleotide/genome/year). However, it is important to consider that van Dorp et al. estimated the evolutionary rate using the complete genome, including several highly conserved genomic regions, whereas in our work, the estimation was performed with the most variable regions of the genome. Additionally, tests randomizing the dates of nsp3 and S data sets were carried out; they showed that these partial genomic regions have enough temporal structure and that they are informative, allowing the estimation of evolutionary rates. In this context, our results (1.37 × 10−3 s/s/y for nsp3 and 2.19 × 10−3 s/s/y for S) are in close agreement with those published for the SARS‐CoV genome, which have been estimated to range between 0.80 and 3.01 × 10−3 s/s/y. In particular, Zhao et al. estimated a similar evolutionary rate for the SARS‐CoV S gene. Moreover, our estimated values are in the same order of magnitude as those of other RNA viruses. Although we should be cautious with the interpretation of these results, our date randomization analysis indicated a robust temporal signal. In addition, the importance of separately studying the evolutionary rate of the S genomic region arises from the fact that it represents the main target for antiviral agents and vaccines, as it includes the SARS‐CoV‐2 receptor‐binding domain, a crucial structure for the virus to enter host cells, and the binding site for neutralizing antibodies. Furthermore, a re‐infection case occurring 142 days after the first infection episode has been reported. The second infection virus sequence showed 4 changes out of 14 amino acids in the spike protein and 2 changes in nsp3, the two genes considered phylogenetically most informative in our work. 
As neutralizing antibodies are targeted against the spike protein, a high evolutionary rate in this gene can imply changes in the circulating virus, thereby making it less susceptible to neutralizing antibodies generated during the first infection. In fact, certain mutations in the spike protein, more precisely in the receptor‐binding domain and in the N‐terminal domain, have been reported to confer a reduced susceptibility to neutralizing antibodies. For this reason, the evolutionary rate of the S and nsp3 genes, reported separately here for the first time, is a crucial issue, as it may have implications for vaccine development, vaccine efficacy, or natural re‐infections. Despite the limitations of the evolutionary study of an emerging virus, where selection pressures are still low and variability is therefore low, this study has an advantage: the extremely careful selection of a large sequence data set to be analyzed. First, sequences were selected considering their good temporal signal and their balanced spatial (geographic) distribution. Second, attention was paid to eliminating sequences with low coverage and indeterminacies that could generate bias in the phylogenetic analysis of a virus that is beginning to evolve in a new host. The appearance of a virus in a new host means an adaptation challenge. In this sense, both SARS‐CoV and SARS‐CoV‐2 have shown a rapid emergence of several lineages in a short period, reflecting a high adaptability. However, the spike of SARS‐CoV‐2 binds to the host cell receptor with a 10–20‐fold greater affinity as compared with SARS‐CoV and contains a polybasic (furin) cleavage site insertion, which may enhance the virus infectivity. Thus, changes in the S protein make an important contribution, bringing SARS‐CoV‐2 to the spillover stage, with a significantly higher spread than SARS‐CoV and MERS‐CoV. As a result, SARS‐CoV‐2 has caused the most important pandemic of the century. 
In this context, results obtained in this study about the uneven diversity of nine crucial viral regions and the determination of the evolutionary rate are decisive to understanding essential features of viral emergence. Nevertheless, monitoring SARS‐CoV‐2 population will be required to determine the evolutionary dynamics of new mutations as well as to understand the way they affect viral fitness in human hosts.

CONFLICT OF INTERESTS

The authors declare that there is no conflict of interests.

AUTHOR CONTRIBUTIONS

Data curation, acquisition of data, analysis and interpretation of data, drafting the article, final approval of the version to be submitted: Matías J. Pereson. Data curation, acquisition of data, analysis and interpretation of data, revising the article critically for important intellectual content, final approval of the version to be submitted: Laura Mojsiejczuk. Data curation, validation, revising the article critically for important intellectual content, final approval of the version to be submitted: Alfredo P. Martínez. Data curation, validation, drafting the article, final approval of the version to be submitted: Diego M. Flichman. Data curation, acquisition of data, analysis and interpretation of data, drafting the article, final approval of the version to be submitted: Gabriel H. Garcia. Conception and design of the study, acquisition of data, analysis and interpretation of data, drafting the article, final approval of the version to be submitted: Federico A. Di Lello.

45 in total

1. Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment.

Authors: K Strimmer; A von Haeseler
Journal: Proc Natl Acad Sci U S A Date: 1997-06-24 Impact factor: 11.205

2. tipdatingbeast: an r package to assist the implementation of phylogenetic tip-dating tests using beast.

Authors: Adrien Rieux; Camilo E Khatchikian
Journal: Mol Ecol Resour Date: 2016-10-25 Impact factor: 7.090

3. The race for coronavirus vaccines: a graphical guide.

Authors: Ewen Callaway
Journal: Nature Date: 2020-04 Impact factor: 49.962

4. Transmission dynamics and evolutionary history of 2019-nCoV.

Authors: Xingguang Li; Wei Wang; Xiaofang Zhao; Junjie Zai; Qiang Zhao; Yi Li; Antoine Chaillon
Journal: J Med Virol Date: 2020-02-14 Impact factor: 2.327

5. A pneumonia outbreak associated with a new coronavirus of probable bat origin.

Authors: Peng Zhou; Xing-Lou Yang; Xian-Guang Wang; Ben Hu; Lei Zhang; Wei Zhang; Hao-Rui Si; Yan Zhu; Bei Li; Chao-Lin Huang; Hui-Dong Chen; Jing Chen; Yun Luo; Hua Guo; Ren-Di Jiang; Mei-Qin Liu; Ying Chen; Xu-Rui Shen; Xi Wang; Xiao-Shuang Zheng; Kai Zhao; Quan-Jiao Chen; Fei Deng; Lin-Lin Liu; Bing Yan; Fa-Xian Zhan; Yan-Yi Wang; Geng-Fu Xiao; Zheng-Li Shi
Journal: Nature Date: 2020-02-03 Impact factor: 69.504

Review 6. Origin and evolution of pathogenic coronaviruses.

Authors: Jie Cui; Fang Li; Zheng-Li Shi
Journal: Nat Rev Microbiol Date: 2019-03 Impact factor: 60.633

7. Mutational dynamics of the SARS coronavirus in cell culture and human populations isolated in 2003.

Authors: Vinsensius B Vega; Yijun Ruan; Jianjun Liu; Wah Heng Lee; Chia Lin Wei; Su Yun Se-Thoe; Kin Fai Tang; Tao Zhang; Prasanna R Kolatkar; Eng Eong Ooi; Ai Ee Ling; Lawrence W Stanton; Philip M Long; Edison T Liu
Journal: BMC Infect Dis Date: 2004-09-06 Impact factor: 3.090

8. SARS-CoV-2 and ORF3a: Nonsynonymous Mutations, Functional Domains, and Viral Pathogenesis.

Authors: Elio Issa; Georgi Merhi; Balig Panossian; Tamara Salloum; Sima Tokajian
Journal: mSystems Date: 2020-05-05 Impact factor: 6.496

9. A Snapshot of SARS-CoV-2 Genome Availability up to April 2020 and its Implications: Data Analysis.

Authors: Carla Mavian; Simone Marini; Mattia Prosperi; Marco Salemi
Journal: JMIR Public Health Surveill Date: 2020-06-01

10. The global spread of 2019-nCoV: a molecular evolutionary analysis.

Authors: Domenico Benvenuto; Marta Giovanetti; Marco Salemi; Mattia Prosperi; Cecilia De Flora; Luiz Carlos Junior Alcantara; Silvia Angeletti; Massimo Ciccozzi
Journal: Pathog Glob Health Date: 2020-02-12 Impact factor: 2.894
