Literature DB >> 34044192

E484K as an innovative phylogenetic event for viral evolution: Genomic analysis of the E484K spike mutation in SARS-CoV-2 lineages from Brazil.

Patrícia Aline Gröhs Ferrareze¹, Vinícius Bonetti Franceschi², Amanda de Menezes Mayer², Gabriel Dickin Caldana¹, Ricardo Ariel Zimerman³, Claudia Elizabeth Thompson⁴.

Abstract

The COVID-19 pandemic caused by SARS-CoV-2 has affected millions of people since its beginning in 2019. The propagation of new lineages and the discovery of key mechanisms adopted by the virus to overlap the immune system are central topics for the entire public health policies, research and disease management. Since the second semester of 2020, the mutation E484K has been progressively found in the Brazilian territory, composing different lineages over time. It brought multiple concerns related to the risk of reinfection and the effectiveness of new preventive and treatment strategies due to the possibility of escaping from neutralizing antibodies. To better characterize the current scenario we performed genomic and phylogenetic analyses of the E484K mutated genomes sequenced from Brazilian samples in 2020. From October 2020, more than 40% of the sequenced genomes present the E484K mutation, which was identified in three different lineages (P.1, P.2 and B.1.1.33 - posteriorly renamed as N.9) in four Brazilian regions. We also evaluated the presence of E484K associated mutations and identified selective pressures acting on the spike protein, leading us to some insights about adaptive and purifying selection driving the virus evolution.

Entities: Chemical

Keywords: COVID-19; E484K; Infectious diseases; Severe acute respiratory syndrome coronavirus 2; Viral evolution

Mesh：

Substances：

Year: 2021 PMID： 34044192 PMCID： PMC8143912 DOI： 10.1016/j.meegid.2021.104941

Source DB: PubMed Journal: Infect Genet Evol ISSN： 1567-1348 Impact factor: 4.393

Introduction

The etiological agent of COVID-19, named SARS-CoV-2, belongs to the Coronaviridae family composed of enveloped positively oriented single-stranded RNA viruses (Zhu et al., 2020). The first cases of the disease were reported in Wuhan province, China, in December 2019. Since then the number of infected people has increased exponentially worldwide and the World Health Organization (WHO) declared it a pandemic on 11 March 2020. As of 24 January 2021, the number of confirmed cases globally has scaled up to 99 million, with over 2 million deaths. Brazil is the third country in total number of confirmed cases, already exceeding 8.8 million infected people and 216,445 deaths (Johns Hopkins Coronavirus Resource Center, 2021). Compared to Latin America, which represents ~18% of the world cases, Brazil accounts for almost half of the COVID-19 notifications. However, the number of cases per million people(~41,600) is comparable to other countries such as Argentina (~41,300), Colombia (~39,600), Costa Rica (~37,160), Chile (~36,570), and Peru (~33,170) (Ritchie et al., 2021). Nearly 410,000 genomes (up to January 2021) sequenced by researchers from several countries are available on the Global Initiative on Sharing All Influenza Data (GISAID) (Shu and McCauley, 2017), which is crucial to investigate the SARS-CoV-2 genomic epidemiology. A standardized nomenclature was proposed to reflect genetic characteristics and the viral geographical spread patterns. Two ancestral lineages, A and B, have been described containing sublineages that are distinguished by different recurrent mutations and phylogenetic profiles (Rambaut et al., 2020a). More recently, special attention has been directed to the understanding of aspects that might impact the virulence and transmissibility of SARS-CoV-2 (Korber et al., 2020; Li, 2020; Toyoshima et al., 2020; Volz et al., 2020) and the influence of different mutations in the effectiveness of COVID-19 vaccines and immunotherapies (Andreano et al., 2020; Pinto et al., 2020; Yu et al., 2020). Some viral lineages have been studied in more detail, mainly those carrying mutations in the spike (S) glycoprotein, since it is the binding site to the human ACE2 (hACE2) receptor, an essential step to invade the host cell. Currently, there are three lineages of major worldwide concern: B.1.1.7, B.1.351, and P.1. The former emerged in England in mid-September 2020 and it is characterized by 14 lineage-specific amino acid substitutions, 7 deletions (Rambaut et al., 2020b) and has been rapidly spread across the UK and Europe ever since its first appearance (Rambaut et al., 2020c). Two substitutions present in this lineage deserve special attention: N501Y in the Receptor Binding Domain (RBD) of S1 and P681H near the polybasic RRAR sequence in the furin-like cleavage region. N501Y is one of the key contact residues interacting with hACE2 and P681H is one of four residues comprising the insertion that creates a furin-like cleavage site between S1 and S2, which is not found in closely-related coronaviruses (Xia et al., 2020). The second lineage probably emerged in South Africa in August 2020 and harbors three mutations in RBD: K417N, E484K, and N501Y (Tegally et al., 2020). The most recent lineage is P.1, derived from B.1.1.28, which was recently identified in returning travelers from Manaus (Amazonas state, Brazil) who arrived in Japan. It has almost the same three mutations present in RBD as the South African lineage, except for the substitution in amino acid site 417, where the original lysine (K) is substituted for a threonine (T), instead of an asparagine (N). Its appearance is likely to have arisen independently (Faria et al., 2021). Without previous introductions in other Latin American countries, which have been affected by their own specific mutations, such as the convergent evolution of the S:T1117I in Costa Rica (Molina-Mora et al., 2021) the E484K mutation arose independently in Brazil and was identified in the Rio de Janeiro state (Southeast Brazil) in early-October carried by the P.2 lineage (Voloch et al., 2021). In January 2021, the E484K mutation was identified in several viral genomes from Brazil. Due to its rapid spread, the independent origins and the potential implications in vaccination and passive immune therapies, E484K has received particular attention ever since. In Qatar, 14 days after the second vaccine dose (Pfizer–BioNTech), the effectiveness against the B.1.351 variant was 75.0% (Abu-Raddad et al., 2021); in South Africa, the AstraZeneca ChAdOx1 vaccine presented only 10% protection against mild-to-moderate disease associated with the B.1.351, while 75% of protection was obtained against B.1.1.7 in the U.K. (Gupta, 2021). However, in Brazil, the CoronaVac/Sinovac effectiveness was estimated at around 50% against symptomatic infection P.1 variant (Gupta, 2021). Importantly, B.1.351, P.1, and P.2 carry this substitution associated with escape from neutralizing antibodies (Baum et al., 2020; Greaney et al., 2021; Weisblum et al., 2020). Recently, this mutation was also identified in a sample from a reinfected patient (Nonaka et al., 2021). Moreover, B.1.1.7, B.1.351, and P.1 also harbor N501Y mutation, which is associated with enhanced receptor affinity (Starr et al., 2020) and increased infectivity and virulence in a mouse model (Gu et al., 2020). The presence of E484K and N501Y substitutions in the same SARS-CoV-2 genome may be particularly relevant for viral evolution. This combination was shown to induce more conformational changes than the N501Y mutant alone, potentially altering antibodies' complementarity to this region resulting in the above mentioned immune escape phenomena (Nelson et al., 2021). Structural analysis points to E484K as potentially the most crucial mutation so far in brazilian SARS-CoV-2 genomes. It creates a new site for the amino acid 75 hACE-2 binding. This interaction seems even stronger than the binding between hACE-2 and the original main site located at position 501 (at RBD and hACE-2 interface) (Nelson et al., 2021). We speculate that the consequent neutralization escape due to E484K alone or as a part of a larger array of distinct mutations might act as a common evolutionary solution for several different viral lineages. Due to the independent origin of two current known lineages harboring E484K in Brazil, we aimed to describe the mutation patterns of these lineages, investigate phylogenetic relationships and perform positive selection tests to identify if adaptive evolution has acted as a major evolutionary force leading to the increase of amino acid variability in the RBD sites.

Results

Comparative genomic analyses

Firstly detected in the South African B.1.351 lineage, the S1 protein mutation E484K is now present in new emerging variants from Brazil. The analysis of genomes containing E484K, downloaded from the GISAID database, showed a distribution of 169 amino acid residues corresponding to nonsynonymous mutations and four amino acid residue deletions in 134 Brazilian samples (Table S1) collected between October and December 2020 (Fig. S1). Regarding synonymous mutations, 217 genome positions had transitions (Fig. 1 and Table S2) and 113 genome positions had associated transversions (Fig. 1 and Table S2). The arising of E484K mutated genomes with at least four different associated mutation patterns for lineages P.1, P.2, and N.9 (B.1.1.33.9) can be seen over time in the aligned genomes (Fig. S1 and Table S1).

Fig. 1

UTR: Untranslated region; Syn: Synonymous substitution; del: deletion; ORF: Open Reading Frame; Nsp: Non-structural protein; S: Spike; E: Envelope; M: Membrane; N: Nucleocapsid.

Histogram of frequent mutations observed in the Brazilian SARS-CoV-2 genomes harboring E484K mutation. Red labels above the bars indicate absolute nucleotide position and the blue labels indicate effects of these mutations in the corresponding proteins. As P.1 has only 19 genomes represented and multiple mutations, only main mutations of concern were highlighted. UTR: Untranslated region; Syn: Synonymous substitution; del: deletion; ORF: Open Reading Frame; Nsp: Non-structural protein; S: Spike; E: Envelope; M: Membrane; N: Nucleocapsid. B.1.1 defining mutations were widespread in almost all sequences as expected (e. g. S:D614G, N:R203K, N:G204R, 5’UTR:C241T, and synonymous substitutions in nucleotide positions C3037T and C14408T) (Fig. 1). As B.1.1.28 is the ancestral lineage of P.1 and P.2, sequences from these three lineages harbor the G25088T (S:V1176F) mutation. Four sequences classified as N.9 carry B.1.1.33 lineage-defining mutations (T27299C (NS6:I33T) and T29148C (N:I292T)), but have also five missense mutations in ORF1ab and one in ORF7b (E33A). This pattern indicates a new lineage derived from B.1.1.33 (posteriorly defined as B.1.1.33.9 or N.9 lineage), which also possesses the E484K replacement, as P.1 and P.2 sequences. A hallmark of these Brazilian lineages is the presence of multiple lineage-defining mutations per lineage (Table 1 ).

Table 1

Lineage/Mutation	N.9 (B.1.1.33 with E484K)	P.1	P.2
5′ UTR	–	–	C100T
ORF1ab	G1264T (Nsp2)C6573T (Nsp3: S2103F)C7600T (Nsp3)C7851T (Nsp3: A2529V)T11078C (Nsp6: F3605L)C19602T (Nsp14: T6446I)G19656T (Nsp15: R6464M)	T733C (Nsp1)C2749T (Nsp3)C3828T (Nsp3: S1188L)A5648C (Nsp3: K1795Q)A6319G (Nsp3)A6613G (Nsp3)del11288–11,296 (Nsp6: 3675–3677 SGF)C12778T (Nsp9)C13860T (Nsp12: T4532I)G17259T (Nsp13: S5665I)	T10667G (Nsp5: L3468V)C11824T (Nsp6)A12964G (Nsp9)
S	G23012A (E484K)	C21614T (L18F)C21621A (T20N)C21638T (P26S)G21974T (D138Y)G22132T (R190S)A22812C (K417T)G23012A (E484K)A23063T (N501Y)C23525T (H655Y)C24642T (T1027I)	G23012A (E484K)
ORF3a	–	T26149C (S253P)	–
E	–	–	–
M	–	–	–
ORF6	–	–	–
ORF7a	–	–	–
ORF7b	A27853C (E33A)	–	–
ORF8	–	G28167A (E92K)insG28262GAACA (ORF8)	C28253T (F120F)
N	–	C28512G (P80R)AGTAGGG28877-83TCTAAAC (RG203KR)	G28628T (A119S)G28975T (M234I)
ORF10	–	–	–
3′ UTR	C29722T	–	C29754T

ORF1ab mutations are represented by its amino acid positions relative to ORF1a (Nsp1-Nsp11) and ORF1b (Nsp12-Nsp16). ins: insertion. The E484K mutation in the lineages is indicated in bold.

Lineage-defining mutations of each of the three Brazilian lineages carrying the E484K mutation. These mutations do not necessarily reflect pangolin lineage assignment defining-mutations, but were extracted based on their representativeness in the majority of sequences of each lineage (https://github.com/cov-lineages/pangolin). ORF1ab mutations are represented by its amino acid positions relative to ORF1a (Nsp1-Nsp11) and ORF1b (Nsp12-Nsp16). ins: insertion. The E484K mutation in the lineages is indicated in bold. The number of genetic changes associated with each E484K Brazilian lineage is highly diverse. N.9 carries a mean of 19.2 (range: 17–22) mutations (considering single nucleotide polymorphisms, insertion and deletion as a single event), while P.1 possesses on average 30.1 changes (range: 24–33). Despite harboring a lower number of mutations (mean: 18.5), P.2 genomes have the highest standard deviation (SD = 2.2) and range (15–25) among these lineages. These data suggest that both P.1 and P.2 have been circulating in Brazil for a longer period and might be fastly evolving. The analysis of the E484K mutated sequence EPI_ISL_832010, early detected in the municipality of Esteio, Rio Grande do Sul, shows a simpler set of nonsynonymous mutations (n = 10) when compared to others. This genome combines all the prevalent substitutions D614G from spike protein, N:R203K, N:G204R, and Nsp12:P323L, allied to other very frequent mutations (S:V1176F, N:A119S, N:M234I, Nsp5:L205V, and Nsp7:L71F). The presence of spike S:D614G, N:R203K, N:G204R, and Nsp12:P323L in all sequenced E484-containing genomes was observed. The last samples associated with the recent public health crisis in the state of Amazonas accumulates a higher number of divergences than other lineages (26 nonsynonymous substitutions and 3 amino acid deletions - Table S1) when compared with the SARS-CoV-2 reference genome (NC_045512.2). Regarding the lineage-defining mutations from P.1 lineages, 14 of a total of 19 genomes from the Amazonas monophyletic group present the spike mutations L18F, T20N, P26S, D138Y, and R190S. In this way, the spike protein of P.1 lineages in these Brazilian E484K mutated genomes is characterized by the presence of a variable number of modified sites, without the fixation of all known mutations. In fact, the P.1 and P2. clades are part of a larger monophyletic group with B.1.1.28, which is separated from the B.1.1.33 and N.9 (B.1.1.33 with E484K) clades and the reference genome. Interestingly, the B.1.1.28 genomes grouped in the clades of P.1 and P.2 lineages according to the presence of the Nsp7:L71F mutation. The monophyletic group formed by the B.1.1.28 and P.2 genomes presents these mutations, while the B.1.1.28 genomes grouped with P.1 sequences do not. These results corroborate the clusterization of these genomes in different lineages (Fig. 2 ).

Fig. 2

Bayesian phylogenetic inference of the 134 Brazilian E484K mutated genomes. Tips were colored by Brazilian state and the reference genome NC_045512.2 is represented in black. The branches were highlighted by lineage: green (B.1.1.33), light green (N.9), beige (B.1.1.28), light red (P.2) and blue (P.1). Mutations occurring in all analyzed genomes for each lineage were described next to the respective nodes. The asterisks indicate the SARS-CoV-2 genomes sequenced by our research group. Lineage-defining mutations as S:K471T, S:N501Y, S:T1027I, N:P80R, Nsp6:S106-107del, Nsp6:F108del, and NS8:E92K were found only in the P.1 group and reported for all 19 P.1 sequences. Other mutations such as S:V1176F were not found only in N.9 (B.1.1.33.9) lineage. There are also those mutations that are not known as lineage markers but were found in all lineages (n = 19) as Nsp3:K977Q, Nsp13:E341D, and NS3:S253P. New specific single substitutions (S:A27V, N:T16M, N:P151L, N:A267V, Nsp13:T216N, and Nsp14:P443S) were also evaluated. Specifically, S:A27V is located in the N-terminal domain. The N.9 (B.1.1.33.9) lineage carrying the E484K mutation was found in São Paulo (n = 3) and Amazonas (n = 1), while P.1 sequences were found only in the Amazonas state (n = 19). All other sequences (n = 111) belong to the P.2 lineage, the most widespread lineage considering sequenced data available so far. The Northeast region was represented by sequences from Alagoas (n = 2), Paraiba (n = 3), and Bahia (n = 1). The North region is represented by Amazonas sequences (n = 6). Southeast sequenced 44 P.2 genomes, 36 from Rio de Janeiro and 8 from São Paulo. Southern Brazil classified 55 genomes as P.2, 5 from Parana and 50 from Rio Grande do Sul, reinforcing its probable dissemination to all regions of Brazil (Fig. 3 ).

Fig. 3

Distribution of genomes harboring E484K mutation across different lineages (A) and Brazilian states (B) from October to December 2020.

Distribution of genomes harboring E484K mutation across different lineages (A) and Brazilian states (B) from October to December 2020. The worldwide emergence of E484K began in March 2020, with three sequences firstly represented. A significant increase was observed in October (n = 86), followed by successive increases in November (n = 366) and December (n = 374) (Fig. 4A). A similar trend was observed in Brazil, as E484K was observed in sequences obtained from October (n = 31), November (n = 87), and December (n = 40). Most importantly, the proportion of genomes carrying this replacement was 39.7%, 43.9%, and 43.5% from October to December in comparison to all genomes sequenced from Brazil (Fig. 4B). We believe this apparent slow-growing pattern in Brazil is due to heterogeneous and limited initiatives of sequencing in the country, and probably this substitution is already widespread through Brazilian states, as its harbored by three distinct and apparently independent evolving lineages.

Fig. 4

Monthly presence of the E484K mutation considering worldwide available data (A) and Brazilian genomes (B). For clarity, the number of genomes in (A) are represented in log10 scale.

Selection analysis

In order to obtain a reliable detection of sites submitted to positive/adaptive or negative/purifying selection, a random set of Brazilian genomes was tested with different approaches. Regarding individual site models, the Bayesian inference FUBAR (Fast, Unconstrained Bayesian AppRoximation) identified eight sites evolving under adaptive selection (positive selection) in the spike sequence (Table 2 ), with calculated synonymous and nonsynonymous average rates of 1.227 and 0.857, respectively. For these residues under adaptive pressure, six are included in known mutation sites of spike protein, including E484K (L5F, S12F, P26S, D138Y, A688V).

Table 2

FUBAR site table for positively selected sites.

Site	ɑ	β	Prob[ɑ < β]
5	1.880	20.275	0.9559
12	1.896	21.520	0.9584
26	1.879	20.481	0.9564
138	2.300	19.109	0.9214
155	2.266	17.643	0.9158
484	3.850	25.932	0.9109
677	2.931	19.933	0.9083
688	1.917	23.771	0.9628

FUBAR inferred the sites submitted to diversifying positive selection with a posterior probability ≥0.9.

FUBAR site table for positively selected sites. FUBAR inferred the sites submitted to diversifying positive selection with a posterior probability ≥0.9. The analysis with the codeml program from the PAML package confirmed all the positively selected sites predicted by FUBAR. The M3 vs. M0 (p < 0.001) comparison indicated a highly variable selective pressure among sites, while the M2 vs. M1 and M8 vs. M8a models comparison (p < 0.05) indicated some sites submitted to positive selection (Table 3, Table 4, Table 5 ). The results for the M3 model indicated a kappa value (ts/tv) of 2.82630 and a site proportion of 93.156% with an estimated omega (ω) value of 0.25424 (negative selection), while 3.536% of the sites were estimated to have ω = 4.90450 (positive selection) (Table 3). Differently from M2 that is compared with M1 to detect sites under positive selection, the M3 model for discrete evolution also indicates sites under a neutral selection (p1, ω1), being more recommended to evaluate variable selective pressures among sites. The proportion of sites and the ω values obtained for positive selection by M2 and M3 models were almost equal (M2: ω = 4.90459). In the case of M7 vs. M8 and M8a vs. M8 comparisons, the rejection of the alternative model (M8) by the absence of a significant likelihood ratio suggests a beta distribution of the ω. When applied to the analysis of the spike protein, the results of M8a, which is a more refined and conservative model, allows us to confirm that some residues of this protein are actually under adaptive selection.

Table 3

PAML (codeml): Parameters estimates and log-likelihood values under models of variable ω ratios among sites.

Model	Parametersa	lnL	Sites showing indications of positive selection
M0	ω = 0.40548	−6617.469118	None
M1	p₀ = 0.64290, p₁ = 0.35710ω₀ = 0, ω₁ = 1	−6607.96566	Not allowed
M2	p₀ = 0.96461, p₁ = 0.00003, p₂ = 0.03536ω₀ = 0.25444, ω₁ = 1, ω₂ = 4.90459	−6604.579083	5, 12, 26, 138, 155, 222, 484, 626, 677, 688, 1263
M3	p₀ = 0.93156, p₁ = 0.03308, p₂ = 0.03536ω₀ = 0.25424, ω₁ = 0.25458, ω₂ = 4.90450	−6604.578953	5, 12, 26, 138, 155, 222, 484, 626, 677, 688, 1263
M7	p = 0.00986, q = 0.01846	−6608.318528	Not allowed
M8	p₀ = 0.91883, p = 0.00675, q = 0.04374, p₁ = 0.08117, ω = 2.79061	−6605.852845	5, 12, 14, 20, 26, 27, 52, 54, 68, 138, 145, 153, 155, 190, 218, 221, 222, 231, 235, 263, 344, 417, 439, 484, 561, 583, 626, 658, 670, 677, 688, 776, 791, 879, 936, 1005, 1065, 1071, 1072, 1076, 1099, 1104, 1118, 1152, 1162, 1238, 1259, 1263, 1264, 1272

ω = dN/dS = average over sites; p0,p1 and p2 indicate the proportions of groups 0, 1 and 2 in each model, respectively; ω0, ω1 and ω2 indicate the ω values of groups 0, 1 and 2 in each model, respectively. p and q are beta parameters.

Table 4

PAML (codeml): Likelihood ratio statistics (2Δ/) for some comparisons between selection models.

Comparison	2Δ/	Probability values (p)
M3 vs. M0	25.78033	<0.001
M2 vs. M1	6.773154	<0.05
M8 vs. M7	4.931366	<0.1
M8 vs. M8a	4.225314	<0.05

The degrees of freedom used comparing models M3 vs. M0, M2 vs. M1, M8 vs. M7 and M8a vs. M8 are 4, 2, 2 and 1, respectively. Probability values ≤ 0.05 are considered as statistically significant.

Table 5

PAML (codeml) site table for positively selected sites.

Site	NEB probabilities		BEB probabilities
Site	Prob (ω > 1)	post mean for ω	Prob (ω > 1)	post mean ± SE for ω
5 L	0.988*	4.850	0.892	2.153 ± 0.516
12 S	0.988*	4.849	0.892	2.153 ± 0.517
26 P	0.989*	4.854	0.894	2.156 ± 0.514
138 D	0.758	3.779	0.741	1.922 ± 0.719
155 S	0.782	3.893	0.749	1.935 ± 0.712
222 A	0.775	3.857	0.746	1.930 ± 0.714
484 E	0.841	4.163	0.770	1.968 ± 0.692
626 A	0.860	4.252	0.778	1.980 ± 0.685
677 Q	0.909	4.483	0.802	2.017 ± 0.659
688 A	0.985*	4.835	0.886	2.145 ± 0.525
1263 P	0.880	4.348	0.787	1.994 ± 0.675

M2 selection model: Naive Empirical Bayes (NEB) analysis. Positively selected sites (*: P > 95%; **: P > 99%). Bayes Empirical Bayes (BEB).

PAML (codeml): Parameters estimates and log-likelihood values under models of variable ω ratios among sites. ω = dN/dS = average over sites; p0,p1 and p2 indicate the proportions of groups 0, 1 and 2 in each model, respectively; ω0, ω1 and ω2 indicate the ω values of groups 0, 1 and 2 in each model, respectively. p and q are beta parameters. PAML (codeml): Likelihood ratio statistics (2Δ/) for some comparisons between selection models. The degrees of freedom used comparing models M3 vs. M0, M2 vs. M1, M8 vs. M7 and M8a vs. M8 are 4, 2, 2 and 1, respectively. Probability values ≤ 0.05 are considered as statistically significant. PAML (codeml) site table for positively selected sites. M2 selection model: Naive Empirical Bayes (NEB) analysis. Positively selected sites (*: P > 95%; **: P > 99%). Bayes Empirical Bayes (BEB). The ω estimative (ω = d /d ) is a common method to detect positive selection, since it assumes that synonymous substitutions are neutral and the nonsynonymous are subject to selection. Consequently, a ω statistically higher than 1 would indicate the action of positive selection or a relaxed selective constraint, whereas low d /d values would mean conservation of the gene product due to purifying selection (Tennessen, 2008). Additionally to the FUBAR results, three sites were also considered with ω > 1 (Table 3, Table 5): 222, 626, and 1263. Of these, only site 222 is not related to the E484K-presenting lineages (A626S and P1263L/S are known). The list of positively selected sites identified by the M8 model is available on Table S4. To avoid the potential missense effects of the function lost by the sequence mutation, the purifying selection acts as a protective model. The FEL (Fixed Effects Likelihood) approach detected 19 sites under negative selection (Table 6 ). Three of them (189, 191, and 564) are near residues presenting known nonsynonymous mutations in the E484K mutated genomes, as S:R190S and S:F565L. The SLAC (Single-Likelihood Ancestor Counting) method identified four sites (55, 856, 943, and 1215), two of which were already predicted by FEL. The prevalence of negative selection across the spike protein is consistent with the low genome-wide mutation rate inferred for SARS-CoV-2 (van Dorp et al., 2020).

Table 6

FEL site table for negatively selected sites.

Site	ɑ	LRT	Prob[ɑ > β]
55	18.379	4.058	0.0440
91	27.071	2.792	0.0947
132	121.449	4.395	0.0360
180	121.798	4.401	0.0359
189	33.258	5.432	0.0198
191	120.557	4.393	0.0361
266	27.216	2.794	0.0946
324	118.682	4.370	0.0366
428	26.763	2.852	0.0913
475	16.429	3.268	0.0707
564	121.798	4.159	0.0414
821	21.710	3.065	0.0800
897	42.768	4.241	0.0395
910	42.680	3.305	0.0691
1120	43.397	3.598	0.0578
1126	26.763	3.495	0.0616
1215	18.327	2.861	0.0907
1228	42.759	4.136	0.0420
1251	42.680	3.305	0.0691

Sites identified under negative selection at p ≤ 0.1. Grey rows: significant sites also identified by SLAC.

FEL site table for negatively selected sites. Sites identified under negative selection at p ≤ 0.1. Grey rows: significant sites also identified by SLAC.

Discussion

The permanence of a pathogen inside a population host depends on its efficiency in key processes such as the replication, the escaping of the immune system and the binding to the cell receptor. Mutations that create an advantageous scenario towards the host response to infection enhance the pathogen fitness under the natural selection pressure. Consequently, the host-pathogen coevolution represents an important mechanism to understand the establishment and the prognostics of pathogenesis, since the infections are possibly the major selective pressure acting on humans (Sironi et al., 2015). Therefore, the circulation of low and moderate pathogens provides time for the pathogen adaptations, in such a way that the modulation of the immunity may possibly promote molecular convergence in different lineages over the time (Longdon et al., 2014). The presence of spike S:D614G, N:R203K, N:G204R, and Nsp12:P323L in all sequenced E484 mutated samples, reaching three different lineages, might suggest that these lineages show increased viral replication. Considering that E484K enhances escape from immune system antibodies, these may potentially lead to a viral advantage. The occurrence of simultaneous mutations as N:R203K and N:G204R is already known in the SARS-CoV-2 literature. However, the fixation of these mutations, as well as Nsp12:P323L and D614G in all the E484K evaluated genomes may indicate a novel adaptive relationship among these modifications resulting in viral evolutionary success. Even as an independent evolutionary event, the potential fixation of mutations such as E484K across lineages may indicate active mechanisms of adaptive selection and are very relevant in planning future therapeutic strategies (for example, newer vaccines and immunotherapy platforms). The deletion of three amino acids in the second helix of the transmembrane protein Nsp6 in P.1 genomes may affect the virus-induced cellular autophagy and the formation of double-membrane vesicles for the viral RNA synthesis (Benvenuto et al., 2020). The important role of subsequent stabilization of the flexible NTD by mutations has been speculated (Laha et al., 2020). Typically, NTD can harbor a larger number of evolutionary events than RBD, including mutations, insertions and deletions that could act allosterically altering the binding affinities between RBD and hACE-2 and inducing immune evasion. It is known that distinct positions may have a linked relationship in the final protein structure and may display some advantage by acting together to achieve increased stability, adaptability, viability and/or transmission efficiency (Laha et al., 2020). The potential causality or influence of the E484K and other substitutions for the effectiveness of neutralizing antibodies that bind the N-terminal domain of the spike protein (Chi et al., 2020) remains uncertain. Concerning the selection analysis, the FUBAR method was the first one to be tested. According to Murrell et al. (2013), the FUBAR method may have more detection power than methods like FEL, in particular when positive selection is present but relatively weak. The analysis of the South African clade V501·V2 (Pond et al., 2020) found some similar results for FUBAR evaluation, with the detection of adaptive selection at the sites 5 (L5F), 12 (S12F), and 484 (E484K). The residues 155 and 677 are not associated with the E484K-presenting lineages, however, recent data found evidence of evolutionary convergence of this mutation in at least six distinct sub-lineages, which could improve proteolytic processing, cell tropism, and transmissibility (Hodcroft et al., 2021). The amino acid residue 677 is next to the furin-like cleavage site (678 to 688). Interestingly, some studies suggest that the emergence of the SARS-CoV-2 in human species resulted from a five amino acid change in the critical S glycoprotein binding site (Fam et al., 2020). To the best of our knowledge, the impact of E484K in different lineages has not been deeply explored. It was structurally demonstrated that, at least in combination with K417N and N501Y, the substitution has profound impact in shifting the main site of contact between viral RBD and hACE-2 residues (Nelson et al., 2021). However, we still do not know if this holds true for other sets of mutations associated with E484K. In summary, we have demonstrated widespread dissemination of mutants harboring E484K replacement in geographically diverse regions in Brazil, as well as the potential fixation of the E484K mutation despite a short time (three months) after its first arising. This substitution was disseminated in our country as early as October 2020, although phylodynamics inferences placed its emergence in July (Voloch et al., 2021). The fact that E484K was found in the context of different mutations and lineages is suggestive that this particular substitution may act as a common solution for viral evolution in different genotypes. This hypothesis may be related to the profound impact of the mutation, which changes a negatively charged amino acid (glutamic acid) for a positively charged amino acid (lysine). Since this position is present in a highly flexible loop, it has been proposed that the presence of such mutation could create a strong ionic interaction between lysine (K) in RBD and amino acid 75 of the hACE-2 receptor shifting the major sites of binding in S1 from positions 497–502 to 484 (Nelson et al., 2021). Mutations in RBD could theoretically decrease the neutralizing serum activity of patients receiving vaccines against SARS-CoV-2 with unknown clinical consequences. Arguably the effect of E484K could be particularly relevant. In a previous publication, this mutation was associated with complete abolishment of all neutralizing activity in a high proportion of convalescent serum tested (Wibmer et al., 2021). Jangra et al. (2021) showed that the E484K affects the binding of serum polyclonal neutralizing antibodies and decreases the neutralization efficiency for serum with low or moderate IgG for the SARS-CoV-2 spike protein. Moreover, in the study of Xie et al. (2021), the combination of the mutations E484K + N501Y + D614G generated lower neutralization titers than the N501Y virus or the virus with three mutations from the UK variant (Δ69/70 + N501Y + D614G). When taken together, this growing body of evidence suggests that E484K should be the target of intense virologic surveillance. Studies testing the activity of serum from vaccinated patients against viruses or pseudoviruses with the aforementioned substitution should be considered a high public health priority. Second generation immune therapies and vaccines focusing on more conserved domains (for instance, in S2 fusion domain) may deserve special attention to assure continuous therapeutic efficacy.

Methods

Sequencing data retrieval

For the graphic evaluation of Brazilian lineages, all genomes available on the GISAID database until January 18th, 2021, and presenting the E484K mutation were included in the analysis. Genomes from Brazil and other countries submitted until January 24th, 2021, and with collected dates between January 1st and December 31st, 2020, were selected. The counting of genomes containing E484K was performed by the specification of the mutation in the search fields. The monthly count was defined by the values between the first and the last day of each month. The comparative genomic analysis of the E484K mutated genomes was performed with all the 134 genomes (Table S1) presenting the E484K mutation for Brazil in GISAID until January 18th, 2021. For these, a genomic multiple sequence alignment was performed on the MAFFT web server (Katoh and Standley, 2013) using default options and 1PAM = k/2 scoring matrix. Single nucleotide polymorphisms (SNPs) and insertions/deletions (INDELs) were assessed by using snippy variant calling pipeline v4.6.0 (https://github.com/tseemann/snippy). The mapping of the aligned sequences to the reference genome (NC_045512.2) features was created by the software GENEIOUS 2021.0.3 (https://www.geneious.com). Histogram of mutations was generated using a modified code from Lu et al., 2020 (https://github.com/laduplessis/SARS-CoV-2_Guangdong_genomic_epidemiology/).

Phylogenetic analysis

The previously aligned genomic sequences were used as input for the phylogenetic analysis. The reference genome NC_045512.2 was added as an outgroup and 16 other sequences from B.1.1.28 and B.1.1.33 lineages were added to analyze the phylogeny patterns with the E484K-containing genomes. The inference of the best evolutionary model was performed by ModelTest-NG (Darriba et al., 2020), which identified GTR + I (Generalised time-reversible model + proportion of invariant sites). The tree was built by Bayesian inference in MrBayes v3.2.7 (Huelsenbeck and Ronquist, 2001) with the GTR + I model and 5 million generations. The data convergence was evaluated with Tracer v1.7.1 (Rambaut et al., 2018) and tree visualization was created by FigTree software (http://tree.bio.ed.ac.uk/software/figtree/). In total, 780 SARS-CoV-2 genomes, collected from Brazil samples between February 1st and December 31st, 2020, available on GISAID, were used for this analysis. After an initial filtering step of undefined nucleotides in the region of spike protein, with deletion of genomes with a N ratio > 0.01, 589 genomes were selected. The second step of subsampling, with Augur toolkit (Huddleston et al., 2021) kept 498 genomes (approximately 119 representative genomes per lineage) for the multiple sequence alignment (Table S3). The genomic multiple sequence alignment was performed by MAFFT (Katoh and Standley, 2013) with default options and 1PAM = k/2 scoring matrix. The selection of the genome location for the spike protein was performed with the software UGENE (Okonechnikov et al., 2012), using the ORF coordinates of the NC_045512 reference genome as parameter. GTR + I was inferred by ModelTest-NG (Darriba et al., 2020) as the best evolutionary model for the spike sequences. ModelTest-NG also indicated identical sequences, which were removed resulting in 161 sequences. The Bayesian phylogenetic inference was performed by MrBayes v3.2.7 using those 161 unique spike selected sequences and 5 million generations. The data convergence was evaluated with Tracer v1.7.1. The analysis of sites under positive and negative selection was performed by HyPhy v2.5.23 (Pond et al., 2005) according to different approaches: (i) FUBAR (Unconstrained Bayesian AppRoximation) (Murrell et al., 2013), (ii) FEL (Fixed Effects Likelihood) (Kosakovsky Pond and Frost, 2005), and (iii) SLAC (Single-Likelihood Ancestor Counting) (Kosakovsky Pond and Frost, 2005). Finally, the PAML (Phylogenetic Analysis by Maximum Likelihood) (Yang, 2007) program was used for confirmation of possibly selected sites with the codeml package. All models were run using the F3x4 option in the PAML program, where expected codon frequencies were based upon nucleotide frequencies occurring at the three codon positions. The one-ratio model (M0) assumes one ω ratio for all sites. The neutral model (M1) presupposes a proportion p0 of conserved sites with ω0 = 0 and p1 = 1 - p0 of neutral sites with ω1 = 1, as would occur if almost all non-synonymous substitutions were either deleterious or neutral. The positive selection model (M2) adds an additional class of sites with frequency p2 = 1 - p0 - p1 and ω2 is estimated from the data. In the discrete model (M3), the probabilities (p0, p1 and p2) of each site which was submitted to purifying selection, neutral selection and positive selection, respectively, and their corresponding ω ratios (ω0, ω1, ω2) are inferred from the data. The beta model (M7) is a null test for positive selection, assuming a beta distribution with ω between 0 and 1. By last, the beta & ω (M8) model adds one extra class with the same ratio ω1. The LRTs (Likelihood Ratio Tests) were tested to investigate whether ω was significantly different from 1 for each pairwise comparison: M1a vs. M2a, M0 vs. M3, and M7 vs. M8. LRT performs the comparison both with the constraint of ω = 1 and without such constraint: LRT = 2 (ln1−ln2). It follows a chi-square distribution, where the number of degrees of freedom is equal to the number of additional parameters in the more complex model. For rejection of the null hypothesis of neutrality, we considered p ≤ 0.05. Finally, we applied the Naive Empirical Bayes (NEB) and Bayes Empirical Bayes (BEB) approaches available on the PAML package to calculate the posterior probability that each site belongs to the positively selected class. The following are the supplementary data related to this article.

Table S1

Brazilian sequenced genomes containing the E484K mutation and GISAID list of acknowledgements.

Table S2

Single Nucleotide Polymorphisms identified in the multiple genomes comparison.

Table S3

Brazilian sequenced genomes used for selection analysis (499 sequences) - GISAID list of acknowledgements.

Table S4

Positively selected sites identified by PAML with the M8 selection model. Supplementary material

Funding

Scholarships and Fellowships were supplied by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil () – Finance Code 001 and Universidade Federal de Ciências da Saúde de Porto Alegre. The funders had no role in the study design, data generation and analysis, decision to publish or the preparation of the manuscript.

Author contributions

Patrícia A. G. Ferrareze: Methodology, Software, Validation, Formal analysis, Investigation, Data Curation, Visualization, Writing - Original Draft, Writing - Review & Editing. Vinicius B. Franceschi: Formal analysis, Investigation, Visualization, Writing - Original Draft, Writing - Review & Editing. Amanda M. Mayer: Writing - Original Draft, Writing - Review & Editing. Gabriel D. Caldana: Writing - Original Draft, Writing - Review & Editing. Ricardo A. Zimerman: Conceptualization, Formal analysis, Investigation, Writing - Original Draft, Writing - Review & Editing. Claudia E. Thompson: Conceptualization, Methodology, Formal analysis, Investigation, Resources, Writing - Original Draft, Writing - Review & Editing, Supervision, Project administration. All authors have read and approved the manuscript.

Declaration of Competing Interest

The authors declare no competing interests.

43 in total

1. HyPhy: hypothesis testing using phylogenies.

Authors: Sergei L Kosakovsky Pond; Simon D W Frost; Spencer V Muse
Journal: Bioinformatics Date: 2004-10-27 Impact factor: 6.937

2. Posterior Summarization in Bayesian Phylogenetics Using Tracer 1.7.

Authors: Andrew Rambaut; Alexei J Drummond; Dong Xie; Guy Baele; Marc A Suchard
Journal: Syst Biol Date: 2018-09-01 Impact factor: 15.683

3. Neutralization of SARS-CoV-2 spike 69/70 deletion, E484K and N501Y variants by BNT162b2 vaccine-elicited sera.

Authors: Xuping Xie; Yang Liu; Jianying Liu; Xianwen Zhang; Jing Zou; Camila R Fontes-Garfias; Hongjie Xia; Kena A Swanson; Mark Cutler; David Cooper; Vineet D Menachery; Scott C Weaver; Philip R Dormitzer; Pei-Yong Shi
Journal: Nat Med Date: 2021-02-08 Impact factor: 53.440

4. Emergence of genomic diversity and recurrent mutations in SARS-CoV-2.

Authors: Lucy van Dorp; Mislav Acman; Damien Richard; Liam P Shaw; Charlotte E Ford; Louise Ormond; Christopher J Owen; Juanita Pang; Cedric C S Tan; Florencia A T Boshier; Arturo Torres Ortiz; François Balloux
Journal: Infect Genet Evol Date: 2020-05-05 Impact factor: 3.342

5. Evaluating the Effects of SARS-CoV-2 Spike Mutation D614G on Transmissibility and Pathogenicity.

Authors: Erik Volz; Verity Hill; John T McCrone; Anna Price; David Jorgensen; Áine O'Toole; Joel Southgate; Robert Johnson; Ben Jackson; Fabricia F Nascimento; Sara M Rey; Samuel M Nicholls; Rachel M Colquhoun; Ana da Silva Filipe; James Shepherd; David J Pascall; Rajiv Shah; Natasha Jesudason; Kathy Li; Ruth Jarrett; Nicole Pacchiarini; Matthew Bull; Lily Geidelberg; Igor Siveroni; Ian Goodfellow; Nicholas J Loman; Oliver G Pybus; David L Robertson; Emma C Thomson; Andrew Rambaut; Thomas R Connor
Journal: Cell Date: 2020-11-19 Impact factor: 41.582

6. Escape from neutralizing antibodies by SARS-CoV-2 spike protein variants.

Authors: Yiska Weisblum; Fabian Schmidt; Fengwen Zhang; Justin DaSilva; Daniel Poston; Julio Cc Lorenzi; Frauke Muecksch; Magdalena Rutkowska; Hans-Heinrich Hoffmann; Eleftherios Michailidis; Christian Gaebler; Marianna Agudelo; Alice Cho; Zijun Wang; Anna Gazumyan; Melissa Cipolla; Larry Luchsinger; Christopher D Hillyer; Marina Caskey; Davide F Robbiani; Charles M Rice; Michel C Nussenzweig; Theodora Hatziioannou; Paul D Bieniasz
Journal: Elife Date: 2020-10-28 Impact factor: 8.140

7. Emergence in late 2020 of multiple lineages of SARS-CoV-2 Spike protein variants affecting amino acid position 677.

Authors: Emma B Hodcroft; Daryl B Domman; Daniel J Snyder; Kasopefoluwa Y Oguntuyo; Maarten Van Diest; Kenneth H Densmore; Kurt C Schwalm; Jon Femling; Jennifer L Carroll; Rona S Scott; Martha M Whyte; Michael W Edwards; Noah C Hull; Christopher G Kevil; John A Vanchiere; Benhur Lee; Darrell L Dinwiddie; Vaughn S Cooper; Jeremy P Kamil
Journal: medRxiv Date: 2021-02-21

8. Modifiable lifestyle factors and severe COVID-19 risk: a Mendelian randomisation study.

Authors: Shuai Li; Xinyang Hua
Journal: BMC Med Genomics Date: 2021-02-03 Impact factor: 3.063

9. SARS-CoV-2 genomic variations associated with mortality rate of COVID-19.

Authors: Yujiro Toyoshima; Kensaku Nemoto; Saki Matsumoto; Yusuke Nakamura; Kazuma Kiyotani
Journal: J Hum Genet Date: 2020-07-22 Impact factor: 3.172

Review 10. Receptor-binding domain-specific human neutralizing monoclonal antibodies against SARS-CoV and SARS-CoV-2.

Authors: Fei Yu; Rong Xiang; Xiaoqian Deng; Lili Wang; Zhengsen Yu; Shijun Tian; Ruiying Liang; Yanbai Li; Tianlei Ying; Shibo Jiang
Journal: Signal Transduct Target Ther Date: 2020-09-22

20 in total

1. Phylogenomics and population genomics of SARS-CoV-2 in Mexico during the pre-vaccination stage reveals variants of interest B.1.1.28.4 and B.1.1.222 or B.1.1.519 and the nucleocapsid mutation S194L associated with symptoms.

Authors: Francisco Barona-Gómez; Luis Delaye; Erik Díaz-Valenzuela; Fabien Plisson; Arely Cruz-Pérez; Mauricio Díaz-Sánchez; Christian A García-Sepúlveda; Alejandro Sanchez-Flores; Rafael Pérez-Abreu; Francisco J Valencia-Valdespino; Natali Vega-Magaña; José Francisco Muñoz-Valle; Octavio Patricio García-González; Sofía Bernal-Silva; Andreu Comas-García; Angélica Cibrián-Jaramillo
Journal: Microb Genom Date: 2021-11

Review 2. Evolutionary Traits and Genomic Surveillance of SARS-CoV-2 in South America.

Authors: Pablo A Ortiz-Pineda; Carlos H Sierra-Torres
Journal: Glob Health Epidemiol Genom Date: 2022-05-18

3. Genomic epidemiology of SARS-CoV-2 in Esteio, Rio Grande do Sul, Brazil.

Authors: Vinícius Bonetti Franceschi; Gabriel Dickin Caldana; Amanda de Menezes Mayer; Gabriela Bettella Cybis; Carla Andretta Moreira Neves; Patrícia Aline Gröhs Ferrareze; Meriane Demoliner; Paula Rodrigues de Almeida; Juliana Schons Gularte; Alana Witt Hansen; Matheus Nunes Weber; Juliane Deise Fleck; Ricardo Ariel Zimerman; Lívia Kmetzsch; Fernando Rosado Spilki; Claudia Elizabeth Thompson
Journal: BMC Genomics Date: 2021-05-20 Impact factor: 3.969

Review 4. Neutralising antibody escape of SARS-CoV-2 spike protein: Risk assessment for antibody-based Covid-19 therapeutics and vaccines.

Authors: Daniele Focosi; Fabrizio Maggi
Journal: Rev Med Virol Date: 2021-03-16 Impact factor: 11.043

5. Estimation of Secondary Household Attack Rates for Emergent Spike L452R Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) Variants Detected by Genomic Surveillance at a Community-Based Testing Site in San Francisco.

Authors: James Peng; Jamin Liu; Sabrina A Mann; Anthea M Mitchell; Matthew T Laurie; Sara Sunshine; Genay Pilarowski; Patrick Ayscue; Amy Kistler; Manu Vanaerschot; Lucy M Li; Aaron McGeever; Eric D Chow; Carina Marquez; Robert Nakamura; Luis Rubio; Gabriel Chamie; Diane Jones; Jon Jacobo; Susana Rojas; Susy Rojas; Valerie Tulier-Laiwa; Douglas Black; Jackie Martinez; Jamie Naso; Joshua Schwab; Maya Petersen; Diane Havlir; Joseph DeRisi
Journal: Clin Infect Dis Date: 2022-01-07 Impact factor: 9.079

Review 6. SARS-CoV-2 Variants: Mutations and Effective Changes.

Authors: Gene Park; Byeong Hee Hwang
Journal: Biotechnol Bioprocess Eng Date: 2021-12-28 Impact factor: 2.836

7. Limited spread of a rare spike E484K-harboring SARS-CoV-2 in Marseille, France.

Authors: Philippe Colson; Jacques Fantini; Nouara Yahi; Jeremy Delerce; Anthony Levasseur; Pierre-Edouard Fournier; Jean-Christophe Lagier; Didier Raoult; Bernard La Scola
Journal: Arch Virol Date: 2022-01-27 Impact factor: 2.685

8. Insights into the mutation T1117I in the spike and the lineage B.1.1.389 of SARS-CoV-2 circulating in Costa Rica.

Authors: Jose Arturo Molina-Mora
Journal: Gene Rep Date: 2022-02-08

Review 9. The biological and clinical significance of emerging SARS-CoV-2 variants.

Authors: Kaiming Tao; Philip L Tzou; Janin Nouhin; Ravindra K Gupta; Tulio de Oliveira; Sergei L Kosakovsky Pond; Daniela Fera; Robert W Shafer
Journal: Nat Rev Genet Date: 2021-09-17 Impact factor: 53.242

10. Reduced levels of convalescent neutralizing antibodies against SARS-CoV-2 B.1+L249S+E484K lineage.

Authors: Diego A Álvarez-Díaz; Katherine Laiton-Donato; Orlando Alfredo Torres-García; Hector Alejandro Ruiz-Moreno; Carlos Franco-Muñoz; Maria Angie Beltran; Marcela Mercado-Reyes; Miguel Germán Rueda; Ana Luisa Muñoz
Journal: Virus Res Date: 2021-11-12 Impact factor: 3.303