Analysis of 6.4 million SARS-CoV-2 genomes identifies mutations associated with fitness.

Fritz Obermeyer^1,2, Martin Jankowiak^1,2, Nikolaos Barkas^1, Stephen F Schaffner^1,3,4, Jesse D Pyle^1, Lonya Yurkovetskiy^5, Matteo Bosso^5, Daniel J Park^1, Mehrtash Babadi^1, Bronwyn L MacInnis^1,4,6, Jeremy Luban^1,5,6,7, Pardis C Sabeti^1,3,4,6,8, Jacob E Lemieux^1,9.

Abstract

Repeated emergence of SARS-CoV-2 variants with increased fitness necessitates rapid detection and characterization of new lineages. To address this need, we developed PyR0, a hierarchical Bayesian multinomial logistic regression model that infers relative prevalence of all viral lineages across geographic regions, detects lineages increasing in prevalence, and identifies mutations relevant to fitness. Applying PyR0 to all publicly available SARS-CoV-2 genomes, we identify numerous substitutions that increase fitness, including previously identified spike mutations and many non-spike mutations within the nucleocapsid and nonstructural proteins. PyR0 forecasts growth of new lineages from their mutational profile, identifies viral lineages of concern as they emerge, and prioritizes mutations of biological and public health concern for functional characterization.

One Sentence Summary: A Bayesian hierarchical model of all SARS-CoV-2 viral genomes predicts lineage fitness and identifies associated mutations.

Year: 2022 PMID: 35194619 PMCID: PMC8863165 DOI: 10.1101/2021.09.07.21263228

Source DB: PubMed Journal: medRxiv

The SARS-CoV-2 pandemic has been characterized by repeated waves of cases driven by the emergence of new lineages with higher fitness, where fitness encompasses any trait that affects the lineage’s growth, including its basic reproduction number (R0), ability to evade existing immunity, and generation time. Rapidly identifying such lineages as they emerge and accurately forecasting their dynamics is critical for guiding outbreak response. Doing so effectively would benefit from the ability to interrogate the entirety of the global SARS-CoV-2 genomic dataset. The large size (currently over 7.5 million virus genomes) and geographic and temporal variability of the available data present significant challenges that will only become greater as more viruses are sequenced. Current phylogenetic approaches are computationally inefficient on datasets with more than ~5000 samples and take days to run at that scale. Ad hoc methods to estimate the relative fitness of particular SARS-CoV-2 lineages are a computationally efficient alternative (1–3), but have typically relied on models in which one or two lineages of interest are compared to all others and do not capture the complex dynamics of multiple co-circulating lineages. Furthermore, estimates of relative fitness based on lineage frequency data alone (2–4) do not take advantage of additional statistical power that can be gained from analyzing the independent appearance and growth of the same mutation in multiple lineages. Performing a mutation-based analysis of lineage prevalence has the additional advantage of identifying specific genetic determinants of a lineage’s phenotype, which is critically important both for understanding the biology of transmission and pathogenesis and for predicting the phenotype of new lineages. 
The SARS-CoV-2 pandemic has already been dominated by several genetic changes of functional and epidemiological importance, including the spike (S) D614G mutation that is associated with higher SARS-CoV-2 loads (5, 6). In addition, mutations found in Variants of Concern (VoC), such as S:N439R, S:N501Y, and S:E484K, have been linked, respectively, to increased transmissibility (7), enhanced binding to ACE2 (8), and antibody escape (9, 10). Despite these successes, identifying functionally important mutations in the context of a large background of genetic variants of little or no phenotypic consequence remains challenging. We set out to formulate a principled approach to modeling the relative fitness of SARS-CoV-2 lineages, estimating their growth as a linear combination of the effects of individual mutations. We developed PyR0, a hierarchical Bayesian regression model that enables scalable analysis of the complete set of publicly available SARS-CoV-2 genomes, and that could be applied to any viral genomic dataset and to other phenotypes. The model, which is summarized in Figure 1A and described in detail in the supplemental note, avoids the complexity of full phylogenetic inference by first clustering genomes by genetic similarity (refining PANGO lineages (11)), and then estimating the incremental effect on growth rate of each of the most common amino acid changes on the lineages in which they appear. By regressing growth rate as a function of genome sequence, the model shares statistical strength among genetically similar lineages without explicitly relying on phylogeny. 
By modeling only the multinomial proportion of different lineages rather than the absolute number of samples for each lineage (4, 12), and by doing so within 14-day intervals in 1560 globally distributed geographic regions, the model achieves robustness to a number of sources of bias that affect all lineages, across regions, and over time, including differences in data collection and changes in transmission due to such factors as social behavior, public health policy, and vaccination.
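The two ingredients described above, a growth rate that is a sum of per-mutation effects and a multinomial model of lineage proportions, can be sketched in a few lines. This is only an illustration under simplifying assumptions: a single region, no hierarchy, no priors, and point values instead of posteriors. The function names, cluster names, and all coefficient values are hypothetical and are not taken from the PyR0 code or results.

```python
import math

def cluster_growth_rates(features, effects):
    """Relative growth rate of each cluster as the sum of the effects
    of the mutations it carries (linear-additive assumption)."""
    return {c: sum(effects[m] for m in muts) for c, muts in features.items()}

def proportions(init_logits, rates, t):
    """Multinomial (softmax) lineage proportions at time step t,
    with logits growing linearly in time."""
    clusters = sorted(rates)
    logits = [init_logits[c] + rates[c] * t for c in clusters]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return {c: w / total for c, w in zip(clusters, weights)}

# Hypothetical per-mutation effects on growth rate (per time step).
effects = {"S:N501Y": 0.3, "S:E484K": 0.2}
# Hypothetical clusters and the mutations each one carries.
features = {
    "cluster_a": set(),
    "cluster_b": {"S:N501Y"},
    "cluster_c": {"S:N501Y", "S:E484K"},
}
rates = cluster_growth_rates(features, effects)

# Hypothetical starting proportions: 98%, 1%, 1%.
init_logits = {
    "cluster_a": math.log(0.98),
    "cluster_b": math.log(0.01),
    "cluster_c": math.log(0.01),
}
p_start = proportions(init_logits, rates, 0)
p_later = proportions(init_logits, rates, 30)
```

Because only proportions enter the softmax, any factor that scales all lineages equally in a region and time bin (for example, a change in sampling intensity) cancels out, which is the robustness property the text refers to.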

Figure 1.

A. Overview of the PyR0 analysis pipeline. After clustering UShER’s mutation annotated tree, sequence data are used to construct spatio-temporal lineage prevalence counts y_tpc and amino acid substitution covariates X_cf. Pyro is used to fit a Bayesian multivariate logistic multinomial regression model to y_tpc and X_cf.

B. Relative fitness versus date of lineage emergence. Circle size is proportional to cumulative case count inferred from lineage proportion estimates and confirmed case counts. Inset table lists the 10 fittest lineages inferred by the model. R/R_A is the fold increase in relative fitness over the Wuhan (A) lineage, assuming a fixed generation time of 5.5 days.

We fit PyR0 to 6,466,300 SARS-CoV-2 genomes available on GISAID (13, 14) as of January 20, 2022, in a model that contained 1544 PANGO lineages and 2904 nonsynonymous mutations. The output of the model is a posterior distribution for the relative fitness (exponential growth rate) of each lineage and for the contribution to the fitness from each mutation. Fitting this large model is computationally challenging, so we used stochastic variational inference, an approximate inference method that reduced our task to solving a 75-million-dimensional optimization problem on a GPU. Inference was implemented in the Pyro (15) probabilistic programming framework (see Supplemental Materials). The trained model can be used to infer lineage fitness, predict the fitness of completely new lineages, forecast future lineage proportions, and estimate the effects of individual mutations on fitness. The model’s lineage fitness estimates (Figure 1B) show a modest upward trend over time among all lineages, accompanied by numerous lineages with dramatically higher fitness. Sensitivity analyses revealed broad consistency of fitness estimates across spatial data subsets (Figure S1). The upward trend may in part reflect an upward bias caused by the lineage assignment process, as can be seen in simulation studies (Figure S2), but the high tail of the distribution exhibits elevated fitness values far in excess of this trend. The rate of increase in fitness was not constant between the emergence of the virus into human populations in late 2019 and early 2022. Rather, periods of rapid evolution in fitness occurred and heralded new waves of increase in case counts (Figure 1B and Figure 2CDE). The model correctly inferred BA.2 to have the highest fitness to date, 8.9-fold (95% CI, 8.6–9.2) higher than the original A lineage (Figure 1B inset). Similar fitness was estimated for other Omicron sub-lineages BA.1 and BA.1.1 (Figure 1B). 
These fitness estimates, obtained in mid-January 2022, predict B.1.1.529 and sublineages (collectively called Omicron in the WHO classification) will continue to displace other lineages, including the previously dominant Delta (Figure S3). While PANGO lineages facilitate communication by providing a stable nomenclature, we observed some PANGO lineages with multiple successive peaks in some regions, which could not be accounted for by a multivariate logistic growth model. We therefore algorithmically refined the 1544 PANGO lineages into 3000 finer clusters, and found our model identified significant heterogeneity within some PANGO lineages (Figure S4). Notably, B.1.1 displayed the greatest variability among lineages, followed by B.1.
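A short derivation may help show why multiple successive peaks cannot arise under a multinomial logistic growth model. It assumes each lineage i has a constant intercept and a constant growth rate within a region; the symbols \alpha_i, \beta_i and \bar\beta are notation introduced here for illustration and are not taken from the source.

```latex
\begin{align}
p_i(t) &= \frac{\exp(\alpha_i + \beta_i t)}{\sum_j \exp(\alpha_j + \beta_j t)}, \\
\frac{d}{dt}\log p_i(t) &= \beta_i - \bar\beta(t),
  \qquad \bar\beta(t) = \sum_j p_j(t)\,\beta_j, \\
\frac{d^2}{dt^2}\log p_i(t) &= -\frac{d\bar\beta}{dt}
  = -\sum_j p_j(t)\,\bigl(\beta_j - \bar\beta(t)\bigr)^2 \le 0 .
\end{align}
```

Since \log p_i(t) is concave in t, each lineage's proportion can rise and then fall at most once. A lineage showing two or more successive peaks in one region is therefore better treated as a mixture of sub-lineages with different growth rates, which is consistent with splitting PANGO lineages into finer clusters.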

Figure 2.

A. Infectivity relative to WT of lentiviral vectors pseudotyped with the indicated Spike mutants. Target cells were HEK293T cells expressing ACE2 and TMPRSS2 transgenes. The genetic background of the Spike was Wuhan-Hu-1 bearing D614G. Red bars were significantly different from WT (adjusted p values shown). Black bars were not significantly different from WT.

B. For the 1701 SARS-CoV-2 clusters with at least one amino acid substitution in the RBD, we compare: i) the PyR0 prediction for the contribution to Δ log R from RBD substitutions only; to ii) antibody binding computed using the antibody-escape calculator in (17). The escape calculator is based on an intuitive non-linear model parameterized using deep mutational scanning data for 33 neutralizing antibodies elicited by SARS-CoV-2. PyR0 predictions exhibit high (Spearman) correlation with predictions from Greaney et al.

C-E. We dissect PyR0 Δ log R estimates into S-gene (C), RBD (D), and non-S-gene (E) contributions for 3000 SARS-CoV-2 clusters (blue dots). The horizontal axis corresponds to the date at which each cluster first emerged. Red squares denote the median Δ log R within each monthly bin. The increased importance of S-gene mutations (notably in the RBD) over non-S-gene mutations starting around November 2021 is apparent.
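For readers unfamiliar with the rank correlation used in panel B, the following is a minimal, dependency-free sketch of Spearman's coefficient (the Pearson correlation of the ranks). The input values are made up for illustration and are not data from the study.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-cluster values: RBD contribution to delta log R
# and an escape-calculator score for the same clusters.
rbd_delta_log_r = [0.01, 0.05, 0.02, 0.10]
escape_score = [0.1, 0.4, 0.3, 0.9]
rho = spearman(rbd_delta_log_r, escape_score)
```

Spearman correlation depends only on the ordering of the values, so it is a natural choice when one quantity comes from a non-linear model, as the escape calculator does.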

We found that the model would have provided early warning of the rise of VoCs had it been routinely applied to SARS-CoV-2 samples, highlighting the benefit of timely publication of genomic data. For example, PyR0 would have forecast the coming dominance of B.1.1.7 in late November 2020 (Figure S5A), while the first models forecasting its rapid rise were published in mid-December 2020 (16). Similar predictions would have been available for BA.1 by early December 2021 (Figure S5B, S6) and for AY.4 by May 2021 (Figure S5C). Likewise, the elevated fitness of BA.2 was identified by mid-December 2021 on the basis of 76 observed sequences (Figure S6). While variant-specific models were accurate and useful (2) in predicting the rise of these lineages, each modeling effort was specific to a particular lineage and geographic region; by contrast, PyR0’s global approach provides similar early detection while also offering automated, rapid, and unbiased consideration of all variants and lineages, together with ranking based on relative fitness. When we tested the model’s predictive ability (Figure S5), we found that forecasts were reliable for 1–2 months into the future, after which they tended to be disrupted by the emergence of a completely new strain (Table S1, Figure S7). Remarkably, the accuracy of forecasts stabilized typically within two weeks after the emergence of a new competitive lineage in a region (Figure S7). By basing fitness estimates on the contributions of individual mutations, PyR0 can forecast the fitness of novel or hypothetical lineages using their mutational profiles alone. This is possible with SARS-CoV-2 because of the high rate of convergent evolution (Table 1, Figure S8), which allows the model to infer the fitness of new constellations of mutations based on the trajectories of other lineages in which they have previously emerged. 
This predictive capability is highly desirable from a public health standpoint because forecasts are available as soon as sequences from new lineages appear. To test the reliability of this kind of estimate, we fit leave-one-out estimators on subsets of the dataset with entire PANGO lineages removed (Figure S9). These estimators showed excellent agreement with estimators based on the observed behavior of the lineages, and they were also more accurate than naive phylogenetic estimators that assume the fitness of each new strain is equal to its parent lineage’s fitness (Pearson’s ρ = 0.983, after correcting for parent fitness, Figure S9). These results demonstrate the feasibility of this kind of estimate using the simplest possible linear-additive model, and provide a foundation for future research for more complex modeling that includes effects such as epistasis between mutations and migration across regions.
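The comparison between mutation-based and naive parent-based estimates can be illustrated with a toy calculation. The sketch below assumes that "correcting for parent fitness" means correlating predicted and observed deviations from the parent lineage's fitness; that reading, the variable names, and all numbers are assumptions made for illustration, not the authors' evaluation code or data.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical held-out lineages, each as a tuple of
# (parent log fitness, mutation-based prediction, observed log fitness).
held_out = [
    (1.00, 1.10, 1.12),
    (1.00, 1.30, 1.27),
    (1.50, 1.55, 1.56),
    (1.50, 1.90, 1.85),
]

# Deviation from the parent, as predicted from mutations and as observed.
predicted_gain = [pred - parent for parent, pred, _ in held_out]
observed_gain = [obs - parent for parent, _, obs in held_out]
rho = pearson(predicted_gain, observed_gain)

# Mean absolute error of the naive estimator (child equals parent)
# versus the mutation-based estimator.
naive_error = sum(abs(obs - parent) for parent, _, obs in held_out) / len(held_out)
model_error = sum(abs(obs - pred) for _, pred, obs in held_out) / len(held_out)
```

Subtracting the parent's fitness before correlating matters because parent and child fitness are strongly correlated to begin with; without the correction, even the naive estimator would appear to predict well.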

Table 1:

Amino acid substitutions most significantly associated with increased fitness. Significance is defined as posterior mean / posterior standard deviation. Fitness is per 5.5 days (estimated generation time of the Wuhan (A) lineage (1, 19)). Final column: number of PANGO lineages in which each substitution emerged independently.

Rank	Gene	Substitution	Fold Increase in Fitness	Number of Lineages
1	S	H655Y	1.051	33
2	S	T95I	1.046	30
3	ORF1a	P3395H	1.039	5
4	S	N764K	1.040	6
5	ORF1a	K856R	1.039	2
6	S	S371L	1.041	3
7	E	T9I	1.040	5
8	S	Q954H	1.040	5
9	ORF9b	P10S	1.039	25
10	S	L981F	1.040	2
11	N	P13L	1.040	25
12	S	G339D	1.039	4
13	S	S375F	1.040	5
14	S	S477N	1.039	47
15	S	N679K	1.040	11
16	S	S373P	1.040	5
17	M	Q19E	1.039	5
18	S	D796Y	1.038	11
19	S	N969K	1.040	5
20	S	T547K	1.038	3
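Because the model is additive in log fitness, the fold increases in Table 1 combine multiplicatively. The sketch below works through this arithmetic for the two top-ranked substitutions, using the fold values from the table. It assumes the linear-additive model holds exactly (no epistasis), and the variable names are illustrative.

```python
import math

# Fold increase in fitness per generation (5.5 days), from Table 1.
fold_increase = {"S:H655Y": 1.051, "S:T95I": 1.046}

# Additive effects on log fitness (delta log R) for each substitution.
delta_log_r = {m: math.log(f) for m, f in fold_increase.items()}

# A hypothetical lineage carrying both substitutions: log effects add,
# so fold increases multiply.
combined_fold = math.exp(sum(delta_log_r.values()))
```

Under these assumptions the pair gives roughly a 1.10-fold increase per generation. Individual effects of a few percent can therefore accumulate to large lineage-level differences when many substitutions co-occur.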

Unbiased, genome-wide estimates of the effect of SARS-CoV-2 mutations on fitness also provide a powerful tool for better understanding the biology of fitness. Our model allowed us to estimate the contribution of 2904 amino acid substitutions (Figure 3A, Table 1) to lineage fitness and to rank them by inferred statistical significance (Figure S10). Cross-validation confirmed that these results replicate across different geographic regions (Figure S11). The highest concentrations of fitness-associated mutations were found in the S, N, and the ORF1 polyprotein genes (ORF1a and ORF1b, Figures 3A–B, S12–S13). Using spatial autocorrelation as a measure of spatial structure, we found evidence of functional hotspots in the S, N, ORF7a, ORF3a, and ORF1a genes (Table S2). Within S, there were three hotspots of fitness-enhancing mutations, each within a defined functional region: the N-terminal domain, the receptor-binding domain (RBD), and the furin-cleavage site (Figure 3B). We assessed mutational enrichment in the top-ranked set of mutations and identified an enrichment for lysine to asparagine mutations in the S gene (Figure S14C). We visualized top-scoring mutations within atomic structures for the spike protein (Figures 3D–E), the nucleocapsid’s N-terminal domain (Figure 3F), the polymerase (Figure S15), and two proteases (Figure S16). Many of the top mutations in the S gene occurred in the RBD, making direct contacts with the ACE2 receptor, including K417N/T and E484K (Figures 3D–E). Two top-ranked mutations, T478K and S477N, occur in a flexible loop adjacent to the S-ACE2 interface (Figure 3E), suggesting that these mutations may affect the kinetics of receptor engagement and possibly viral entry. Other mutations occurred in regions proximal to essential enzymatic active sites of the viral replication (Figure S15) or protein processing (Figure S16) machinery.
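Ranking by statistical significance, defined in the Table 1 caption as posterior mean divided by posterior standard deviation, differs from ranking by effect size. A minimal sketch with made-up posterior summaries shows the difference; the mutation names and numbers are hypothetical.

```python
# Hypothetical posterior summaries: mutation -> (mean, standard deviation)
# of its effect on log fitness.
posterior = {
    "mut_a": (0.050, 0.005),
    "mut_b": (0.080, 0.040),
    "mut_c": (0.030, 0.002),
}

def rank_by_significance(summaries):
    """Order mutations by posterior mean / posterior standard deviation,
    most significant first."""
    return sorted(
        summaries,
        key=lambda m: summaries[m][0] / summaries[m][1],
        reverse=True,
    )

def rank_by_effect(summaries):
    """Order mutations by posterior mean alone, largest first."""
    return sorted(summaries, key=lambda m: summaries[m][0], reverse=True)

by_significance = rank_by_significance(posterior)
by_effect = rank_by_effect(posterior)
```

Here `mut_b` has the largest estimated effect but ranks last by significance because its estimate is the most uncertain. This is why the fold-increase column of Table 1 is not strictly decreasing with rank.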

Figure 3.

Manhattan plot of amino acid changes assessed in this study. In each of A-C the y axis shows effect size Δ log R, the estimated change in log relative fitness due to each amino acid change. The bottom three axes show the background density of all observed amino acid changes, the density of those associated with growth (weighted by |Δ log R|), and the ratio of the two. The top 55 amino acid changes are labeled. See Figure S13 for detailed views of S, N, ORF1a, and ORF1b.

A. Changes across the entire genome.

B. Changes in the first 850 amino acids of S.

C. Changes in the first 250 amino acids of N.

D. Structure of the spike-ACE2 complex (PDB: 7KNB). Spike subunits colored light blue, light orange, and gray. Top-ranked mutations are shown as red spheres. ACE2 is shown in magenta.

E. Close-up view of the RBD interface.

F. Top-ranked mutations in the N-terminal RNA-binding domain of N. Residues 44–180 of N (PDB: 7ACT) are shown in light blue. Amino acid positions corresponding to top mutations in this region are shown as red spheres. A 10-nt bound RNA is shown in gray.

We tested several of the high-scoring mutations in single-cycle infectivity assays as done previously (6), focusing on the RBD (Figure 2A). We found that while some individual mutations increased infectivity, on average high-scoring RBD mutations did not promote infectivity per se. We considered the alternative possibility that the fitness of Spike mutations is driven by immune escape. Using RBD-aggregated mutations as a proxy for immune escape, we found that the fitness effect of these Spike mutations correlates well with antibody escape estimates from Greaney et al. (17) (Figure 2B). Together with the observed jump in fitness beginning in late 2021 (Figure 2C) associated with Spike mutations, but not mutations elsewhere in the genome (Figure 2E), these results suggest that immune escape is currently the dominant driver of fitness increases. In contrast to mutations in Spike, those in the serine-arginine rich region of N were linked to increased efficiency of SARS-CoV-2 genomic RNA packaging (18). Within ORF1, we found fitness-associated mutations across all viral enzymes, and clusters within additional non-structural proteins (nsps). The highest concentration of fitness-associated mutations is found in nsp4, nsp6, and nsp12–14 (Figure S12B, S13C–D), suggesting unexplored function at those sites. For example, nsp4 and nsp6 have roles in assembly of replication compartments, and substitutions in these regions may influence the kinetics of replication (see Supplemental Note 3). We note that while convergent evolution makes it possible to identify candidate functional mutations, observational data alone is insufficient to declare mutations as causal rather than merely correlated. For this reason, hits identified by our study require functional follow-up, and can be prioritized by our uncertainty-ranked list of important mutations. In summary, PyR0 provides an unbiased, automated approach for detecting viral lineages with increased fitness. 
By combining a model-based assessment of lineage fitness with absolute case counts, our model provides a global picture of the events of the first two years of the pandemic. Because it assesses the contribution of individual mutations and aggregates across all lineages and geographic regions, it can identify mutations and gene regions that likely increase fitness, and it can predict the relative fitness of new lineages based solely on viral sequence. Applied to the full set of publicly available SARS-CoV-2 genomes, it provides a principled, unbiased analysis of the mutations driving increased fitness of the virus, identifying experimentally established driver mutations in S and highlighting the key role of non-S mutations, particularly in N, ORF1b, and ORF1a, which have received relatively less research attention. By jointly estimating lineage and mutational fitness from millions of viral sequences across thousands of regions, PyR0 shares statistical strength across regions and mutations to yield mechanistic insight into viral fitness and enhance public health by forecasting lineage dynamics.

34 in total

1. Sequence co-evolution gives 3D contacts and structures of protein complexes.

Authors: Thomas A Hopf; Charlotta P I Schärfe; João P G L M Rodrigues; Anna G Green; Oliver Kohlbacher; Chris Sander; Alexandre M J J Bonvin; Debora S Marks
Journal: Elife Date: 2014-09-25 Impact factor: 8.140

2. Severe acute respiratory syndrome coronavirus nonstructural protein 2 interacts with a host protein complex involved in mitochondrial biogenesis and intracellular signaling.

Authors: Cromwell T Cornillez-Ty; Lujian Liao; John R Yates; Peter Kuhn; Michael J Buchmeier
Journal: J Virol Date: 2009-07-29 Impact factor: 5.103

3. Functional screen reveals SARS coronavirus nonstructural protein nsp14 as a novel cap N7 methyltransferase.

Authors: Yu Chen; Hui Cai; Ji'an Pan; Nian Xiang; Po Tien; Tero Ahola; Deyin Guo
Journal: Proc Natl Acad Sci U S A Date: 2009-02-10 Impact factor: 11.205

4. Structure of M^pro from SARS-CoV-2 and discovery of its inhibitors.

Authors: Zhenming Jin; Xiaoyu Du; Yechun Xu; Yongqiang Deng; Meiqin Liu; Yao Zhao; Bing Zhang; Xiaofeng Li; Leike Zhang; Chao Peng; Yinkai Duan; Jing Yu; Lin Wang; Kailin Yang; Fengjiang Liu; Rendi Jiang; Xinglou Yang; Tian You; Xiaoce Liu; Xiuna Yang; Fang Bai; Hong Liu; Xiang Liu; Luke W Guddat; Wenqing Xu; Gengfu Xiao; Chengfeng Qin; Zhengli Shi; Hualiang Jiang; Zihe Rao; Haitao Yang
Journal: Nature Date: 2020-04-09 Impact factor: 49.962

5. Deep Mutational Scanning of SARS-CoV-2 Receptor Binding Domain Reveals Constraints on Folding and ACE2 Binding.

Authors: Tyler N Starr; Allison J Greaney; Sarah K Hilton; Daniel Ellis; Katharine H D Crawford; Adam S Dingens; Mary Jane Navarro; John E Bowen; M Alejandra Tortorici; Alexandra C Walls; Neil P King; David Veesler; Jesse D Bloom
Journal: Cell Date: 2020-08-11 Impact factor: 41.582

6. T-CoV: a comprehensive portal of HLA-peptide interactions affected by SARS-CoV-2 mutations.

Authors: Stepan Nersisyan; Anton Zhiyanov; Maxim Shkurnikov; Alexander Tonevitsky
Journal: Nucleic Acids Res Date: 2022-01-07 Impact factor: 16.971

7. Increased transmissibility and global spread of SARS-CoV-2 variants of concern as at June 2021.

Authors: Finlay Campbell; Brett Archer; Henry Laurenson-Schafer; Yuka Jinnai; Franck Konings; Neale Batra; Boris Pavlin; Katelijn Vandemaele; Maria D Van Kerkhove; Thibaut Jombart; Oliver Morgan; Olivier le Polain de Waroux
Journal: Euro Surveill Date: 2021-06

8. Severe acute respiratory syndrome coronavirus nonstructural proteins 3, 4, and 6 induce double-membrane vesicles.

Authors: Megan M Angelini; Marzieh Akhlaghpour; Benjamin W Neuman; Michael J Buchmeier
Journal: mBio Date: 2013-08-13 Impact factor: 7.867

9. Complete Mapping of Mutations to the SARS-CoV-2 Spike Receptor-Binding Domain that Escape Antibody Recognition.

Authors: Allison J Greaney; Tyler N Starr; Pavlo Gilchuk; Seth J Zost; Elad Binshtein; Andrea N Loes; Sarah K Hilton; John Huddleston; Rachel Eguia; Katharine H D Crawford; Adam S Dingens; Rachel S Nargi; Rachel E Sutton; Naveenchandra Suryadevara; Paul W Rothlauf; Zhuoming Liu; Sean P J Whelan; Robert H Carnahan; James E Crowe; Jesse D Bloom
Journal: Cell Host Microbe Date: 2020-11-19 Impact factor: 31.316

10. Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus.

Authors: Bette Korber; Will M Fischer; Sandrasegaram Gnanakaran; Hyejin Yoon; James Theiler; Werner Abfalterer; Nick Hengartner; Elena E Giorgi; Tanmoy Bhattacharya; Brian Foley; Kathryn M Hastie; Matthew D Parker; David G Partridge; Cariad M Evans; Timothy M Freeman; Thushan I de Silva; Charlene McDanal; Lautaro G Perez; Haili Tang; Alex Moon-Walker; Sean P Whelan; Celia C LaBranche; Erica O Saphire; David C Montefiori
Journal: Cell Date: 2020-07-03 Impact factor: 66.850