Recurrent SARS-CoV-2 mutations in immunodeficient patients.

S A J Wilkinson¹, Alex Richter², Anna Casey¹, Husam Osman³, Jeremy D Mirza¹, Joanne Stockton¹, Josh Quick¹, Liz Ratcliffe³, Natalie Sparks¹, Nicola Cumley¹, Radoslaw Poplawski¹, Samuel N Nicholls¹, Beatrix Kele⁴, Kathryn Harris⁴, Thomas P Peacock⁵, Nicholas J Loman¹.

Abstract

Long-term severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infections in immunodeficient patients are an important source of variation for the virus but are understudied. Many case studies have been published which describe one or a small number of long-term infected individuals but no study has combined these sequences into a cohesive dataset. This work aims to rectify this and study the genomics of this patient group through a combination of literature searches as well as identifying new case series directly from the COVID-19 Genomics UK (COG-UK) dataset. The spike gene receptor-binding domain and N-terminal domain (NTD) were identified as mutation hotspots. Numerous mutations associated with variants of concern were observed to emerge recurrently. Additionally a mutation in the envelope gene, T30I was determined to be the second most frequent recurrently occurring mutation arising in persistent infections. A high proportion of recurrent mutations in immunodeficient individuals are associated with ACE2 affinity, immune escape, or viral packaging optimisation. There is an apparent selective pressure for mutations that aid cell-cell transmission within the host or persistence which are often different from mutations that aid inter-host transmission, although the fact that multiple recurrent de novo mutations are considered defining for variants of concern strongly indicates that this potential source of novel variants should not be discounted.

Keywords: SARS-CoV-2; convergent evolution; genomics; immunodeficiency; persistent infection; variant emergence

Year: 2022 PMID: 35996593 PMCID: PMC9384748 DOI: 10.1093/ve/veac050

Source DB: PubMed Journal: Virus Evol ISSN: 2057-1577

Introduction

Long-term SARS-CoV-2 infections in immunodeficient patients are important but understudied (Moran et al. 2021). Evolution of viruses during long-term infection is an important source of novel variation and is thought to be a key influence on the evolutionary dynamics of SARS-CoV-2 generally, and on the emergence of new variants specifically. Notably, Alpha and Omicron, which were responsible for recent epidemic waves globally, are hypothesised by some to have arisen during long-term infections (Rambaut et al. 2020; Msomi et al. 2021). The Alpha variant (B.1.1.7) emerged abruptly with a constellation of novel mutations and a long branch length from its nearest common ancestor in the B.1.1 clade, during a time of extremely high surveillance in the UK (Rambaut et al. 2020). A likely explanation is that the Alpha variant evolved within a single long-term host over a long period before emerging back into the general population. Evolution during long-term infection has been associated with the rapid accumulation of many mutations within a short period (Avanzato et al. 2020; Choi et al. 2020; Baang et al. 2021; Jensen et al. 2021; Karim et al. 2021; Peacock et al. 2021; Riddell et al. 2022). The Beta (B.1.351), Gamma (P.1), and Omicron (B.1.1.529) variants all emerged in similar circumstances to Alpha, potentially suggesting that they also emerged from long-term infections. To better understand evolutionary pressures associated with viral evolution during long-term infections, a dataset composed of 168 SARS-CoV-2 genomes was compiled to examine the frequency of recurrent mutations. These genomes were associated with twenty-eight patients with a range of conditions that result in immunodeficiency significant enough to prevent rapid viral clearance. This builds upon previous work performing a similar analysis using case studies that included a total of ten patients (Peacock et al. 2021). This analysis expands on that work in three ways: it uses a significantly larger dataset, which increases the power; many of the included cases are Alpha variant infections, which have not previously been discussed in the context of long-term SARS-CoV-2 cases and which potentially give insight into future variant emergence; and all genome series were analysed using a single analysis pipeline.

Methods

Dataset assembly

Patient-associated genome series were selected for inclusion via a literature search for case studies using the following search terms and filters: after 2019, ‘SARS-CoV-2’, ‘nCoV-2019’, ‘Immunodeficient’, ‘Immunocompromised’, ‘long-term’. All searches took place between 1 August 2021 and 30 November 2021. Other genome series were extracted from the COG-UK dataset, a UK-wide genomic surveillance repository (COVID-19 Genomics UK (COG-UK) 2020; Nicholls et al. 2021). Genome series were only included if they met the following criteria: at least two genomes were available, either on public databases or via a request; there was evidence of long-term viral infection for a period of no less than 28 days (some genome series covered a shorter period but the clinical information met this criterion); and the clinical information available was sufficient to indicate the nature of the patient’s immune deficiency. For all genome series included in the dataset, a Civet report (O’Toole et al. 2021a) was generated using Civet v3.0. These reports confirm that all genomes were the result of long-term infections rather than a superinfection or independent infection events, by virtue of individual genomes sharing a recent common ancestor with a step-wise accumulation of mutations over time. A single genome from patient 11 was excluded due to a probable superinfection as described by Tarhini et al. (2021). Figures were generated for each Civet phylogeny using ggtree (Yu et al. 2018) and are included within the supplementary material. Genomes included in the dataset were obtained from the following studies: (Choi et al. 2020; Avanzato et al. 2020; Reuken et al. 2021; Tarhini et al. 2021; Kemp et al. 2021; Baang et al. 2021; Stanevich et al. 2021; Khatamzas et al. 2021; Borges et al. 2021; Riddell et al. 2022; Ciuffreda et al. 2021; Jensen et al. 2021; Weigang et al. 2021). A full description of the dataset is available within the supplementary material of this article.
When a genome series was selected for inclusion, all of its genomes were placed within an individual multi-FASTA file, each with a header identifying the patient via an identifier (‘pt-1’, ‘pt-2’, etc.) and the number of days passed since the initial genome available within that genome series (the day 0 genome). In several cases this genome was collected after a lengthy period of active infection, but only the time period covered by the genome series was considered in the analysis.

Mutation calling of genomes

Mutation calling was automated with an R script adapted from Mercatelli et al. (2021), which utilises Nucleotide MUMmer (NUCmer) (Marçais et al. 2018) for genome alignment to an annotated SARS-CoV-2 reference sequence (Wu et al. 2020) and defines single nucleotide polymorphisms (SNPs), insertions, deletions, frameshifts, and inversions relative to this reference sequence (NCBI accession NC_045512.2). One change was made to the annotations of the reference, in the case of the ORF1ab polyprotein gene non-structural protein 12 (NSP12), where the position was adjusted by a single nucleotide so that, for simplicity, all mutation calls would be relative to the reading frame after the ribosomal frameshift. Zero mutations were detected in the pre-ribosomal frameshift region of NSP12; therefore, no mutations were incorrectly annotated as a result.

De novo mutation cumulative occurrence analysis pipeline

Processing of the mutation calls was performed with a Python script (https://github.com/BioWilko/recurrent-sars-cov-2-mutations/blob/main/mutation_call_analysis.py) to investigate de novo mutations (DNMs). A DNM was defined as a mutation observed within a genome series that was not present at day 0 of that genome series. It should be noted that a subset of the mutations present at day 0 could have arisen in the chronic patients prior to the first sequence being obtained and would therefore not be included in this analysis. DNMs which reverted to the day 0 base were still counted as a DNM occurrence within a genome series, since they did indeed occur. A recurrent mutation was defined as a DNM which was observed to occur within more than one genome series. A cumulative count of each observed DNM was performed for each day between 0 and the maximum genome series length (218 days). When a deletion was observed, all deletions with a reference position within eighteen nucleotides of the reference position of the initial deletion, regardless of length or position, were clustered as a single region. Ambiguous nucleotides were not considered in mutation calling. The resultant dataframe was finally formatted with an R script and figures were generated using ggplot2 (Wickham 2016).
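The definitions above can be illustrated with a minimal sketch. This is not the authors' script (which is available at the repository linked above); the input layout, the function names, and the example mutation labels and positions are all hypothetical and chosen only to illustrate the logic. In particular, the sketch assumes the 'initial deletion' of a cluster is the one with the lowest reference position, which the text does not specify.

```python
from collections import Counter

# Hypothetical input: patient id -> {sampling day: set of mutation labels called
# against the reference}. Day 0 is the first available genome of each series.
series = {
    "pt-1": {0: {"S:N501Y"}, 30: {"S:N501Y", "S:E484K"}, 60: {"S:N501Y"}},
    "pt-2": {0: set(), 45: {"S:E484K", "E:T30I"}},
    "pt-3": {0: {"E:T30I"}, 20: {"E:T30I", "M:H125Y"}},
}

def de_novo_mutations(timepoints):
    """Return {mutation: first day observed} for mutations absent at day 0.

    A mutation that later reverts is still recorded, matching the text.
    """
    baseline = timepoints[0]
    first_seen = {}
    for day in sorted(timepoints):
        for mutation in timepoints[day] - baseline:
            first_seen.setdefault(mutation, day)
    return first_seen

def recurrent_dnms(all_series):
    """Count the series in which each DNM arose; keep DNMs seen in more than one."""
    counts = Counter()
    for timepoints in all_series.values():
        counts.update(de_novo_mutations(timepoints).keys())
    return {mutation: n for mutation, n in counts.items() if n > 1}

def cumulative_count(all_series, mutation, max_day):
    """Number of series in which `mutation` had arisen by each day 0..max_day."""
    first_days = [
        dnms[mutation]
        for dnms in map(de_novo_mutations, all_series.values())
        if mutation in dnms
    ]
    return [sum(1 for d in first_days if d <= day) for day in range(max_day + 1)]

def cluster_deletions(positions, window=18):
    """Group deletion positions lying within `window` nt of a cluster's first position.

    Assumption: the cluster anchor is the lowest reference position.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][0] <= window:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters
```

With the toy input, `recurrent_dnms(series)` keeps only S:E484K, because E:T30I is already present at day 0 in pt-3 and so arises de novo in a single series. The eighteen-nucleotide window corresponds to the six amino acid window quoted in the figure captions.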

Results

The SARS-CoV-2 spike gene (S) demonstrated the greatest number of recurrent mutations in the dataset (Figs 1 and 2), with ten recurrent substitutions: S:S13I, S:T95I, S:G142V, S:L452R, S:E484K, S:E484G, S:F486I, S:F490L, S:Q493K, and S:Q498R. The domain in which the highest number of recurrent DNMs was observed was the RBD with seven, followed by the NTD with five and the signal peptide (SP) with one, for a total of thirteen. Clustering mutations by AA loci additionally revealed the following sites as notable: S:484, S:501, S:330, and S:440. The domain with the highest number of AA loci with DNMs was the RBD with nine, followed by the NTD with five, and the SP with one. The most frequently occurring DNM was S:E484K with eight occurrences; when all DNMs at the S:484 locus are clustered (Fig. 2), the number of occurrences increases to twelve, clearly demonstrating an enrichment of DNMs at this locus. The DNMs at the locus S:484 consist of eight S:E484K, two S:E484G, and one each of S:E484Q and S:E484A. AA loci clustering highlighted the loci S:330, S:440, and S:501 as recurrent for DNMs (≥ two occurrences in the period).

Figure 2.

Cumulative occurrences of non-synonymous recurrent de novo mutations in S-gene, divided by gene domain, in 168 genomes obtained from twenty-eight patients. Substitution mutations were clustered by amino acid locus; this is notated with the International Union of Pure and Applied Chemistry (IUPAC) ambiguity code X to indicate any possible amino acid, and lines for cumulative sites are dashed for easier differentiation. Only loci that were notable when clustered (significant difference with non-clustered equivalent or loci not highlighted without clustering) were included in the figure. Mutations were observed in the following domains: NTD, receptor-binding domain (RBD), and the SP (Xia 2021). Deletions (Δ) were clustered within a window of six amino acids (AA) regardless of length or position of deletion; full details of the breakdown can be found at https://github.com/BioWilko/recurrent-sars-cov-2-mutations/blob/main/dataset/mutation_calls.csv. The first genome from each patient was considered to be day 0. The sampling periods and frequencies within the dataset were highly variable: 218 days was the longest time period covered within the dataset but the majority were much shorter. The full details of the dataset are available in Supplementary Table S1. All recurrent de novo mutations were labelled on the graph.

Figure 1.

Distribution of de novo mutations included in this study across the entire SARS-CoV-2 genome. Schematic of SARS-CoV-2 genome with relevant ORFs annotated. DNMs with the highest frequency are annotated by amino acid position and substitution; X indicates that multiple amino acids form DNMs at this position.

The only recurrent deletions observed in the dataset were located within the NTD of S-gene: S:Δ67 region (recurrent deletion region 1/RDR1), S:Δ138 region (RDR2), and S:Δ243 region (RDR4) (McCarthy et al. 2021). S:Δ138 region was the most frequent with four occurrences, followed by S:Δ67 region and S:Δ243 region with two occurrences each. Deletions within the S:Δ67 region consisted of one S:Δ67 and one S:Δ69–70. The unconventional annotation is the result of the algorithm utilised to cluster deletions; the genome series in which S:Δ67 occurred already possessed S:Δ69 in its day 0 genome.

S-gene constitutes just over one-eighth of the overall SARS-CoV-2 genome by length; despite this, ∼34 per cent (79/234) of the total DNM occurrences were observed within S-gene, as well as 59 per cent (13/22) of the recurrent DNMs.

Non-spike, non-ORF1ab SARS-CoV-2 genes demonstrated a lower number of DNM occurrences (Figs 1 and 3). Three mutations within Matrix (M) and Envelope (E) were notable in their frequency (≥ 2 occurrences in the period): E:T30I and M:H125Y. E:T30I was the only recurrent DNM observed within E-gene and, at six occurrences, the second most frequent DNM revealed by the analysis overall. E:T30I occurrences were not observed to be associated with any particular source study, geographical region, or SARS-CoV-2 lineage, suggesting this may be a sensitive marker for persistent infection. Within M-gene, M:H125Y was the only recurrent DNM, with four occurrences.

When DNMs observed in these genes were clustered by AA loci, the findings remained almost entirely unchanged other than in the case of the locus M:2, which was raised to three DNM occurrences by day 218 rather than the two presented in Fig. 3.

Figure 3.

Cumulative occurrences of non-synonymous recurrent DNMs in genes other than S or ORF1ab, subdivided by gene, in 168 genomes obtained from 28 patients. Recurrent DNMs were observed in the E (encodes envelope protein) and M (encodes membrane glycoprotein) genes; the full details of the gene definitions used are available from Wu et al. (2020). The first genome from each patient was considered to be day 0. The sampling periods and frequencies within the dataset were highly variable: 218 days was the longest time period covered within the dataset but the majority were much shorter. The full details of the dataset are available in Supplementary Table S1. All recurrent DNMs were labelled on-graph.

ORF1ab polyprotein genes, which encode many NSPs within SARS-CoV-2, demonstrated a larger number of recurrent mutations than the other non-spike genes but still far fewer than spike (Fig. 4). Six DNMs were notable for their occurrence frequency: NSP3:T504P, NSP3:T820I, NSP3:P822L, NSP3:K977Q, NSP4:T295I, and NSP12:V792I. ORF1ab contained 86 out of the 195 DNMs observed, but only six of the twenty-one recurrent DNMs. ORF1ab constitutes more than two-thirds of the overall SARS-CoV-2 genome by length, making the number of overall DNMs within the polyprotein disproportionately lower than would be expected if the distribution were random.
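The comparison between a gene's share of the genome and its share of DNMs can be reproduced with a short calculation. The sketch below is illustrative only: the coordinates are those of the ORF1ab and S features in the NC_045512.2 reference annotation (an input worth checking against the record), the DNM counts and their denominators are copied as quoted in the text, and the function name is hypothetical.

```python
# Reference genome length and gene coordinates (1-based, inclusive), taken from
# the NC_045512.2 annotation. Treat these as inputs to verify, not as results.
GENOME_LENGTH = 29903
REGIONS = {"ORF1ab": (266, 21555), "S": (21563, 25384)}

# (DNMs observed in the gene, total DNMs) exactly as quoted in the text.
OBSERVED = {"S": (79, 234), "ORF1ab": (86, 195)}

def length_share(start, end, genome_length=GENOME_LENGTH):
    """Fraction of the genome covered by a feature spanning start..end."""
    return (end - start + 1) / genome_length

for gene, (start, end) in REGIONS.items():
    hits, total = OBSERVED[gene]
    expected = length_share(start, end)   # share expected if DNMs were uniform
    observed = hits / total               # share actually observed
    print(f"{gene}: {expected:.1%} of genome, {observed:.1%} of DNMs")
```

Under a uniform distribution of mutations along the genome, the two shares for each gene would be roughly equal; spike sits well above its length share and ORF1ab well below it.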

Figure 4.

Cumulative occurrences of non-synonymous recurrent DNMs in ORF1ab polyprotein, subdivided by gene, in 168 genomes obtained from 28 patients. The first genome from each patient was considered to be day 0. The sampling periods and frequencies within the dataset were highly variable: 218 days was the longest time period covered within the dataset but the majority were much shorter. The full details of the dataset are available in Supplementary Table S1. All recurrent DNMs were labelled on-graph.

When DNMs observed within ORF1ab were clustered by AA loci, the overall shape of the results remained broadly identical with two exceptions: NSP3:T504 and NSP3:P822, whose day 218 occurrences were raised to 3 and 4, respectively.

The relative frequencies for each recurrent mutation observed in the DNM occurrence analysis were compared to their prevalence within the COG-UK dataset (on 23 November 2021) (Table 1). As in the initial analysis, S:E484K, E:T30I, and M:H125Y are noteworthy in their frequency, especially compared to their low frequency in the larger COG-UK dataset.
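As a worked illustration of how the two percentage columns of Table 1 relate to the raw counts, the sketch below recomputes them for the three mutations highlighted above. The counts and the COG-UK denominator are copied from Table 1 and its footnote; the variable and function names are hypothetical.

```python
N_SERIES = 28          # genome series (patients) in this study
N_COG_UK = 1_576_942   # COG-UK genomes, from the Table 1 footnote

# mutation -> (genome series with the DNM, COG-UK genomes with the mutation)
TABLE1_COUNTS = {
    "S:E484K": (8, 3437),
    "E:T30I": (6, 208),
    "M:H125Y": (4, 2188),
}

def percentages(series_hits, cog_hits):
    """Return (% of genome series with the DNM, % of COG-UK genomes with it)."""
    return 100 * series_hits / N_SERIES, 100 * cog_hits / N_COG_UK

for mutation, (series_hits, cog_hits) in TABLE1_COUNTS.items():
    in_series, in_cog = percentages(series_hits, cog_hits)
    print(f"{mutation}: {in_series:.2f}% of series vs {in_cog:.4f}% of COG-UK")
```

The contrast between the two columns is the point of the comparison: a mutation arising in a large share of long-term infections while remaining rare in the background dataset is a candidate for within-host selection.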

Table 1.

DNM annotation	Frequency in DNM occurrence analysis	Frequency in COG-UK dataset	Percentage of genome series in which DNM occurred	Percentage of genomes in COG-UK with DNM
S:E484K	8	3,437	28.57%	0.2180%
E:T30I	6	208	21.43%	0.0132%
M:H125Y	4	2,188	14.29%	0.1387%
S:Δ138 region	4	283,289	14.29%	17.9645%
NSP4:T295I	3	1,933	10.71%	0.1226%
S:Q493K	3	59	10.71%	0.0037%
S:Δ67 region	2	292,969	7.14%	18.5783%
S:S13I	2	211	7.14%	0.0134%
NSP12:V792I	2	10	7.14%	0.0006%
NSP3:P822L	2	28,410	7.14%	1.8016%
NSP3:T820I	2	442	7.14%	0.0280%
NSP3:T504P	2	18	7.14%	0.0011%
S:L452R	2	1,010,866	7.14%	64.1029%
S:Q498R	2	225	7.14%	0.0143%
S:E484G	2	46	7.14%	0.0029%
S:Δ243 region	2	546	7.14%	0.0346%
S:F486I	2	6	7.14%	0.0004%
S:G142V	2	1,361	7.14%	0.0863%
S:T95I	2	682,286	7.14%	43.2664%
NSP3:K977Q	2	391	7.14%	0.0248%
S:F490L	2	463	7.14%	0.0294%

DNM occurrence frequencies for all recurrent DNMs in this analysis and the COG-UK dataset (n = 1,576,942). COG-UK dataset figures were generated using the dataset as it existed on 7 December 2021. Data were generated via CLIMB-COVID (Nicholls et al. 2021). The COG-UK dataset was used as a background dataset due to the quality of metadata available as well as programmatic access to variant information through existing CLIMB-COVID tools.

Each observed recurrent DNM was compared to the UKHSA VOC/VUI definition files (Table 2). S:E484K was the most frequent DNM to appear in VOC/VUI definitions with eleven appearances, then S:L452R with four, then S:T95I and S:Δ138/RDR2 region with three each, followed by NSP3:K977Q, NSP3:P822L, S:Q498R, S:Δ67/RDR1 region, and S:Δ243/RDR4 region with one each. Of the twenty-one recurrent DNMs observed in the analysis, nine are considered defining mutations for a VOC/VUI.

Table 2.

Recurrent mutations which are variant defining based upon United Kingdom Health Security Agency (UKHSA) variant definitions. Variant definitions were parsed from the UKHSA variant definition files available at: https://github.com/phe-genomics/variant_definitions. Lineages were called using pangolin (O’Toole et al. 2021b).

Mutation annotation	Pango lineage	UKHSA label	WHO label
NSP3:K977Q	P.1	VOC-21JAN-02	Gamma
NSP3:P822L	AV.1	VUI-21MAY-01	n/a
S:E484K	B.1.351	VOC-20DEC-02	Beta
S:E484K	B.1.525	VUI-21FEB-03	Eta
S:E484K	P.1	VOC-21JAN-02	Gamma
S:E484K	A.23.1	VUI-21FEB-01	n/a
S:E484K	AV.1	VUI-21MAY-01	n/a
S:E484K	B.1.1.318	VUI-21FEB-04	n/a
S:E484K	B.1.1.7 (with E484K)	VOC-21FEB-02	n/a
S:E484K	B.1.324.1	VUI-21MAR-01	n/a
S:E484K	P.3	VUI-21MAR-02	Theta
S:E484K	P.2	VUI-21JAN-01	Zeta
S:E484K	B.1.621	VUI-21JUL-01	n/a
S:L452R	B.1.617.2	VOC-21APR-02	Delta
S:L452R	B.1.617.1	VUI-21APR-01	Kappa
S:L452R	B.1.617.3	VUI-21APR-03	n/a
S:L452R	C.36.3	VUI-21MAY-02	n/a
S:Q498R	BA.1	VOC-21NOV-01	Omicron
S:T95I	AV.1	VUI-21MAY-01	n/a
S:T95I	B.1.1.318	VUI-21FEB-04	n/a
S:T95I	B.1.621	VUI-21JUL-01	n/a
S:Δ67 region/RDR1	B.1.1.7	VOC-20DEC-01	Alpha
S:Δ138 region/RDR2	B.1.1.7	VOC-20DEC-01	Alpha
S:Δ138 region/RDR2	AV.1	VUI-21MAY-01	n/a
S:Δ138 region/RDR2	B.1.1.318	VUI-21FEB-04	n/a
S:Δ243 region/RDR4	C.37	VUI-21JUN-01	Lambda

Discussion

Not all mutations are discussed in detail. While a literature search was performed for every recurrent DNM, only those with sufficient literature available for discussion to be informative are included below.

S-gene—RBD recurrent mutations

The frequency of RBD DNMs observed in this analysis is a significant finding; the RBD is a relatively small region of the SARS-CoV-2 genome, making up less than 2 per cent of the genome by length, but RBD mutations account for 17 per cent of all DNMs observed (Fig. 1). It is clear that RBD mutations were the most strongly selected for in the immunocompromised patients included within the dataset.

Figure 1.

The sharp rise of S:E484K occurrences early in the period is biased due to the data from Jensen et al. (2021) as a result of their sampling strategy and research focus. Jensen et al. (2021) specifically discussed the emergence of S:E484K in long-term immunocompromised patients and published short periods of surveillance of these cases when the patients in question had significantly longer shedding periods to demonstrate this. However, even if this study is excluded S:E484K remains the most frequently occurring DNM within spike. The high frequency of the S:E484K occurrences is suggestive of a strong selective pressure; this is further demonstrated by the total of twelve DNMs observed at the S:484 locus. The two occurrences of S:E484G in the dataset also suggest that the glycine substitution is subject to differing selection pressures than the lysine substitution in S:E484K although this may be host dependent. In one of the two occurrences of S:E484G this change was transient and was replaced by S:E484K. There are two possible explanations for this observation: a secondary mutation or both mutations occurred within the patient and the S:E484K subpopulation outcompeted the S:E484G population to become dominant. There is no single nucleotide change by which a G -> K AA change might occur, supporting the second possibility. If the second explanation is correct it would suggest that S:484 mutations are selected for generally. The large difference between the frequency of S:E484K in this dataset compared to the national COG-UK dataset further suggests that the selection pressures which caused S:E484K to be so frequent within this analysis are not true of the majority of hosts (Table 1). S:E484K is also considered a defining mutation for a large number of variants, further indicating a strong selection pressure for the mutation (Table 2). 
Despite its presence within a large number of variants, S:E484K is only present within a small proportion of the COG-UK dataset, suggesting that on a population level it may have a deleterious effect on transmission, although this may also be explained by other factors such as variants with S:E484K not being common in the UK generally. A strong selective pressure for S:E484K was also observed by Zahradník et al. (2021), who discovered, using an in vitro experimental evolution model, that >70 per cent of clones in one library gained S:E484K and S:N501Y, which were associated with a significant increase in ACE2 affinity. Furthermore, they observed the occurrence of the mutation S:Q498R alongside S:N501Y in two repeats; this combination was observed to lead to significantly greater affinity to ACE2 compared to both wild-type and Alpha, which rose further alongside S:E484K. This combination was only observed within a single patient (patient 19), although the combination E484G, Q498R, and N501Y did arise in a further patient (patient 17); in both cases the infections were Alpha and therefore already possessed S:N501Y. At the time of this publication that constellation of mutations had not been observed in wild virus, but with the emergence of Omicron this combination has become significantly more frequent (albeit with E484A rather than E484K). The low occurrence frequency of S:N501Y compared to that observed by Zahradník et al. (2021) is also notable but is partly explained by its high day 0 frequency in the genome series (nine out of twenty-eight), due to the large number of long-term Alpha infections included in this study. When DNMs were clustered by AA locus, however, S:501 was highlighted as recurrent.

Another notable observation is the two de novo occurrences of S:L452R (a defining mutation of the Delta, Kappa, and Epsilon variants), which aids both immune evasion and ACE2 affinity (Motozono et al. 2021). S:Q493K has previously been identified by Huang et al. (2021) as a highly beneficial adaptation to a mouse host, improving spike binding affinity to murine ACE2; its rarity in the overall SARS-CoV-2 population (58 in COG-UK dataset) suggests that it is not strongly selected for in a human host generally. The three occurrences in this dataset may suggest that S:Q493K does confer a benefit to the virus within the context of a long-term infection but not in transient infection. A highly similar mutation, S:Q493R, is a defining mutation of the Omicron variant. S:F486I has been observed to decrease the affinity of some neutralising antibodies to spike protein (Xu et al. 2021), and may decrease the affinity of spike to ACE2 (Clark et al. 2021). S:F486I has furthermore been associated with mink adaptation (Zhou et al. 2021). S:F490L has been observed to reduce the affinity of multiple mAbs as well as decrease the neutralisation sensitivity of pseudovirus to convalescent sera; however, it does not appear to have an impact on viral infectivity (Li et al. 2020). It is noteworthy that a large number of mutations described in this present study are associated with enhanced human ACE2 affinity, including Q493K, Q498R, and N501Y (Starr et al. 2020). When AA loci clustering was performed, recurrent DNMs at S:330 and S:440 were observed.

Finally, although most of this study has considered mutations in isolation, several of the late-stage long-term infections showed interesting combinations of mutations, particularly within spike (Fig. 5). Patient 19, for example, was an Alpha infection that had picked up a large number of mutations, many of which were in common with, or similar to, Omicron, for example S:A67D, S:G142V, S:T95I, S:Δ210/S:L212I, S:E484K, and S:Q498R. A further case, patient 17, also contained S:E484G and S:Q498R alongside the Alpha lineage-defining mutation S:N501Y, and patient 27 contained S:T95I, a further deletion in the S:Δ138 region, and S:G496S, in common with Omicron.
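The statement earlier in this section that no single nucleotide change converts glycine to lysine can be checked directly from the standard genetic code. The sketch below is a minimal illustration and not part of the study's analysis; the codon lists are those of the standard genetic code, and the function names are hypothetical.

```python
from itertools import product

# Codons from the standard genetic code (DNA alphabet).
GLU_CODONS = ["GAA", "GAG"]                 # glutamate (E), reference residue at S:484
GLY_CODONS = ["GGT", "GGC", "GGA", "GGG"]   # glycine (G)
LYS_CODONS = ["AAA", "AAG"]                 # lysine (K)

def hamming(codon_a, codon_b):
    """Number of nucleotide positions at which two codons differ."""
    return sum(x != y for x, y in zip(codon_a, codon_b))

def min_changes(source_codons, target_codons):
    """Fewest nucleotide changes needed to go from any source to any target codon."""
    return min(hamming(s, t) for s, t in product(source_codons, target_codons))

print("E -> K:", min_changes(GLU_CODONS, LYS_CODONS))
print("E -> G:", min_changes(GLU_CODONS, GLY_CODONS))
print("G -> K:", min_changes(GLY_CODONS, LYS_CODONS))
```

Both E484K and E484G are reachable from glutamate by one substitution, whereas glycine to lysine needs at least two, which is consistent with the interpretation that the two variants arose separately within the patient rather than one from the other.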

Figure 5.

Spike mutational profiles of particular interest described by this study. Select spikes from late sequencing of three long-term Alpha infections shown as Spike schematics. Spike variants from WT Alpha, Delta, and BA.1 Omicron shown for comparison. Mutations shown in grey are existing lineage-defining Alpha mutations. Mutations marked with an asterisk indicate mixed, but resolvable bases in the sequence.

S-gene N-terminal domain recurrent mutations

S:T95I has been shown to bind to the human tyrosine-protein kinase receptor UFO (AXL), and it has been suggested by Singh et al. (2021) that AXL facilitates SARS-CoV-2 cell entry to the same extent as ACE2 in AXL-overexpressing cell culture. The NTD also has a substantial role in the antigenicity of spike, with multiple escape mutations identified in this domain (Harvey et al. 2021). All recurrent deletions within the SARS-CoV-2 genome were observed within the NTD (S:Δ67 region/RDR1, S:Δ138/RDR2 region, and S:Δ243/RDR4 region). Deletions within the S:69–70 region are commonly observed (McCarthy et al. 2021; Meng et al. 2021). Meng et al. (2021) characterised the common S:Δ69–70 deletion as contributing to infectivity by improving incorporation of cleaved spike protein into virions and as possibly having a compensatory effect on mutations in the RBD associated with Ab escape, such as S:N439K and S:Y453F. Of the two observations of deletions within the S:67–70 region, one was S:Δ69–70 whereas the other was S:Δ67, which has not been commonly observed, but it is notable that the genome series in which S:Δ67 was observed already possessed S:Δ69 at day 0. S:Δ69–70 is also a defining mutation of the Alpha and Omicron variants and is responsible for the S-gene target failure observed in the PCR testing of Alpha variant samples with TaqPath SARS-CoV-2 PCR kits (Kidd et al. 2021). De novo occurrences of slightly differing deletions within the S:Δ138/RDR2 region were observed four times. This region makes up part of the ‘NTD antigenic supersite’, which the majority of neutralising antibodies against the NTD target (McCallum et al. 2021b). S:Δ140 has consequently been associated with a significant decrease in Ab neutralisation (Andreano et al. 2021; Liu et al. 2021). Based on the high number of occurrences, it appears likely that deletions in this region confer some benefit to the virus during long-term infections. As with S:N501Y and the S:Δ67 region, it is worth noting that a substantial proportion of long-term infections already carried deletions in the S:Δ138 region at day 0 due to being the Alpha variant. Two occurrences of S:Δ243, another NTD supersite deletion that has been demonstrated to decrease Ab neutralisation in vitro, were also observed (McCarthy et al. 2021; McCallum et al. 2021b).

S-gene SP recurrent mutations

The single recurrent SP DNM, S:S13I, has been previously shown to mediate a shift of the cleavage site of the SP which in turn facilitates immune evasion by causing a significant re-arrangement of the NTD antigenic supersite and its constituent internal disulphide bonding (McCallum et al. 2021a, 2021b).

E-gene recurrent mutations

The most frequent DNM observed outside of the spike gene is E:T30I (the second most frequent mutation overall after S:E484X). This mutation was observed by Chaudhry et al. (2020) in a cell-culture passage experiment, where it conferred a growth advantage in Calu-3 cells but slowed growth in Vero E6 cells. The high frequency of E:T30I is strongly suggestive of a selective pressure during long-term infections and further suggests that the selective environment experienced by the virus in immunocompromised patients may resemble that of cell culture, potentially due to the absence of any requirement for the stability needed for transmission. The significant enrichment of E:T30I in this analysis compared to the COG-UK dataset (Table 1) suggests that E:T30I may be a deleterious mutation within the circulating SARS-CoV-2 population. A single variant lineage, B.1.616, does contain E:T30I as a lineage-defining mutation. Interestingly, B.1.616 was associated with an extremely localised, largely nosocomial outbreak, suggesting the possibility that this may have been the emergence of a virus from a long-term infection (Fillâtre et al. 2021). This also raises the hypothetical possibility that E:T30I may be considered a marker of long-term SARS-CoV-2 infections. Further study is necessary to determine the phenotypic effect of this mutation and its role in influencing within- and between-host fitness.

ORF1ab-NSP3 recurrent mutations

Literature concerning mutations in ORF1ab is generally observational rather than experimental due to the current lack of tractable models to study them in vitro. The concentration of higher-frequency mutations within the NSP3 gene is not surprising, considering it is the largest gene within the ORF1ab polyprotein and encodes a bulky, modular protein that may have some flexible linker regions which are fairly hypermutable. Stanevich et al. (2021) identified NSP3:T504P as a mutation associated with immune escape at a cytotoxic T cell epitope.

Conclusions

This work sought to determine recurrent mutations across the SARS-CoV-2 genome associated with long-term infections in immunodeficient patients. This study has several notable limitations. Importantly, a significant publication bias is likely to be present, which may overemphasise the importance of some mutations. S:E484K is especially affected by this: the six genome series obtained from Jensen et al. (2021) were published to demonstrate the emergence of S:E484K within immunocompromised patients. Further work will attempt to avoid this by using less-biased sampling strategies, which will require a prospective study design that regularly samples genomes from long-term infected patients. Another potential limitation is the use of the COG-UK dataset (Nicholls et al. 2021) as a background dataset, considering that only ten out of twenty-eight patients were located within the UK (Table 1). The COG-UK dataset is limited to SARS-CoV-2 genomes collected within the UK, but was still used due to the richness of its associated metadata as well as the programmatic access to variant database information provided via CLIMB-COVID (Nicholls et al. 2021). It is also likely that some DNMs arose before the day 0 genome of a genome series was sampled, but without earlier genome sequences it is difficult to judge whether any observed non-lineage-defining mutation arose within the patient or prior to their infection. The majority of recurrently observed DNMs have been associated with immune escape, increased ACE2 affinity, or improved viral packaging and are generally not highly prevalent within the wider SARS-CoV-2 population (with the exception of some SARS-CoV-2 variants). Many recurrent DNMs identified in this work have also been observed in experiments investigating spike selection in various models, as well as in efforts to identify immune escape mutations.
These factors suggest that the conditions during long-term infections at least partly select for mutations which aid the virus with intra-host replication (cell–cell transmission) and persistence, as opposed to the general SARS-CoV-2 population, where mutations which aid inter-host transmission are more strongly selected for. E:T30I in particular is worthy of further study as a potential marker of long-term SARS-CoV-2 infections. However, the large number of observed DNMs that overlap with variant-defining mutations does indicate that patients within this category should not be discounted as a potential source of previous, or indeed future, variants. Mutations which aid cell–cell transmission within the host or improve viral packaging have the potential to affect virulence, and any mutations within this category which do not impact viral transmissibility could have a significant impact. This is highly relevant as many of the most abundant mutations described in this dataset are found across many variant lineages. Furthermore, it is possible that sub-neutralising levels of antibodies, which may be present in some cases (either homologous or from heterologous convalescent or monoclonal antibody treatments), could be selecting for the antigenic mutations observed (Kemp et al. 2021). At present it is unresolved where SARS-CoV-2 variants emerge from. One prevailing hypothesis is that some variants emerged from long-term chronic infections, generating novel advantageous combinations of mutations without the stringent selection pressure of transmission, eventually resulting in an outbreak and onward transmission. We have compared common mutations arising during chronic infections and described how many are shared with SARS-CoV-2 variant lineages. Furthermore, we present evidence, based on a rare mutational signature, that the French B.1.616 variant lineage arose from a direct and recent spillover from a chronic infection.
Overall, the data presented here are consistent with and supportive of the chronic infection hypothesis of SARS-CoV-2 variant emergence. Therefore, we suggest that identifying and curing chronic infections, preferably with combined antiviral therapy as would be used for more traditionally chronic viruses such as Human Immunodeficiency Virus (HIV) and Hepatitis C Virus (HCV), is important both to the infected individual and to global health. Intra-host variation of SARS-CoV-2 is likely to play a significant role within this patient group; however, the lack of raw data availability for the majority of the samples within this dataset makes this challenging to study (Chaudhry et al. 2020). We anticipate this dataset will be maintained as a public resource for as long as it is deemed relevant, to enable the study of long-term SARS-CoV-2 infections in immunodeficient patients and to allow other researchers to contribute to the understanding of this understudied, highly important patient group (https://github.com/BioWilko/recurrent-sars-cov-2-mutations/blob/main/dataset/mutation_calls.csv).
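As an illustration of how recurrent DNMs might be tallied from a table of mutation calls, the following minimal sketch counts, for each mutation, the number of patients in which it appears after day 0 but not at day 0. The column names (`patient_id`, `day`, `mutation`), the function name `recurrent_de_novo`, and the inline rows are all hypothetical toy values; the actual layout of `mutation_calls.csv` is not described in this section and may differ.

```python
import csv
import io
from collections import defaultdict

# Toy data with an ASSUMED layout: one row per mutation call per sample.
# These rows are invented for illustration and are not from the dataset.
EXAMPLE_CSV = """patient_id,day,mutation
P1,0,S:N501Y
P1,40,S:N501Y
P1,40,E:T30I
P2,0,S:D614G
P2,25,E:T30I
P2,25,S:E484K
P3,0,S:D614G
P3,60,S:E484K
"""


def recurrent_de_novo(rows, min_patients=2):
    """Return {mutation: patient count} for mutations arising after day 0
    in at least `min_patients` different patients."""
    baseline = defaultdict(set)  # mutations already present at day 0
    later = defaultdict(set)     # mutations seen at any later time point
    for row in rows:
        target = baseline if int(row["day"]) == 0 else later
        target[row["patient_id"]].add(row["mutation"])

    counts = defaultdict(int)
    for patient, mutations in later.items():
        # A mutation counts as de novo only if absent from the day 0 genome.
        for mutation in mutations - baseline[patient]:
            counts[mutation] += 1
    return {m: n for m, n in counts.items() if n >= min_patients}


rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
print(recurrent_de_novo(rows))
```

Counting patients rather than samples avoids inflating the tally when the same mutation is sequenced at several time points in one individual. As noted in the limitations, a mutation present at day 0 cannot be distinguished from one acquired before sampling began, so this sketch simply excludes it.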
