Joel O Wertheim1, Jade C Wang2, Mindy Leelawong3, Darren P Martin4, Jennifer L Havens5, Moinuddin A Chowdhury3, Jonathan E Pekar5, Helly Amin3, Anthony Arroyo3, Gordon A Awandare6, Hoi Yan Chow3, Edimarlyn Gonzalez3, Elizabeth Luoma7, Collins M Morang'a6, Anton Nekrutenko8, Stephen D Shank9, Stefan Silver3, Peter K Quashie6, Jennifer L Rakeman3, Victoria Ruiz3, Lucia V Torian3, Tetyana I Vasylyeva10, Sergei L Kosakovsky Pond9, Scott Hughes3. 1. Department of Medicine, University of California San Diego, La Jolla, CA, USA. jwertheim@health.ucsd.edu. 2. New York City Public Health Laboratory, New York City Department of Health and Mental Hygiene, New York, NY, USA. jwang7@health.nyc.gov. 3. New York City Public Health Laboratory, New York City Department of Health and Mental Hygiene, New York, NY, USA. 4. Institute of Infectious Diseases and Molecular Medicine, University of Cape Town, Cape Town, South Africa. 5. Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, USA. 6. West African Centre for Cell Biology of Infectious Pathogens (WACCBIP), University of Ghana, Accra, Ghana. 7. Bureau of the Communicable Diseases, New York City Department of Health and Mental Hygiene, Long Island City, NY, USA. 8. Department of Biochemistry and Molecular Biology, The Pennsylvania State University, State College, PA, USA. 9. Department of Biology, Temple University, Philadelphia, PA, USA. 10. Department of Medicine, University of California San Diego, La Jolla, CA, USA.
Abstract
Recombination is an evolutionary process by which many pathogens generate diversity and acquire novel functions. Although a common occurrence during coronavirus replication, detection of recombination is only feasible when genetically distinct viruses contemporaneously infect the same host. Here, we identify an instance of SARS-CoV-2 superinfection, whereby an individual was infected with two distinct viral variants: Alpha (B.1.1.7) and Epsilon (B.1.429). This superinfection was first noted when an Alpha genome sequence failed to exhibit the classic S gene target failure behavior used to track this variant. Full genome sequencing from four independent extracts reveals that Alpha variant alleles comprise around 75% of the genomes, whereas the Epsilon variant alleles comprise around 20% of the sample. Further investigation reveals the presence of numerous recombinant haplotypes spanning the genome, specifically in the spike, nucleocapsid, and ORF 8 coding regions. These findings support the potential for recombination to reshape SARS-CoV-2 genetic diversity.
Recombination is an evolutionary process by which many pathogens generate diversity and acquire novel functions. Although a common occurrence during coronavirus replication, detection of recombination is only feasible when genetically distinct viruses contemporaneously infect the same host. Here, we identify an instance of SARS-CoV-2 superinfection, whereby an individual was infected with two distinct viral variants: Alpha (B.1.1.7) and Epsilon (B.1.429). This superinfection was first noted when an Alpha genome sequence failed to exhibit the classic S gene target failure behavior used to track this variant. Full genome sequencing from four independent extracts reveals that Alpha variant alleles comprise around 75% of the genomes, whereas the Epsilon variant alleles comprise around 20% of the sample. Further investigation reveals the presence of numerous recombinant haplotypes spanning the genome, specifically in the spike, nucleocapsid, and ORF 8 coding regions. These findings support the potential for recombination to reshape SARS-CoV-2 genetic diversity.
Recombination is a common evolutionary feature of positive-strand RNA viruses[1]. It can increase genetic diversity and accelerate adaptation in viral populations by combining existing linked allelic variation. The signature of frequent recombination is pervasive across Betacoronaviruses in bats and other animal hosts[2-6], and its detection is made easier in part by the substantial genetic divergence separating these various coronaviruses. When an individual is contemporaneously infected with two genetically distinct strains of a virus, so-called superinfection[7], these viruses can recombine to produce a virus with novel allelic combinations. Although recombination is expected to regularly occur during SARS-CoV-2 infections, it can be difficult to detect in vivo unless it involves genetically distinguishable parental strains: recombination between two identical or nearly identical genomes leaves no detectable molecular trace. Furthermore, as with influenza virus[8,9], SARS-CoV-2 superinfections have been only rarely reported in humans[10-12], likely due to the short mean duration of SARS-CoV-2 infections.Early claims of recombination in SARS-CoV-2[13] may have been confounded by sequencing errors or convergent evolution[14]. As the COVID-19 pandemic progressed, and genetically divergent lineages have evolved, evidence for recombination between these lineages is becoming more convincing[15,16]. Many of these divergent lineages include highly transmissible variants distinguished by S (spike) genes that, under positive selection, have accumulated multiple mutations associated with increased transmissibility, virulence, and immune escape[17,18]. All of these variants are descended from the B.1 lineage, defined by four mutations: C241T, C3037T, C14408T, and A23403G. This lineage arose in China and spread to Europe in late-January 2020 and the United States in February 2020[19].Here, we provide a detailed characterization of an instance of superinfection from January 2021, identified by the New York City (NYC) Department of Health and Mental Hygiene (DOHMH). We show that this individual was superinfected with two SARS-CoV-2 variants: Alpha (B.1.1.7) and Epsilon (B.1.429)[20,21]. Further, we characterize evidence for recombination occurring within this superinfected individual, providing an in vivo snapshot of this evolutionary process within SARS-CoV-2.
Results
Index case and named contact partner epidemiology
In December 2020, researchers and public health officials in the United Kingdom identified a rapidly spreading SARS-CoV-2 variant within England, then designated as PANGO lineage B.1.1.7[21], now designated as the Alpha variant of concern in the WHO nomenclature. In NYC, a SARS-CoV-2 genome sequence classified as belonging to the Alpha lineage was obtained from a sample on 4 January 2021 (the ‘index case’): NYCPHL-002130 (GISAID accession number EPI_ISL_857200). Due to the potential public health importance of Alpha variant cases in NYC in early 2021, NYC DOHMH conducted a public health investigation related to the individual from which this sample had been obtained. This investigation determined that the individual had recently traveled to Ghana (late December/early January) and developed symptoms consistent with COVID-19 while in Ghana. Contact tracing in New York City identified another case of an Alpha variant infection, sampled on 14 January 2021, in a named contact with a similar travel history (the ‘named contact partner’): NYCPHL-002461 (GISAID accession EPI_ISL_883324). The named contact partner had also developed symptoms consistent with COVID-19 while in Ghana, prior to returning the United States.
Alpha (B.1.1.7) variant PCR screening
Typical of the Alpha variant[21], NYCPHL-002130 from the index case exhibited S gene target failure (SGTF) phenotype with the TaqPath COVID-19 RT-PCR assay (Table 1). NYC PHL uses the ARTIC amplicon-based protocol V3 to sequence full viral genomes and capture intra-host diversity. All 24 mutations diagnostic of the Alpha variant were found in >90% of reads (Table 2). The viral genome from this index case showed limited intra-host viral diversity (Fig. 1). A single variable site was found at position 23099, with C in 20.4% of reads and A in 79.6% of reads.
Table 1
Cycle threshold (Ct) values from TaqPath assays from index case (NYCPHL-002130) and named contact partner (NYCPHL-002461).
Case
WGS ID
Ct value
ORF1ab
N gene
S gene
Index case
NYCPHL-002130
14.97
15.44
N/A
Named contact partner
NYCPHL-002461-A
14.86
15.70
17.34
NYCPHL-002461-B
16.06
16.251
18.68
NYCPHL-002461-C
N/A
N/A
N/A
NYCPHL-002461-D
15.73
15.83
18.35
Table 2
Lineage defining mutations and their frequency in the index case (NYCPHL-002130) and named contact partner (NYCPHL-002461).
Lineage
Defining mutations
Frequency
NYCPHL-002130
NYCPHL-002461
B.1
C241T
0.940
0.916
C3037T
0.939
0.934
C14408T
0.966
0.931
A23403G
0.949
0.945
Alpha
C913T
0.946
0.736
C3267T
0.973
0.755
A5388C
0.949
0.761
C5986T
0.983
0.771
T6954C
0.977
0.787
ORF1a: Δ3675-3677
0.983
0.806
C14676T
0.945
0.625
C15279T
1.000
0.772
T16176C
0.985
0.785
S: Δ69-70
1.000
0.983
S: Δ144
0.990
0.978
A23063T
0.944
0.776
C23271A
0.968
0.737
C23604A
0.902
0.711
C23709T
0.939
0.780
T24506G
0.971
0.766
G24914C
0.947
0.751
C27972T
0.978
0.706
G28048T
0.982
0.723
A28111G
0.978
0.718
G28280C
0.945
0.726
A28281T
0.963
0.727
T28282A
0.963
0.736
C28977T
0.962
0.704
Epsilon
C1059T
0.000
0.214
A12878G
0.000
0.172
G17014T
0.000
0.169
G21600T
0.000
0.170
G22018T
0.000
0.124
T22917G
0.000
0.181
T24349C
0.000
0.164
G25563T
0.000
0.178
C26681T
0.000
0.182
G27890T
0.000
0.235
A28272T
0.000
0.000
C28887T
0.000
0.227
Δ in-frame codon deletion.
Fig. 1
The distribution of allelic frequencies in the index case (NYCPHL-002130) and named partner with suspected superinfection (NYCPHL-002461).
Frequencies of individual alleles shown as ticks, a smoothed kernel density plot is used to highlight clustering patterns, and colors represent allele types.
Cycle threshold (Ct) values from TaqPath assays from index case (NYCPHL-002130) and named contact partner (NYCPHL-002461).Lineage defining mutations and their frequency in the index case (NYCPHL-002130) and named contact partner (NYCPHL-002461).Δ in-frame codon deletion.
The distribution of allelic frequencies in the index case (NYCPHL-002130) and named partner with suspected superinfection (NYCPHL-002461).
Frequencies of individual alleles shown as ticks, a smoothed kernel density plot is used to highlight clustering patterns, and colors represent allele types.
Atypical Alpha (B.1.1.7) variant PCR screening
During the initial PCR screening of the sample collected from the named contact partner (NYCPHL-002461-A), the SGTF characteristic of the Alpha variant was not observed (Table 1). Furthermore, genome sequencing revealed substantial intra-host viral diversity within the viral genome, a possible signature of superinfection (Fig. 1). To confirm that this intra-host diversity was not attributable to experimental or sequencing artifacts, the original sample was re-extracted and re-sequenced (NYCPHL-002461-B) and similar SGTF was observed. Additional extractions were then performed in duplicate from the original stock (NYCPHL-002461-C and -D) and sequenced. The same signature of intra-host diversity was confirmed in all four sequenced extractions. Four nucleotide (nt) substitutions differentiating this sequence from the reference genome were identified at >90% frequency: C241T, C3037T, C14408T, and A23403G (Fig. 1; Table 2). These four substitutions were all present in the lineage B.1 virus that is ancestral to the named SARS-CoV-2 variants. Numerous additional substitutions, including A23063T (S N501Y), were present, but at slightly lower frequencies. Nonetheless, this genome was classified as an Alpha variant. Notably, the Δ69/70 and Δ144 deletions were found at >97% in the sequencing reads, despite the lack of SGTF.NYCPHL-002461-A, -B, and -D extracts exhibited low Ct values for the ORF1ab and N gene targets, ranging between 15 and 16 (Table 1). The S gene target Ct values were around 2 to 3 cycles higher. The difference suggests a reduction of viral template in the S gene target region, but not SGTF. We note NYCPHL-002461-C yielded an invalid result, as the TaqPath assay showed no amplification on all targets, including the MS2 phage extraction-control target.
Intra-host diversity
The presence of multiple intermediate frequency alleles and the lack of SGTF in the TaqPath assay prompted us to investigate the intra-host diversity in the named contact partner, NYCPHL-002461. Using the previously described and validated Galaxy SARS-CoV-2 allelic variation pipeline[22], we identified four categories of allelic frequencies: shared, major strain, minor strain, and other (see Fig. 1, interactive notebook at https://observablehq.com/@spond/nyc-superinfection). The four replicate sequencing runs for NYCPHL-002461 yielded remarkably similar patterns of these allelic frequencies.Alleles that fell into the shared category were present at 90% allele frequency in three or more samples. Shared alleles included all four substitutions characteristic of B.1 (Table 2) and two deletions in the S gene (Δ69-70 and Δ144) diagnostic of the Alpha variant.Major strain occurred at frequencies between 60 and 80% (in at least 3 samples). Major alleles included all 21 substitutions defining the Alpha variant, which we observed at a median allele frequency of 74.1%, and ORF1A deletion (Table 2). The remaining major alleles are shared with genome from the index case.Minor strain alleles occurred at frequencies between 10 and 25% (in at least 3 samples). All but one of the 12 diagnostic Epsilon mutations was found in this set: A28272T is absent in NYCPHL-002461. All remaining minor alleles have been observed in other Epsilon genomes.The “other” category encompasses all other variable sites, i.e. those occurring at frequency between 25 and 60% or those found in only one or two samples. The two alleles were found in all four replicate sequences at intermediate frequencies: G7723A (30.3%) and C23099A (46.7%). These frequencies are suggestive of intra-host variation in the major strain.In contrast to the allelic mixture detected in the named partner (NYCPHL-002461), we observed allele frequencies >90% for all Alpha defining mutations in the sequencing data for the index case, NYCPHL-002130 (Table 2). The C23099A mutation, which was at intermediate frequency in NYCPHL-002461 from the named contact partner, was present at 88.1% in NYCPHL-002130 from the index case, consistent with the transmission of a mixed viral population between these individuals.
Phylogenetic inference with major and minor variants
We identified sub-clades within Alpha and Epsilon that shared substitutions with the major and minor strains (Fig. 2). We inferred a maximum likelihood (ML) phylogenetic tree in IQTree2 for the major strain and 3655 related Alpha (B.1.1.7) genomes containing the C2110T, C14120T, C19390T, and T7984C substitutions found in the major strain (Fig. 2A). We also inferred an ML tree for the minor strain and 2275 related Epsilon (B.1.429) genomes containing the C8947T, C12100T, and C10641T substitutions found in the minor strain (Fig. 2C).
Fig. 2
Phylogenetic consistency of major and minor variants.
A Phylogeny of Alpha variant immediate relatives. B Root-to-tip regression for Alpha variant. C Phylogeny of Epsilon variant immediate relatives. D Root-to-tip regression for Epsilon variant. NY-NYCPHL-002461 is the genome deposited in GISAID from the case of putative superinfection. NY-NYCPHL-002130 is the genome from the index case.
Phylogenetic consistency of major and minor variants.
A Phylogeny of Alpha variant immediate relatives. B Root-to-tip regression for Alpha variant. C Phylogeny of Epsilon variant immediate relatives. D Root-to-tip regression for Epsilon variant. NY-NYCPHL-002461 is the genome deposited in GISAID from the case of putative superinfection. NY-NYCPHL-002130 is the genome from the index case.Root-to-tip regression analyses show that the NYCPHL-002461 sampling date is consistent with the molecular clock for both the major and minor strain sequences (Fig. 2B/D), indicating that one would expect viruses of this degree of genetic divergence to have been circulating in mid-January 2021. In fact, genomes identical to the major variant were sampled in both NYC (the NYCPHL-002130 index case) and in Ghana on 8 January 2021 (EPI_ISL_944711), consistent with a scenario in which this particular Alpha virus was acquired in Ghana. These three viruses share a common ancestor around 4 January 2021 and are separated from additional viruses sampled in Ghana by two mutations: C912T and C23099A. Notably, the latter mutation appears at intermediate frequency in both NYCPHL-002130 and NYCPHL-002461.The minor variant is genetically distinct from all other sampled genomes, including any genome sequenced by NYC DOHMH (Fig. 2C). The closest relatives were sampled in California (EPI_ISL_3316023, EPI_ILS_1254173, EPI_ISL_2825578), the United Kingdom (EPI_ILS_873881), and Cameroon (EPI_ISL_1790107, EPI_ISL_1790108, EPI_ISL_1790109). The most similar of these relatives is EPI_ISL_3316023, which was sampled on 11 January 2021 in California and represents the direct ancestor of the minor variant on the phylogeny. The only mutation separating this California genome from the minor variant is T28272A, which is a reversion away from an Epsilon-defining mutation (Table 2).It is unlikely that this minor variant is a laboratory contaminant, as there are no closely related Epsilon genomes sequenced from NYC. That said, NYC represents the probable source of this Epsilon virus. Of the 145 SARS-CoV-2 genomes sequenced by NYC public health surveillance between 10 January 2020 through 16 January 2020, 4 (2.8%) were Epsilon. A similar proportion of Epsilon genomes deposited in GISAID were sampled by other labs during this same period: 11 out of 431 genomes (2.6%)[23]. No Epsilon genome has been reported to date from Ghana.
Four-gamete tests of recombination
A preliminary inquiry of the genome sequencing data from the S gene (12 contiguous read fragments) and N gene (nucleoprotein; 3 contiguous read fragments) regions was suggestive of recombinant genome fragments within the named contact partner. To determine whether pairs of polymorphic sites within individual read fragments displayed evidence of recombination we employed three different four-gamete based recombination detection tests: PHI[24], MCL, and R2 vs Dist[25] (Table 3). The power of each of these tests to detect recombination was seriously constrained by the short lengths of the read fragments and the low numbers of both variant-defining sites and other polymorphic sites with minor allele frequencies >1% within each of the fragments. Only three of the 15 read fragments (read fragments 6 and 8 in the S gene and read fragment 3 in the N-gene) encompassed two or more of the variant-defining sites that were expected to provide the best opportunities to detect recombination. Nevertheless, pairs of sites within four read fragments in the S gene (positions 23123–24467 covering fragments 7, 8, 9 and 10) and one read fragment in the nucleoprotein gene (positions 28986–29378 covering fragment 3) exhibited signals of significant phylogenetic incompatibility with at least two of the three tests (p < 0.05): signals which are consistent with recombination. The only read fragment for which evidence of recombination was supported by all three tests was fragment 3 in the N gene: a fragment that was one among only three that contained multiple variant-defining substitutions. Eight of the fifteen analyzed read-fragment alignments exhibited no signals of recombination using any of the tests, which is unsurprising given the lack within these fragments of both variant-defining substitutions and polymorphic sites with minor allele frequencies greater than 1%.
Table 3
Four gamete recombination test results for 15 sets of aligned read fragments in the S (spike) and N (nucleoprotein) genes.
Gene
Fragment
Start1
End1
Recombination test p-value
PHI
MCL
R2 vs Dist
S
1
21354
21730
>0.9
>0.9
>0.9
2
21658
22038
>0.9
>0.9
>0.9
3
21962
22346
>0.9
>0.9
>0.9
4
22263
22650
0.682
0.393
0.132
5
22517
22903
>0.9
0.401
0.142
6
22798
23212
<0.001
0.355
0.3
7
23123
23522
>0.9
<0.001
0.005
8
23444
23847
<0.001
0.317
0.008
9
23790
24169
>0.9
0.047
0.03
10
24079
24467
>0.9
0.003
<0.001
11
24392
24789
<0.001
0.071
0.477
12
24697
25076
0.741
0.155
0.208
N
1
28395
28779
0.491
0.482
0.459
2
28678
29063
>0.9
0.868
0.628
3
28986
29378
<0.001
<0.001
<0.001
1Position relative to the Wuhan-Hu-1 reference strain.
Four gamete recombination test results for 15 sets of aligned read fragments in the S (spike) and N (nucleoprotein) genes.1Position relative to the Wuhan-Hu-1 reference strain.
Targeted sequencing for recombination detection
The four gamete tests on genomic sequencing data is limited by the short length of amplified fragments. To obtain data from longer sequence fragments, we PCR-amplified three regions of the genome from the original nucleic acid extracts, cloned them, and then sequenced individual clones. These longer genomic fragments provide greater resolution for detecting recombination, compared with the short fragments from deep sequencing analysis, because they include more differentiating sites spread out farther across the genome.The longest cloned region spanned 947 nt within the S gene (positions 22904–23850) and contained 5 nt substitutions differentiating the major and minor strains plus a variable site in the major variant. Of the 104 clones sequenced within this region, 60 (57.7%) were major strain haplotypes, 13 (12.5%) were minor strain haplotypes, whereas the remaining 31 clones (29.8%) contained both major and minor strain mutations, consistent with recombination (Fig. 3). We observed 11 distinct combinations of major and minor strain mutations across these clones, with two distinct haplotypes present in 6 clones apiece. Most recombinant haplotypes (n = 24) are consistent with only a single recombination breakpoint. However, 7 clones are consistent with 2 breakpoints (representing 3 different haplotypes), and 1 clone is consistent with 3 distinct breakpoints.
Fig. 3
Major, minor, and mixed haplotypes in the 947 nucleotide S (spike) gene cloned sequences.
Each row represents a sequenced clone (n = 104). Colored markings denote mutations from the reference genome. Major strain mutations are those found in the Alpha variant. Minor strain mutations are those found in Epsilon variant. Other mutations are found at intermediate or low frequencies. Shared mutations are those shared by B.1 viruses.
Major, minor, and mixed haplotypes in the 947 nucleotide S (spike) gene cloned sequences.
Each row represents a sequenced clone (n = 104). Colored markings denote mutations from the reference genome. Major strain mutations are those found in the Alpha variant. Minor strain mutations are those found in Epsilon variant. Other mutations are found at intermediate or low frequencies. Shared mutations are those shared by B.1 viruses.The second cloned S region spanned 657 nt in the S gene (positions 21442–22098) including the Δ69–70 and Δ144 deletions characteristic of the major strain and two 2 substitutions in the minor strain. Of the 93 clones sequenced, 69 (74.1%) were major strain haplotypes, 17 (18.3%) were minor strain haplotypes, and 7 (7.5%) were mixed haplotypes (Fig. 4). Five of these mixed haplotypes contained only one of the two deletions. One mixed haplotype was consistent with multiple recombination breakpoints. Unlike in the primary sequencing analyses where the Δ69–70 and Δ144 deletions were present in >98% of sequences, Δ69-70 was observed in only 72 (77.4%) clones and Δ144 was observed in only 71 (76.3%). These frequencies are consistent with the frequency of the other major strain substitutions in the primary sequencing analysis.
Fig. 4
Major, minor, and mixed haplotypes in the 657 nucleotide S (spike) gene cloned sequences.
Each row represents a sequenced clone (n = 93). Colored markings denote mutations from the reference genome. Major strain mutations are those found in the Alpha variant. Minor strain mutations are those found in Epsilon variant. Other mutations are found at intermediate or low frequencies.
Major, minor, and mixed haplotypes in the 657 nucleotide S (spike) gene cloned sequences.
Each row represents a sequenced clone (n = 93). Colored markings denote mutations from the reference genome. Major strain mutations are those found in the Alpha variant. Minor strain mutations are those found in Epsilon variant. Other mutations are found at intermediate or low frequencies.The third, and shortest, cloned region spanned 476 nt of ORF8 (positions 27798–28273), surrounding 4 substitutions defining the major strain and 1 minor strain substitution. Of the 36 cloned sequences, 30 (83.3%) had the major strain haplotype, 2 (5.6%) had the minor variant haplotype, and 4 (11.1%) had mixed haplotypes consistent with a single recombination breakpoint (Fig. 5). Note the discriminating substitutions only span 223 nt of this region.
Fig. 5
Major, minor, and mixed haplotypes in the 476 nucleotide ORF8 cloned sequences.
Each row represents a sequenced clone (n = 36). Colored markings denote mutations from the reference genome. Major strain mutations are those found in the Alpha variant. Minor strain mutations are those found in Epsilon variant. Other mutations are found at intermediate or low frequencies.
Major, minor, and mixed haplotypes in the 476 nucleotide ORF8 cloned sequences.
Each row represents a sequenced clone (n = 36). Colored markings denote mutations from the reference genome. Major strain mutations are those found in the Alpha variant. Minor strain mutations are those found in Epsilon variant. Other mutations are found at intermediate or low frequencies.Three cloned sequences from the 947 nt S gene fragment contained single nucleotide deletions resulting in non-sense mutations. In the 657 nt S gene fragment, we observed 8 clones with similar deletions, detected in both the forward and reverse direction during sequencing. These deletions were seen in the non-recombinant Alpha and Epsilon haplotypes and likely reflect non-functional viral particles, expected to constitute a substantial fraction of genomes within an infected individual[26,27].
Consistency between cloning and genome sequencing analyses
In vitro recombination can be introduced by reverse-transcription and PCR amplification, which are part of both genome sequencing and cloning protocols[28]. These in vitro effects have a strong stochastic component and would result in substantially different recombinant haplotype frequencies across different extracts and PCR experiments. To determine the extent to which these protocols could have led to biased inference of recombination, we compared the haplotype frequencies across the four extracts from NYCPHL-002461, which had each independently been subjected to reverse transcription and PCR amplification, and the frequency of these haplotypes in the cloning experiment, which included PCR amplification.Within the 947 nt cloned S gene fragment, the major haplotype was present between 76.4% and 78.6%, and the minor haplotype was between 13.7% and 15.4% (Supplementary Table 1). The recombinant haplotype positions 23604 A and 23709 C was present at 3.9% allele frequency (standard deviation of 0.34% across extracts), whereas recombinant haplotype 23604 C and 23709 T was present at 4.3% (standard deviation of 0.37% across extracts). Although the haplotype frequencies among extracts were significantly different (p = 0.029; chi-square test), the magnitude of these differences were unremarkable. Furthermore, there was no significant difference between the frequency of these haplotypes in cloning experiment and extracts (p = 0.190 versus -A; p = 0.189 versus -B; p = 0.357 versus -C; p = 0.206 versus -D; Fisher’s Exact Test).A similar pattern was observed within the 476 nt cloned fragment in the ORF8 region, which included four discrimination sites: 27972, 28048, 28095, and 28111 (Supplementary Table 2). The predominant recombinant haplotypes were consistent across the four extracts, and the frequencies differed only slightly (p = 0.077; chi-square test). As in S, the frequency of these recombinant haplotypes in the cloning experiment was not significantly different from any of the extracts (p = 0.405 versus -A; p = 0.413 versus -B; p = 0.199 versus -C; p = 0.408 versus -D; Fisher’s exact test).Hence, in vitro recombination induced by either reverse-transcription or PCR amplification, does not appear to have been the dominant contributor to the recombinant haplotype distribution reported here.
Search for transmission of a circulating recombinant
To determine whether there was onward transmission of a recombinant descendent of these major and minor strains, we queried the 27,806 genomes sequenced by NYC public health surveillance and deposited to GISAID through 5 September 2021. We tested these genomes for mosaicism (3SEQ[29]; with Dunn-Sidak correction for multiple comparisons) of the major and minor strains; however, we were unable to reject the null hypothesis of non-reticulate evolution for any of these genomes. We also did not find any genomes in the PHL dataset with a superset of the identifying substitutions present in the major and minor variants (e.g., C912T and C27406G) among the genomes in the PHL dataset. There is no evidence of an Alpha/Epsilon recombinant that circulated in New York City.Since the Dunn-Sidak correction done in the 3SEQ analysis applies a conservative type-1 error threshold of 0.05, we reran the analysis using a more permissive threshold of 0.25 (see methods) and were able to reject the null hypothesis for a single genome (EPI_ISL_2965250; p = 2.24×10−6 and Dunn-Sidak corrected p = 0.117). Although this genome (Fig. 6) contains many of the mutations characteristic of the Alpha variant throughout the genome, it does not possess mutations unique to the major strain nor any Epsilon-specific mutations. Rather, within the putative recombinant regions, the EPI_ISL_2965250 genome has C8809T, C27925T, C28311T, and T28879G. All of these mutations are characteristic of the B.1.526 Iota-variant, prevalent in NYC in early 2021. Therefore, this genome is likely not a descendant of the major and minor strains. Instead it appears to be a recombinant descendant of Alpha and Iota viruses.
Fig. 6
Putative Alpha/Iota variant recombinant and the nucleotide variation present in the major, minor, and reference strains.
The distribution of the nucleotide variation found in the major, minor, Iota (B.1.526; EPI_ISL_1635735), and single putative recombinant (EPI_ISL_2965250) strains relative to the reference genome (Wuhan Hu-1; bottom gray sequence).
Putative Alpha/Iota variant recombinant and the nucleotide variation present in the major, minor, and reference strains.
The distribution of the nucleotide variation found in the major, minor, Iota (B.1.526; EPI_ISL_1635735), and single putative recombinant (EPI_ISL_2965250) strains relative to the reference genome (Wuhan Hu-1; bottom gray sequence).
Discussion
Here, we report evidence of intra-host recombination of SARS-CoV-2 within a single individual superinfected with Alpha and Epsilon viral variants during the second COVID-19 wave in New York City in early 2021. Because recombinant viruses can be successfully generated and transmitted[15] between humans, this finding underscores their potential relevance to the future of the COVID-19 pandemic.The presence of major and minor strains described within the superinfected individual are unlikely to be the result of bioinformatics error, contamination, or experimental artifacts. The degree of evolutionary divergence of each of the strains from other available SARS-CoV-2 genomes is consistent with other naturally co-circulating occurring viruses in mid-January 2021. Moreover, the major strain genome is identical to contemporaneously sampled genomes from both a named contact and strains circulating in the country from which they had both recently visited. No closely related genome to the minor strain was ever sequenced by NYC DOHMH, lessening the probability of a contaminated sample. Given the relatively low sequencing coverage in NYC in January 2021 and low prevalence of the Epsilon variant, around 2–3% in NYC at the time, it is not unexpected that a closely related genome would not be observed. Furthermore, the major and minor variants were both present in all four extractions of the two aliquots at similar frequencies, indicating that any contamination, if present, would need to have occurred in the original sample swab.The timing of this superinfection is important, because January 2021 was the peak of the second COVID-19 wave in NYC, a time when numerous variants were circulating and immediately prior to the vaccination roll-out campaign. Accordingly, January 2021 in NYC represents not only the height of potential for superinfection risk up to that point in the pandemic, but also a location where its existence would be most apparent due to the co-circulation of numerous genetically distinct viral variants.There remain unexplained patterns in the genome sequencing data from the named contact partner. Evidence of a major and minor strain was not apparent at the S deletions Δ69/70 and Δ144 in the genome sequencing, but the cloning analysis showed major and minor alleles at these sites at the expected frequencies. Therefore, it is possible that the ARTIC protocol preferentially sequenced templates containing these deletions, giving a false impression of their predominance in the genomic analysis. Also of interest is the A28272T mutation in the minor strain, which is either a reversion or sequencing artifact. If the base-call at position 28272 in the minor variant is erroneous, then the minor strain would be identical to a virus sampled contemporaneously in California, where the Epsilon variant was first discovered and likely originated[20].Laboratory induced recombination is a common artifact during reverse-transcription and PCR[30,31]. However, recombination is a pervasive feature of natural coronavirus infection, as it has been observed in bats, camels, and humans[2,15,32,33]. One would a priori expect to find recombinant viruses in a SARS-CoV-2 superinfected individual. Therefore, it is unlikely that the entirety of the signal for recombination reported here is due to reverse-transcription or PCR-induced recombination. A consistent signal for recombination was observed in the four whole genome sequencing analyses and in cloned-fragment analysis, all suggesting the same recombinant haplotypes present at high frequency.Our search for Alpha/Epsilon variant recombinants in NYC did not identify genomes that would suggest onward transmission of either of the major or minor strains derived here, or a recombinant offspring. This lack of onward transmission is not surprising, given that the initial index case was contacted by NYC DOHMH personnel and the named contact partner received a prompt COVID-19 diagnosis and was advised to self-isolate.It is likely that superinfection with SARS-CoV-2 is more common than has been described in the literature, especially given the documentation of circulating recombinant strains of the Alpha variant in the United Kingdom[15]. Recombinant virus can only be produced within an individual with multiple contemporaneous infections. That said, we caution against assuming superinfection before potential issues of contamination, poor-quality sequencing, or bioinformatics errors have been appropriately dealt with. Ideally, multiple specimens should be collected from the same the individual to enable validation of signal for intra-host variation, though feasibility of collecting multiple swabs may be a challenge. Additionally, evidence for recombination is most robust when multiple individuals with the same signal for recombination are identified[15].The high number and genomic variability recombinant haplotypes that we have identified within a single superinfected individual suggests that recombination is perpetually occurring within SARS-CoV-2 infections. Whether recombination will play a role in the emergence of novel SARS-CoV-2 variants is an open question. Many SARS-CoV-2 variants tend to possess the same mutations, although consistent with recombination, are more likely the product of convergent evolution[18]. In contrast, the BA.3 lineage within the Omicron variant is likely of recombinant offspring of BA.1 and BA.2[34]. However, the most probable mechanism driving the origin of variants like Alpha and Omicron remains host adaptation within chronically infected individuals[35,36]. Reduced incidence due to vaccine-induced and naturally-acquired immunity would lower the opportunity for superinfection, and the homogenizing effect of variant-driven selective sweeps (as seen in the Delta and Omicron variants[37]) will lessen the potential for biological innovation in a recombinant genome. Nonetheless, SARS-CoV-2 molecular surveillance should actively monitor for the emergence of a recombinant variant (e.g., within Omicron and Delta/Omicron recombinants[38-40]).
Methods
The UC San Diego IRB granted approval of this work as analysis of Human Subjects exempt surveillance data; SARS-CoV-2 public health surveillance data are collected without requiring individual consent.
Extraction and sequencing
Nasopharyngeal specimens positive for SARS-CoV-2 with Ct < 32 were submitted to NYC PHL for sequence analysis by the NYC DOHMH through the COVID Express clinics. The NYCPHL-002130 and -002461 specimens had Ct values of 19 and 20 cycles, respectively, which allowed for sequencing at NYC PHL. Each specimen was split into separate extraction and archive aliquots. Nucleic acid extraction was performed using the KingFisher Flex Purification System (Thermo Fisher Scientific) from the extraction aliquot. Eleven µL of extract was used to anneal with random hexamers and dNTPs (New England Biolabs Inc., NEB) and reverse transcribed with SuperScript IV Reverse Transcriptase at 42 °C for 50 min. The cDNA product was amplified in two separate multiplex PCRs with ARTIC V3 primer pools (Integrated DNA Technologies) in the presence of Q5 2x Hot Start Master Mix (NEB) at 98 °C for 30 s, and 35 cycles of 98 °C for 15 s and 65 °C for 5 min. The two PCR products were combined and were purified with Agencourt Ampure XP magnetic beads (Beckman Coulter) at a 1:1 sample-to-bead ratio. The bead-cleaned PCR products were quantified using a Qubit 3.0 fluorometer (Thermo Fisher Scientific). Standard protocol was used for library preparation in the NEBNext Ultra II Library Preparation workflow using 90 ng of PCR product (NEB). In short, the ARTIC PCR products were used in an end-repair reaction, which added a 5’-phosphate group and a dA-tail, in a reaction for 30 min at 20 °C. The reaction was heat inactivated for 30 min at 65 °C. NEBNext Adaptor was ligated at 25 °C for 30 min and cleaved by USER Enzyme at 37 °C for 15 min. The product was Agencourt Ampure XP bead-purified at a ratio of 0.6x sample:beads. The bead-cleaned, end-ligated amplicons were subjected to a 6-cycle PCR reaction with NEBNext Ultra II Q5 Master Mix in the presence of NEBNext Multiplex Oligos for Illumina (NEB), which added a sample-specific 8-base index and Illumina P5 and P7 adapters for sequencing on Illumina instruments. The product was purified with Ampure XP beads at a 0.6x sample:bead ratio and quantified, normalized and pooled at equimolar concentration with other libraries, followed by loading onto the Illumina MiSeq sequencing instrument using V3 600-cycle reagent kit, with a V3 flow cell for 250-cycle paired-end sequencing (Illumina). For NYCPHL-002461, the same “extraction” specimen aliquot was used for a second extraction, Extract B. Extracts C, and D were independent extractions, but from the “archived” specimen aliquot. As such, the first extract (A), and extracts B, C, and D were independent samples which underwent independent reverse transcription, ARTIC PCR, library preparation, and sequencing reactions. Genomic sequencing depth was similar across all extracts for both the index case and named contact partner (Supplementary Figure 1).Potential in vitro recombination that occurred during the four independent extractions, reverse-transcription reactions, and library preparation procedures, such as PCR amplification, would require the events to occur independently at the same stage four times in order to produce the same proportions of major and minor variant haplotypes in the high-throughput sequencing data. To account for in vitro recombination, regions where long complete reads span across major and minor variants in close proximity, <105 nucleotide bases, were examined across all genome alignments of the four NYCPHL-002461 replicates. Reads with SAM (Sequence Alignment Map) Flags 81,83,97,113,145,147,161,177,2129 are included in the analysis. Reads with other flags are excluded from the analysis or are not found in the alignment files. Additionally, reads without a combination of major or minor alleles are excluded. All unique haplotypes at major and minor variant positions are grouped together and quantified. Relative frequencies of the haplotypes are calculated for each region of all four extractions.
Cloning
Three regions of the SARS-CoV-2 genome from NYCPHL-002461 were cloned. Two contained non-overlapping regions of the S gene: a 947 nt fragment (positions 22904–23850) and 657 nt fragment (positions 21421–22098). The third region included 476 nt fragment of ORF8 (positions 27798–28273).To perform the annealing step for reverse transcription, 3 µl of the NYCPHL-002461 nucleic acid extract was combined with reverse primer and dNTPs at final concentrations of 154 nM and 769 µM, respectively. Annealing was performed by heating to 65 °C for 5 minutes then cooling to 4 °C. The reverse primer sequences used for the reverse transcription are as follows: 947 nt S fragment primer-R is 5ʹ-CTATTCCAGTTAAAGCACGGTTT, 657 nt S fragment is 5ʹ-AGGTCCATAAGAAAAGGCTGAGA and ORF8 476 nt fragment is 5ʹ-GAGACATTTAGTTTGTTCGTTTA. An elongation mix containing 200 units of SuperScript IV Reverse Transcriptase (Invitrogen) and 40 units of RNaseOUT Recombinant Ribonuclease Inhibitor (Invitrogen) was then added along with dNTPs at a final concentration of 200 µM. The resulting solution was heated to 55 °C for 10 minutes then 80 °C for 10 minutes.Each cDNA target was PCR amplified using Platinum II Taq polymerase and its accompanying buffer (Invitrogen) supplemented with a final concentration of 200 µM dNTPs and 600 nM of each primer. The reverse primer sequences used for reverse transcription were included along with their corresponding forward primers: 947 nt S fragment primer-F is 5ʹ-TCTTGATTCTAAGGTTGGTGGT, 657 nt S fragment is 5ʹ-AGGGGTACTGCTGTTATGTCT and ORF8 is 5ʹ-GCCTTTCTGCTATTCCTTGT. PCR was performed with the following cycling conditions: an initial hold at 94 °C for 2 minutes, followed by 35 cycles at 94 °C for 15 seconds, 60 °C for 15 seconds and 68 °C for 15 seconds. Cycling was followed by a final extension step at 72 °C for 7 minutes. The PCR products were run on 2% agarose gels, excised and gel purified with the GenElute Gel Extraction Kit (Sigma). The gel-purified PCR products were cloned using the TOPO TA Cloning Kit for Sequencing (Invitrogen).Individual colonies resulting from the transformation into chemically competent One Shot TOP10 Escherichia coli (Invitrogen) were picked and patch plated for sequencing. Rolling circle amplification and Sanger sequencing of the clones were performed by GeneWiz (New Jersey). All clones were sequenced with the M13 reverse primer. Due to its larger size, the 947 nt S fragment clones were also sequenced with the M13 forward primer.
Major and minor variant calling
We used the Galaxy SARS-CoV-2 variant calling pipeline for paired-end Illumina ARTIC amplicon data[22]. Briefly, the workflow performs quality control, masks primer sites, maps reads to reference using BWA-mem, calls variants using lofreq, annotates them using SNPEff, and outputs tabular variant call files, thresholded on minimum allele frequency of 0.05. These variants are further visualized in a custom ObservableHQ notebook (https://observablehq.com/@spond/nyc-superinfection). For Table 2, we explored the data for diagnostic alleles with frequency below 0.05, but found none.
Alignment and phylogenetic inference
SARS-CoV-2 genomes were downloaded from GISAID on 5 September 2021[41]. Genomes assigned to the B.1.1.7 or B.1.429 Pangolin lineage[42] were aligned to the Wuhan-Hu-1 reference genome using the --6merpair option in Mafft v7.464[43]. We further refined these alignments to include only those genomes sharing specific synapomorphies with the major and minor allelic variants from NYCPHL-002461: C2110T, C14120T, C19390T, and T7984C for B.1.1.7 viruses and C8947T, C12110T, and C10641T for B.1.429 viruses. Separate maximum likelihood phylogenetic trees for B.1.1.7 (n = 3655) and B.1.429 (n = 2275) were inferred in IQTree2 v2.1.3 under a GTR + F + I model, with the additional NNI search option and a minimum branch length of 1e-9 substitutions/site[44].
Molecular clock inference
To determine whether the major and minor allelic variants were contemporaneous with the date of sampling, we estimated clock trees for the B.1.1.7 and B.1.429 phylogenies in TreeTime v0.8.0[45]. We fixed the clock rate to 8 × 10−4 substitutions/site/year under a skyline coalescent model (per NextStrain default parameters for SARS-CoV-2). These trees were also used to infer the time to most recent common ancestor of the allelic variants and their closest relatives.
Four-gamete tests for recombination
When sequences evolve in the absence of both recombination and convergent mutations it is expected that, for any pair of polymorphic sites where one of the sites has either nucleotide X or Y and the other has either nucleotide A or B, no more than three of the four possible combinations of nucleotides (or gametes) at the two sites (i.e. XA, XB, YA and YB) should ever be observed. Given that in reality convergent mutations are always possible, four gamete tests of recombination attempt to detect situations where the numbers of site pairs where all four combinations of nucleotides are observed exceed that expected due to convergent mutations in the absence of recombination. We tested 15 multiple sequence alignments, each containing all observed unique read fragment sequences spanning the S-gene (12 fragments) and N-gene (3 fragments) with three different four gamete tests: (1) the PHI test (implemented in RDP5[24,41]) which considers sites with more than two alternative nucleotide states and uses a permutation-based test to determine whether detected site pairs displaying all four gametes display a degree of spatial clustering along the sequence that is significantly higher than would be expected in the absence of recombination; (2) the MCL recombination detection test (implemented in the pairwise component of the LDHat package[25]) which uses an approximate maximum likelihood method to infer the population scaled recombination rate needed to explain the observed numbers of site pairs with four gametes and then tests for significant deviation of the inferred recombination rate from zero using a permutation test, and (3) the RvsDist test (implemented in pairwise component of LDHat) which determines the correlation between the R2 measure of linkage disequilibrium between site pairs with four gametes with the physical distance in nucleotides between the site pairs[25] and uses a permutation test to detect significant deviations from the expected degree of correlation in the absence of recombination. For both the MCL and RvsDist tests we used a minor allele frequency cutoff of 0.01.
Population level recombination detection
We analyzed the 27,806 genomes sequenced by the NYC PHL and Pandemic Response Lab (PRL) from specimens collected from NYC residents. These genomes were aligned to the Wuhan-Hu-1 reference genome (Genbank accession NC_045512.2; https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2/) using MAFFT v7.453 (options --auto --keeplength -addfragments)[43].To determine whether there was any onward transmission of a major-minor strain recombinant, we used 3SEQ v.1.7[29] as a statistical test for recombination in the NYC data. 3SEQ interrogates triplets of sequences for signals of mosaicism in a sequence given two ‘parental’ sequences. We interrogated each of the 27,806 NYC PHL and PRL-generated genomes for mosaicism given the major and minor strains as parents. The resulting p-values are Dunn-Sidak corrected for multiple comparisons (n = 55,612), and we tested for mosaicism at p-value thresholds 0.05 and 0.25. The single nucleotide differences between a putative recombinant and the major and minor strains were visualized using snipit (https://github.com/aineniamh/snipit).
Authors: Leen Vijgen; Els Keyaerts; Elien Moës; Inge Thoelen; Elke Wollants; Philippe Lemey; Anne-Mieke Vandamme; Marc Van Ranst Journal: J Virol Date: 2005-02 Impact factor: 5.103
Authors: Ben Jackson; Maciej F Boni; Matthew J Bull; Amy Colleran; Rachel M Colquhoun; Alistair C Darby; Sam Haldenby; Verity Hill; Anita Lucaci; John T McCrone; Samuel M Nicholls; Áine O'Toole; Nicole Pacchiarini; Radoslaw Poplawski; Emily Scher; Flora Todd; Hermione J Webster; Mark Whitehead; Claudia Wierzbicki; Nicholas J Loman; Thomas R Connor; David L Robertson; Oliver G Pybus; Andrew Rambaut Journal: Cell Date: 2021-08-17 Impact factor: 41.582
Authors: Timothy E Schlub; Redmond P Smyth; Andrew J Grimm; Johnson Mak; Miles P Davenport Journal: PLoS Comput Biol Date: 2010-04-29 Impact factor: 4.475
Authors: Xianding Deng; Miguel A Garcia-Knight; Mir M Khalid; Venice Servellita; Candace Wang; Mary Kate Morris; Alicia Sotomayor-González; Dustin R Glasner; Kevin R Reyes; Amelia S Gliwa; Nikitha P Reddy; Claudia Sanchez San Martin; Scot Federman; Jing Cheng; Joanna Balcerek; Jordan Taylor; Jessica A Streithorst; Steve Miller; Bharath Sreekumar; Pei-Yi Chen; Ursula Schulze-Gahmen; Taha Y Taha; Jennifer M Hayashi; Camille R Simoneau; G Renuka Kumar; Sarah McMahon; Peter V Lidsky; Yinghong Xiao; Peera Hemarajata; Nicole M Green; Alex Espinosa; Chantha Kath; Monica Haw; John Bell; Jill K Hacker; Carl Hanson; Debra A Wadford; Carlos Anaya; Donna Ferguson; Phillip A Frankino; Haridha Shivram; Liana F Lareau; Stacia K Wyman; Melanie Ott; Raul Andino; Charles Y Chiu Journal: Cell Date: 2021-04-20 Impact factor: 41.582
Authors: Spyros Lytras; Joseph Hughes; Darren Martin; Phillip Swanepoel; Arné de Klerk; Rentia Lourens; Sergei L Kosakovsky Pond; Wei Xia; Xiaowei Jiang; David L Robertson Journal: Genome Biol Evol Date: 2022-02-04 Impact factor: 3.416
Authors: Michael Worobey; Jonathan Pekar; Brendan B Larsen; Martha I Nelson; Verity Hill; Jeffrey B Joy; Andrew Rambaut; Marc A Suchard; Joel O Wertheim; Philippe Lemey Journal: Science Date: 2020-09-10 Impact factor: 47.728
Authors: Chrispin Chaguza; Anne M Hahn; Mary E Petrone; Shuntai Zhou; David Ferguson; Mallery I Breban; Kien Pham; Mario A Peña-Hernández; Christopher Castaldi; Verity Hill; Wade Schulz; Ronald I Swanstrom; Scott C Roberts; Nathan D Grubaugh Journal: medRxiv Date: 2022-07-02